Sunday, May 21, 2006

Multiple Failures

Yesterday (Saturday) was a very bad day... today isn't much better, and tomorrow is looking pretty gloomy too.

Yesterday we had a major power failure at our main data center. That was the first failure. The UPSes all kicked in and started working, but the notification system that was supposed to tell everybody the UPSes were carrying the load failed, so nobody knew we were on UPS. Second failure. The generator kicks in as it is supposed to, but the fuel line from the main tank disconnects from the generator; the generator runs out of fuel and stops. Third failure. The UPS systems ran out after a few hours and everything went down as hard as hard can be.

As it was the weekend, user load was low, but when the systems went "blank" and didn't come back, the users noticed. They go to our online emergency contact page to get numbers - gone, the servers have no power. They go to the fancy "in-phone" directory listing for our new VOIP phones - gone, the server has no power. They can't send emails; the email system is in the data center. It's a new system, and the redundant backup system is still sitting beside the production system for final testing. Somebody looks up the manager's home phone via 411 and gives him a call, and the word goes out.

Holiday long weekend, NOBODY around. The manager is calling relatives looking for people. The generator gets repaired and fired up. We are at about 4 hours of downtime now. Everything comes back and the UPSes start to charge. We do nothing until the UPSes charge fully... two more hours, then we start to bring everything back up. Our new SAN won't come back. Our new NAS won't come back. Both have been rigorously tested and guaranteed by the manufacturer. Won't come back.

A manager is driving around the city looking for the senior system administrator's car, hoping he is out for dinner or shopping instead of gone for the weekend. His cell phone died last week and hasn't been fixed yet. Believe it or not, he finds him and his family. It only takes an hour more and everything, including the SAN and NAS, is back and functioning.

Time for the DBAs to take over. 23 databases come back no problem - automatic crash recovery, no issues. Three don't. We get one back up and running with an RMAN restore via control files (MAN I love RMAN). The second one is actually our primary RMAN catalog database; it is gone, history. It of course doesn't back up to a catalog server, and the entire blade is corrupted. No hope of fixing it. We rebuild it and restore from an export on tape - back and running in 3 hours. The third one is easily restored via RMAN. During the restore of the export, we find out that the brand new, very expensive tape array is not functioning. That is being rebuilt as I type.

New lesson learned: when you are configuring and testing your redundant systems, don't build them at your local site and move them when done; ship them direct to where they are going and build them there.

Turns out the mail notifications about the UPS kicking in didn't get sent because the UPS on the network gear that the data center's SMTP server sits on failed and that server went down hard. That UPS has a "certified working" sticker from 22 days ago.

No data loss to any of the databases - thank goodness for that. The meetings next week on this are going to be very bad indeed.
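For anyone wondering what "an RMAN restore via control files" looks like in practice, it was roughly the standard nocatalog recipe below. This is only a sketch from memory, not the exact commands we ran - the DBID is made up and it assumes control file autobackups were enabled, so adjust for your own setup:

    $ rman TARGET / NOCATALOG
    RMAN> SET DBID 1234567890;    # example DBID only - yours will differ
    RMAN> STARTUP NOMOUNT;
    RMAN> RESTORE CONTROLFILE FROM AUTOBACKUP;
    RMAN> ALTER DATABASE MOUNT;
    RMAN> RESTORE DATABASE;
    RMAN> RECOVER DATABASE;
    RMAN> ALTER DATABASE OPEN RESETLOGS;

The nice part is that with autobackups you don't need the catalog database at all, which mattered a great deal yesterday since the catalog database was one of the casualties.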

2 comments:

Peter K said...

Yikes, sounds similar to our setup.

Herod T said...

It is supposed to be a very good setup. Very redundant. But everything failed.

Boy oh boy I learned a few new words and phrases last week.

Turns out construction workers at another site on the same grid caused a massive feedback that blew the really big transformers/power-station things that you see around industrial parks. They apparently backed a 20-ton crane over a transformer.

Peter, I would expect the Government to have a slightly better setup than that. We pay enough taxes in Canada, especially in BC :) Joke... not starting a flame.