No, I haven't lost my job. This is just one of those topics that's good to get out there. There are honestly few things that can make you question whether you'll still have a job the next day, but those things do exist. There are ways to mitigate RGEs (Resume Generating Events), but do you know what they are in your environment?
Service Level Agreements (SLAs)
Gathering information on exactly how important each database is has to happen before a real disaster. Disaster recovery planning relies on knowing your SLAs. You need to know how long a database can be down, which one is the most important, at what point you have to start calling 3rd party people (customers, vendors, app support), how much data you can afford to lose, and at what point you should stop fixing the broken server and fail over to other hardware. I know, I know... no one likes giving these answers. My experience normally goes like this:
Question: How long can the database be down?
Answers: It should never be down. As little as possible. Why? Are we having issues?
Question: Which one is most important?
Answers: They're all important. Can't you bring them all up at once?
Question: At what point do I start calling 3rd party people?
Answers: What's wrong now? That's case by case, call me first.
Question: How much data can we afford to lose?
Answers: None. None. None. Why? What have you done? None. We should never lose data.
Question: At what point should I just stand up a new server?
Answers: We don't have spare servers. Why would you need a new one?
What can I do now?
Well, we can take some preventive action. Some of this is harder than you'd expect without first knowing what your SLAs truly are. Here are a few things you can do today to help until you get those answers.
Find where your backups are stored.
Make sure the backups are stored on different physical media from the databases.
Make sure you test your backups occasionally to see if they're even good (see the backup-check sketch after this list).
Make sure you have all your drivers for anything that's not standard.
Keep a log of which databases are on each server (see the inventory sketch after this list).
Keep a log of the average size (uncompressed) of your databases per server.
Keep a log of the standard settings you use for that server (RAM, drive structure, version number).
Update the phone list, or at least your personal contacts, with everyone you'll need to call when a 2 AM incident happens.
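On the backup-check item, here's a minimal sketch of one way to automate it, assuming SQL Server and the pyodbc package (both assumptions on my part). It pulls the most recent full backup for each database from msdb, flags anything older than a threshold, and runs RESTORE VERIFYONLY against the file. The server name, driver string, and age threshold are placeholders, and VERIFYONLY only proves the file is readable; an occasional test restore to a scratch server is still the real proof.

```python
# Rough sketch (assumptions: SQL Server, pyodbc, and the script running on the machine
# that holds the backup files). Finds the latest full backup per database, warns if it
# is too old or missing on disk, then runs RESTORE VERIFYONLY against it.
import datetime
import os

import pyodbc

MAX_BACKUP_AGE_HOURS = 24  # placeholder threshold -- your SLA should drive this number

LAST_FULL_BACKUP_SQL = """
SELECT bs.database_name,
       bs.backup_finish_date,
       bmf.physical_device_name
FROM msdb.dbo.backupset bs
JOIN msdb.dbo.backupmediafamily bmf ON bmf.media_set_id = bs.media_set_id
WHERE bs.type = 'D'
  AND bs.backup_finish_date = (SELECT MAX(b2.backup_finish_date)
                                 FROM msdb.dbo.backupset b2
                                WHERE b2.database_name = bs.database_name
                                  AND b2.type = 'D');
"""

def check_backups(server):
    # autocommit=True because RESTORE cannot run inside a transaction.
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};Trusted_Connection=yes;",
        autocommit=True,
    )
    cursor = conn.cursor()
    for db, finished, path in cursor.execute(LAST_FULL_BACKUP_SQL).fetchall():
        age = datetime.datetime.now() - finished
        if age > datetime.timedelta(hours=MAX_BACKUP_AGE_HOURS):
            print(f"{db}: last full backup is {age} old ({path})")
        if not os.path.exists(path):
            # Only meaningful when this script runs where the backup files live.
            print(f"{db}: backup file not found at {path}")
            continue
        escaped = path.replace("'", "''")  # escape quotes for the T-SQL string literal
        try:
            cursor.execute("RESTORE VERIFYONLY FROM DISK = N'" + escaped + "'")
            print(f"{db}: backup verified ({path})")
        except pyodbc.Error as exc:
            print(f"{db}: VERIFYONLY failed for {path}: {exc}")
    conn.close()

if __name__ == "__main__":
    check_backups("SQLPROD01")  # placeholder server name
```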
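And for the inventory items, a similarly hedged sketch (same SQL Server and pyodbc assumptions; server names and the CSV path are placeholders) that appends each server's database list, allocated data-and-log size in MB, and version number to a CSV you keep somewhere other than the server itself. Allocated file size isn't exactly the uncompressed average the list mentions, but it's a usable stand-in until you track real numbers.

```python
# Rough sketch: log which databases live on each server, their allocated size, and the
# server's version number to a CSV. Assumes SQL Server and pyodbc; names are placeholders.
import csv
import datetime

import pyodbc

SERVERS = ["SQLPROD01", "SQLPROD02"]  # placeholder server names

INVENTORY_SQL = """
SELECT d.name AS database_name,
       CAST(SUM(mf.size) * 8 / 1024.0 AS decimal(18, 2)) AS size_mb,  -- file size in MB (8 KB pages)
       CAST(SERVERPROPERTY('ProductVersion') AS nvarchar(128)) AS version_number
FROM sys.databases d
JOIN sys.master_files mf ON mf.database_id = d.database_id
GROUP BY d.name;
"""

def inventory(server):
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};Trusted_Connection=yes;"
    )
    try:
        return [(server, r.database_name, float(r.size_mb), r.version_number)
                for r in conn.cursor().execute(INVENTORY_SQL).fetchall()]
    finally:
        conn.close()

if __name__ == "__main__":
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open("db_inventory_log.csv", "a", newline="") as f:  # placeholder output path
        writer = csv.writer(f)
        for server in SERVERS:
            for row in inventory(server):
                writer.writerow([stamp, *row])
```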
Is there some sort of form that can be used?
My next post will include a list of the questions I'd want answered for each database, as well as a short list of questions I need to ask myself. Having a printed list for each database, or set of databases if they share the same requirements, can be a career saver.
I plan on making a form to make this a bit easier. I will at the very least create an Excel or Word list with examples. I think this is good to have for everyone from your most senior DBA to the multi-hat Network Admin who's been forced to manage a rogue database. Having it signed off by your boss may be the difference in keeping your job during a major outage. A little CYA never hurt anyone.
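Until that form exists, here's a rough illustration, nothing more, of the fields I'd expect a per-database sheet to capture, pulled straight from the questions above. The names are my own placeholders, not a finished template.

```python
# Illustrative only: fields a per-database SLA / recovery sheet might capture.
# Names are placeholders, not a finished form.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatabaseRecoverySheet:
    server: str
    database: str
    max_downtime_minutes: int          # "How long can the database be down?"
    recovery_priority: int             # "Which one is most important?" (1 = restore first)
    max_data_loss_minutes: int         # "How much data can we afford to lose?"
    escalation_contacts: List[str] = field(default_factory=list)  # who to call, and in what order
    failover_hardware: str = ""        # where to stand it up if the box is a loss
    signed_off_by: str = ""            # the boss's signature -- the CYA part
```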
Another preventive action that is often overlooked is to have your entire team document the changes they make on your servers. By team I mean Sys Admins, Network Admins, Database Admins, Developers, etc. That can help pinpoint an issue faster than digging through your logs to find out what went wrong.
I need a resume generating thingamachinga ASAP!
Company-wide documentation and coordination on outages and changes would prevent a large majority of failures. A second pair of eyes is normally a good thing. RGEs are seldom good things. ^.^