|Red Hat Enterprise Linux 4: Introduction to System Administration
|Chapter 8. Planning for Disaster
As a quick thought experiment, the next time you are in your data center, look around, and imagine for a moment that it is gone. And not just the computers. Imagine that the entire building no longer exists. Next, imagine that your job is to get as much of the work that was being done in the data center going in some fashion, some where, as soon as possible. What would you do?
By thinking about this, you have taken the first step of disaster recovery. Disaster recovery is the ability to recover from an event impacting the functioning of your organization's data center as quickly and completely as possible. The type of disaster may vary, but the end goal is always the same.
The steps involved in disaster recovery are numerous and wide-ranging. Here is a high-level overview of the process, along with key points to keep in mind.
A backup site is vital, but it is still useless without a disaster recovery plan. A disaster recovery plan dictates every facet of the disaster recovery process, including but not limited to:
What events denote possible disasters
What people in the organization have the authority to declare a disaster and thereby put the plan into effect
The sequence of events necessary to prepare the backup site once a disaster has been declared
The roles and responsibilities of all key personnel with respect to carrying out the plan
An inventory of the necessary hardware and software required to restore production
A schedule listing the personnel to staff the backup site, including a rotation schedule to support ongoing operations without burning out the disaster team members
The sequence of events necessary to move operations from the backup site to the restored/new data center
Disaster recovery plans often fill multiple looseleaf binders. This level of detail is vital because in the event of an emergency, the plan may well be the only thing left from your previous data center (other than the last off-site backups, of course) to help you rebuild and restore operations.
While disaster recovery plans should be readily available at your workplace, copies should also be stored off-site. This way, a disaster that destroys your workplace will not take every copy of the disaster recovery plan with it. A good place to store a copy is your off-site backup storage location. If it does not violate your organization's security policies, copies may also be kept in key team members' homes, ready for instant use.
Such an important document deserves serious thought (and possibly professional assistance to create).
And once such an important document is created, the knowledge it contains must be tested periodically. Testing a disaster recovery plan entails going through the actual steps of the plan: going to the backup site and setting up the temporary data center, running applications remotely, and resuming normal operations after the "disaster" is over. Most tests do not attempt to perform 100% of the tasks in the plan; instead a representative system and application is selected to be relocated to the backup site, put into production for a period of time, and returned to normal operation at the end of the test.
Although it is an overused phrase, a disaster recovery plan must be a living document; as the data center changes, the plan must be updated to reflect those changes. In many ways, an out-of-date disaster recovery plan can be worse than no plan at all, so make it a point to have regular (quarterly, for example) reviews and updates of the plan.
One of the most important aspects of disaster recovery is to have a location from which the recovery can take place. This location is known as a backup site. In the event of a disaster, a backup site is where your data center will be recreated, and where you will operate from, for the length of the disaster.
There are three different types of backup sites:
Cold backup sites
Warm backup sites
Hot backup sites
Obviously these terms do not refer to the temperature of the backup site. Instead, they refer to the effort required to begin operations at the backup site in the event of a disaster.
A cold backup site is little more than an appropriately configured space in a building. Everything required to restore service to your users must be procured and delivered to the site before the process of recovery can begin. As you can imagine, the delay going from a cold backup site to full operation can be substantial.
Cold backup sites are the least expensive sites.
A warm backup site is already stocked with hardware representing a reasonable facsimile of that found in your data center. To restore service, the last backups from your off-site storage facility must be delivered, and bare metal restoration completed, before the real work of recovery can begin.
Hot backup sites have a virtual mirror image of your current data center, with all systems configured and waiting only for the last backups of your user data from your off-site storage facility. As you can imagine, a hot backup site can often be brought up to full production in no more than a few hours.
A hot backup site is the most expensive approach to disaster recovery.
Backup sites can come from three different sources:
Companies specializing in providing disaster recovery services
Other locations owned and operated by your organization
A mutual agreement with another organization to share data center facilities in the event of a disaster
Each approach has its good and bad points. For example, contracting with a disaster recovery firm often gives you access to professionals skilled in guiding organizations through the process of creating, testing, and implementing a disaster recovery plan. As you might imagine, these services do not come without cost.
Using space in another facility owned and operated by your organization can be essentially a zero-cost option, but stocking the backup site and maintaining its readiness is still an expensive proposition.
Crafting an agreement to share data centers with another organization can be extremely inexpensive, but long-term operations under such conditions are usually not possible, as the host's data center must still maintain their normal production, making the situation strained at best.
In the end, the selection of a backup site is a compromise between cost and your organization's need for the continuation of production.
Your disaster recovery plan must include methods of procuring the necessary hardware and software for operations at the backup site. A professionally-managed backup site may already have everything you need (or you may need to arrange the procurement and delivery of specialized materials the site does not have available); on the other hand, a cold backup site means that a reliable source for every single item must be identified. Often organizations work with manufacturers to craft agreements for the speedy delivery of hardware and/or software in the event of a disaster.
When a disaster is declared, it is necessary to notify your off-site storage facility for two reasons:
To have the last backups brought to the backup site
To arrange regular backup pickup and dropoff to the backup site (in support of normal backups at the backup site)
In the event of a disaster, the last backups you have from your old data center are vitally important. Consider having copies made before anything else is done, with the originals going back off-site as soon as possible.
A data center is not of much use if it is totally disconnected from the rest of the organization that it serves. Depending on the disaster recovery plan and the nature of the disaster itself, your user community might be located miles away from the backup site. In these cases, good connectivity is vital to restoring production.
Another kind of connectivity to keep in mind is that of telephone connectivity. You must ensure that there are sufficient telephone lines available to handle all verbal communication with your users. What might have been a simple shout over a cubicle wall may now entail a long-distance telephone conversation; so plan on more telephone connectivity than might at first appear necessary.
The problem of staffing a backup site is multi-dimensional. One aspect of the problem is determining the staffing required to run the backup data center for as long as necessary. While a skeleton crew may be able to keep things going for a short period of time, as the disaster drags on more people will be required to maintain the effort needed to run under the extraordinary circumstances surrounding a disaster.
This includes ensuring that personnel have sufficient time off to unwind and possibly travel back to their homes. If the disaster was wide-ranging enough to affect peoples' homes and families, additional time must be allotted to allow them to manage their own disaster recovery. Temporary lodging near the backup site is necessary, along with the transportation required to get people to and from the backup site and their lodgings.
Often a disaster recovery plan includes on-site representative staff from all parts of the organization's user community. This depends on the ability of your organization to operate with a remote data center. If user representatives must work at the backup site, similar accommodations must be made available for them, as well.
Eventually, all disasters end. The disaster recovery plan must address this phase as well. The new data center must be outfitted with all the necessary hardware and software; while this phase often does not have the time-critical nature of the preparations made when the disaster was initially declared, backup sites cost money every day they are in use, so economic concerns dictate that the switchover take place as quickly as possible.
The last backups from the backup site must be made and delivered to the new data center. After they are restored onto the new hardware, production can be switched over to the new data center.
At this point the backup data center can be decommissioned, with the disposition of all temporary hardware dictated by the final section of the plan. Finally, a review of the plan's effectiveness is held, with any changes recommended by the reviewing committee integrated into an updated version of the plan.