13.5. Disaster Survivability

The ability of a network to survive isolated equipment and link failures is called survivability. It's a subject that, like networking itself, is addressed at every layer of the OSI model. Backup power supplies, dynamic routing, and remote-survivable dial-plans (groups of phones that can call each other even when their access to the PBX is cut off) operate at the physical, network, and application layers, respectively.

13.5.1. Surviving Power Failures

The most fundamental survivability measures occur at the physical layer, starting with backup power. Without backup power, your site's primary source of electricity may go dead, taking your phone system with it. Whether you use standalone battery systems or a combination of batteries, a generator, and a transfer switch, backup power is a requirement in all data centers and at all crucial network connection points.

13.5.1.1 Multiphase power

Most small offices and residences receive their AC power in the form of a single 120/240-volt connection. This connection feeds a circuit-breaker block that distributes individually limited power circuits throughout the premises. When the power fails at the breaker block, the power fails for the entire premises. But when power is delivered in multiple phases, it can provide some redundancy. Multiphase power means that the same connection to the electric company can deliver two or three AC supplies to the subscriber's premises. The supplies are connected to sections of the breaker block or to different breaker blocks. So, when a single phase fails, the other phases are still intact, and equipment on the failed phase can be moved to them. This won't eliminate all failures, but it can protect you against certain kinds of failures that occur within the electric company's facilities.

13.5.1.2 Uninterruptible power supplies

In order to survive a power failure, all of your network equipment must remain running: switches in phone closets, servers at the data center, and the IP phones themselves. This means you either have to back up every device individually, using uninterruptible power supplies (UPS), or create a centralized power distribution system. One way to do the latter is to place a backup switch with battery and/or generator at a central location and then pull AC wiring from the backup system to each of your phone closets. This way, each critical phone closet has an AC source that's backed up centrally. For IP phones, use PoE, and make sure the powered switches or injectors are backed up, too. The moral is this: it won't do any good to have your Linux PBX server on a high-quality backup power system if your phones and switches aren't on one, too.

13.5.2. Surviving Network Link Failures

Redundancy is your best defense against network link failures, meaning those that affect only an individual link such as a single T1 or an Ethernet switch.
If a network link is absolutely critical, there should be, if at all possible, a redundant alternate link that provides an identical logical path. Point-to-point T1s can be made more resilient to failure by bonding them together into multilink bundles. This way, one of the T1s can fail without totally downing the networking pathway. Moreover, two T1s running through two different providers' networks are more resistant to failure than a pair that runs through only one network.

But redundancy costs money. It may be tough to justify a completely redundant network and even tougher to manage one so that, when failures occur, it behaves as originally envisioned. A corollary of Parkinson's Law holds that whatever capacity you make available, your application will become dependent upon it and grow to exhaust it, even if it was placed there for backup reasons to begin with. So, even if you have double the capacity needed for every link in the name of redundancy, you may still find yourself in a state of panic when that capacity is merely reduced.

13.5.2.1 PSTN trunk failures

Some types of network links are easier to make redundant than others. IP links can be automatically failed over using dynamic routing at the network layer, but voice T1s and phone lines aren't so simple. A PRI, for example, may go down, and when it does, all of its DID numbers and inward signaling configuration will become unavailable to the PBX. Even if a second PRI exists that the PBX can use for outbound calls, the telephone company will have to make an emergency switch in order to reroute inbound calls to the second circuit. The same is true of POTS and Centrex lines. If you have 10 POTS lines in a hunt group and the line with the published number (the lead line) experiences a failure, you'll have to contact the phone company to get all calls to that line forwarded to the next line in the group until the problem with the first line is resolved.
Phone companies do offer high-availability solutions for these scenarios at your expense, so contact your local phone company to see what it offers.
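
The asymmetry between outbound and inbound failover on PSTN trunks can be sketched as a small model. This is only an illustration under stated assumptions, not how any particular PBX or carrier implements trunking: the `Trunk` class, the function names, and the 555-01xx numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Trunk:
    """Hypothetical model of a PRI: its state and the DIDs provisioned on it."""
    name: str
    up: bool = True
    dids: set = field(default_factory=set)  # numbers the telco delivers here


def pick_outbound(trunks):
    """Return the first trunk that is up, or None. The PBX controls this choice."""
    for trunk in trunks:
        if trunk.up:
            return trunk
    return None


def deliver_inbound(trunks, did):
    """Return the trunk an inbound call arrives on, or None if it cannot arrive.

    The telco, not the PBX, decides this: a DID is delivered only on the
    trunk it is provisioned on, so a second healthy trunk does not help.
    """
    for trunk in trunks:
        if did in trunk.dids:
            return trunk if trunk.up else None
    return None


def telco_reroute(src, dst):
    """Model the phone company moving every DID from one trunk to another."""
    dst.dids |= src.dids
    src.dids = set()


pri_a = Trunk("PRI-A", dids={"5550100", "5550101"})
pri_b = Trunk("PRI-B")
trunks = [pri_a, pri_b]

pri_a.up = False                             # the first PRI fails
print(pick_outbound(trunks).name)            # PRI-B: outbound fails over at once
print(deliver_inbound(trunks, "5550100"))    # None: inbound is stranded
telco_reroute(pri_a, pri_b)                  # the emergency switch at the telco
print(deliver_inbound(trunks, "5550100").name)  # PRI-B
```

In this model the PBX recovers outbound calling by itself, while inbound calls stay down until the carrier acts, which is why the high-availability options mentioned above have to be arranged with the phone company ahead of time.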

13.5.3. Remote Site Survivability and ALS

While dynamic routing and good network design can make it less likely for link failures to disrupt your voice or videoconferencing application at remote sites, such measures are often too expensive or complex. Moreover, they're low-level measures that don't address the needs of telephony applications during a link failure.
Fortunately, many IP telephony manufacturers have stepped up with application-layer solutions that guard effectively against the symptoms of link failures without breaking the bank.

13.5.3.1 The survivability problem

At a remote office location that has eight IP phones linked together by Ethernet, a single WAN link might provide a connection to the PBX server at HQ for all eight phones. If that link were to fail, the phones would lose their centralized signaling, directory, and call-switching server. The IP phone users wouldn't be able to call other users in the organization, much less PSTN users. They wouldn't even be able to call each other! To solve this problem, one could put a PBX at this site, but that would be an expensive and wasteful solution. It's also hard to justify putting in a redundant WAN link for such a small office. And what if there were an emergency at the site and public safety services were needed in a hurry? An emergency 911 call couldn't be originated from the remote site without access to the PBX at HQ.

13.5.3.2 The survivability solution

The solution to all of these problems is application layer survivability, or ALS, a generic term for a number of technologies that allow sites without full on-site PBX services to survive an unexpected disconnect from the PBX and remain functional until WAN access is restored. Usually, this means redirecting PSTN-bound calls to a locally connected POTS or Centrex trunk. It could also mean taking calls that are ordinarily transported over the WAN to the HQ's PBX and sending them over the PSTN instead. These on-the-fly changes are handled by a device often referred to as a remote site gateway. This device may be incorporated into the site's WAN access router or IAD, or it could be a standalone, VoIP-oriented Ethernet switch.
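
The redirection decisions a remote site gateway makes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's ALS implementation: the `Route` values, the `route_call` function, the extension numbers, and the mapping from HQ extensions to PSTN numbers are all hypothetical, and sending 911 to the local trunk regardless of WAN state is a design choice assumed for the sketch.

```python
from enum import Enum


class Route(Enum):
    CENTRAL_PBX = "central PBX over the WAN"
    LOCAL_EXTENSION = "local phone via cached dial-plan"
    LOCAL_TRUNK = "local POTS/Centrex trunk"


EMERGENCY_NUMBERS = {"911"}


def route_call(dialed, wan_up, local_extensions, hq_pstn_map):
    """Return (route, number_to_dial) for one call attempt.

    local_extensions: extensions of phones at this site (the cached dial-plan).
    hq_pstn_map: hypothetical map of HQ extensions to full PSTN numbers.
    """
    if dialed in EMERGENCY_NUMBERS:
        # Assumed policy: emergency calls never depend on the WAN or the PBX.
        return Route.LOCAL_TRUNK, dialed
    if wan_up:
        # Normal operation: the central PBX switches every other call.
        return Route.CENTRAL_PBX, dialed
    if dialed in local_extensions:
        # WAN is down, but phones at this site can still reach each other.
        return Route.LOCAL_EXTENSION, dialed
    if dialed in hq_pstn_map:
        # Reach HQ users over the PSTN instead of the failed WAN link.
        return Route.LOCAL_TRUNK, hq_pstn_map[dialed]
    # Anything else is PSTN-bound and goes out the local trunk.
    return Route.LOCAL_TRUNK, dialed


local = {"201", "202", "203"}
hq = {"101": "12025550142"}

print(route_call("202", True, local, hq))         # central PBX handles it
print(route_call("202", False, local, hq))        # stays on site
print(route_call("101", False, local, hq))        # HQ reached over the PSTN
print(route_call("911", False, local, hq))        # local trunk
```

The order of the checks carries the policy: the emergency test comes first so that it holds in every WAN state, and the local dial-plan is consulted only once the central PBX is unreachable.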

Multitech, Avaya, Cisco, Zoom, and others provide integrated devices for small remote offices that can handle ALS. Zoom's X5 device, for example, provides a WAN router, a place to connect to the PSTN, and a place to connect both analog and IP phones. It can also bypass any VoIP trunks and send 911 and other local calls right to an attached POTS line. Cisco's embedded media gateways can be programmed for just about any ALS scenario using IOS voice commands. If you use analog voice interface cards to hook up POTS lines to these boxes, they can do local 911 bypass, too. Plus, they integrate very easily with Cisco's softPBX, CallManager.

Each device handles ALS in a proprietary way, though all ALS solutions apply the distributed computing model to solve the remote survivability problem. Those that are integrated with a central PBX make a local copy of the dial-plan so that local users can still call one another when the link to HQ is broken. Those that have an FXO port (the interface that connects to a POTS line) allow it to be used for emergency 911 calls regardless of the state of the central PBX.

Some vendors have their own names for ALS. Cisco calls its ALS technology SRST (Survivable Remote Site Telephony), and Avaya calls it LSP (Local Survivable Processor). Most ALS solutions include: