13.5. Disaster Survivability

The ability of a network to survive isolated equipment and link failures is called survivability. It's a subject that, like networking itself, is addressed at every layer of the OSI model. Backup power supplies, dynamic routing, and remote-survivable dial-plans (groups of phones that can call each other even when their access to the PBX is cut off) operate at the physical, network, and application layers, respectively.

13.5.1. Surviving Power Failures

The most fundamental survivability measures occur at the physical layer, starting with backup power. If your site's primary source of electricity goes dead and you have no backup power, your phone system goes down with it. Whether you use standalone battery systems or a combination of batteries, a generator, and a transfer switch, backup power is a requirement in all data centers and at all crucial network connection points.

Unlike analog phones, which draw their power from the PSTN line itself, VoIP systems depend on local power. They will fail during power outages unless they have adequate backup power.


13.5.1.1 Multiphase power

Most small offices and residences receive their AC power in the form of a single 120/240-volt connection. This connection feeds a circuit-breaker block that distributes individually limited power circuits throughout the premises. When the power fails at the breaker block, the power fails for the entire premises.

But when power is delivered as multiphase service, it can provide redundancy. Multiphase power means that the same connection to the electric company delivers two or three AC supplies to the subscriber's premises. The supplies are connected to different sections of the breaker block or to different breaker blocks. So, when a single phase fails, the other phases are still intact, and equipment on the failed phase can be moved to them. This won't eliminate all failures, but it can protect you against certain kinds of failures that occur within the electric company's facilities.

13.5.1.2 Uninterruptible power supplies

In order to survive a power failure, all of your network equipment must remain running: switches in phone closets, servers at the data center, and the IP phones themselves. This means you either have to back up every device individually, using uninterruptible power supplies (UPSs), or create a centralized power distribution system. One way to do the latter is to place a backup switch with battery and/or generator at a central location and then pull AC wiring from the backup system to each of your phone closets. This way, each critical phone closet has an AC source that's backed up centrally.

For IP phones, use PoE, and make sure the powered switches or injectors are backed up, too. The moral is this: it won't do any good to have your Linux PBX server on a high-quality backup power system if your phones and switches aren't on one, too.
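
This check can be done mechanically. The following Python sketch assumes a hypothetical inventory in which each device records whether it has its own backup power and which upstream device (such as a PoE switch) supplies its power. The `Device` structure and the device names are illustrative only and are not taken from any real tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Device:
    name: str
    backed_up: bool                    # on a UPS or a centrally backed-up AC circuit
    powered_by: Optional[str] = None   # upstream device supplying power (e.g. a PoE switch)


def survives_outage(name: str, inventory: Dict[str, Device]) -> bool:
    """A device survives if it, or something up its power chain, has backup power."""
    seen = set()
    current = inventory[name]
    while True:
        if current.backed_up:
            return True
        # No upstream source, or a loop in the inventory: nothing will keep it alive.
        if current.powered_by is None or current.powered_by in seen:
            return False
        seen.add(current.name)
        current = inventory[current.powered_by]


def unprotected(inventory: Dict[str, Device]) -> List[str]:
    """Names of all devices that would go dark in a power failure."""
    return sorted(n for n in inventory if not survives_outage(n, inventory))


# Hypothetical site: the PBX is protected, but the closet switch is not.
inventory = {d.name: d for d in [
    Device("pbx-server", backed_up=True),
    Device("closet-switch", backed_up=False),
    Device("ip-phone-101", backed_up=False, powered_by="closet-switch"),
]}
print(unprotected(inventory))  # ['closet-switch', 'ip-phone-101']
```

In this made-up inventory, putting the closet switch on a UPS clears both entries, because the PoE phone inherits its protection from the switch.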

13.5.2. Surviving Network Link Failures

Redundancy is your best defense against network link failures, meaning those that affect only an individual link or device, like a single T1 or an Ethernet switch. If a network link is absolutely critical, there should be, if at all possible, a redundant alternate link that provides an identical logical path.

Point-to-point T1s can be made more resilient to failure by bonding them together into multilink bundles. This way, one of the T1s can fail without totally downing the networking pathway. Moreover, two T1s running through two different providers' networks are more resistant to failure than a pair that runs through only one network.

But redundancy costs money. It may be tough to justify a completely redundant network and even tougher to manage one so that, when failures occur, it behaves as originally envisioned. A networking variant of Parkinson's Law holds that whatever capacity you make available, your applications will come to depend on it and grow to exhaust it, even if it was placed there for backup reasons to begin with. So, even if you have double the capacity needed for every link in the name of redundancy, you may still find yourself in a state of panic when that capacity is merely reduced.
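
The risk described above comes down to simple arithmetic. As a rough sketch, the Python function below checks whether a bundle of identical links could still carry the peak load after losing one member. It uses the T1 line rate of 1.544 Mbps; usable throughput is somewhat lower, and the peak-load figures in the examples are hypothetical.

```python
T1_MBPS = 1.544  # line rate of a single T1; usable payload is slightly less


def survives_single_failure(link_count: int,
                            peak_load_mbps: float,
                            link_mbps: float = T1_MBPS) -> bool:
    """True if the bundle can still carry the peak load after losing one link."""
    remaining_mbps = (link_count - 1) * link_mbps
    return peak_load_mbps <= remaining_mbps


# Two bonded T1s bought for redundancy, with traffic still sized for one link:
print(survives_single_failure(2, 1.2))  # True

# The same bundle after traffic has grown into the "spare" capacity:
print(survives_single_failure(2, 2.0))  # False
```

Running a check like this against measured peak load from time to time is one way to notice that a backup link has quietly become production capacity.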

13.5.2.1 PSTN trunk failures

Some types of network links are easier to make redundant than others. IP links can be automatically failed over using dynamic routing at the network layer, but voice T1s and phone lines aren't so simple. A PRI, for example, may go down, and when it does, all of its DID numbers and inward signaling configuration will become unavailable to the PBX. Even if a second PRI exists that the PBX can use for outbound calls, the telephone company will have to perform an emergency switchover in order to reroute inbound calls to the second circuit.

The same is true of POTS and Centrex lines. If you have 10 POTS lines in a hunt group and the line with the published number (the lead line) experiences a failure, you'll have to contact the phone company to get all calls to that line forwarded to the next line in the group, until the problem with the first line is resolved. Phone companies do offer high-availability solutions for these scenarios at your expense, so contact your local phone company to see what it offers.

Hot failover (instant, user-transparent switching from one telco circuit to another) is difficult to achieve. Some trunk bypass switches can redirect private trunks from one T1 to another, but this can create challenges for DID, caller ID signals, and call routing. Plus, it isn't exactly cheap to maintain backup PRIs merely for the sake of failover.

Here's what to do if your PRI or POTS trunk goes down:

  • Have the phone company forward calls from your lead number to a backup line.

  • If the failure is in a POTS hunt group, have them "busy out" the failed lines so calls will roll to the next line in the group, which is presumably still working fine.

  • Some phone companies let you manage your Centrex groups by software or web interface. Make the appropriate changes yourself.
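
To make the "busy out" step concrete, here is a simplified Python model of linear hunting. It is only a sketch of the idea: real hunting happens inside the phone company's switch, and the telephone numbers are hypothetical.

```python
from typing import List, Optional, Set


def next_available_line(hunt_group: List[str],
                        busied_out: Set[str],
                        in_use: Set[str]) -> Optional[str]:
    """Return the first line, in hunt order, that is neither busied out nor in use."""
    for line in hunt_group:
        if line not in busied_out and line not in in_use:
            return line
    return None  # every line is down or occupied; the caller gets a busy signal


hunt_group = ["555-0100", "555-0101", "555-0102"]  # lead line listed first

# The lead line has failed and the phone company has busied it out:
print(next_available_line(hunt_group, busied_out={"555-0100"}, in_use=set()))
# 555-0101
```

Without the busy-out, the model would keep offering calls to the dead lead line, which matches the way callers to the published number would otherwise never roll to a working line.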


13.5.3. Remote Site Survivability and ALS

While dynamic routing and good network design can make it less likely for link failures to disrupt your voice or videoconferencing application at remote sites, such measures are often too expensive or complex. Moreover, they're low-level measures that don't address the needs of telephony applications during a link failure.

Minimizing the Havoc of a PBX Crash

Few disaster scenarios are more frightening than the loss of a single, critical server... except, perhaps, a single critical PBX server. In the age of distributed computing and PC components, the PC chassis is becoming the new home of the private branch exchange.

The PC brings its well-known characteristics to telephony: cheapness, modularity, extensibility, and, unfortunately, instability. Better PC servers equal better stability, of course, but PC backplanes will never be constructed with the untouchable reliability of old-school PBX systems.

The mere fact that PC servers rely upon hard disks means that PC-based PBXs have a pretty good chance of a downtime-inducing crash. So what can be done to prevent your next-generation dial-tone from dying unexpectedly?

  • Back up your dial-plan regularly and have a standby server ready to go in the event of a failure.

  • Use redundant, mirrored hard drive arrays on your softPBX servers or a central, redundant network-attached drive array to eliminate the threat of hard disk failures.

  • If you use a commercial VoIP platform like Meridian or Avaya Media Server, invest in failover equipment. The biggest advantage of commercial systems over open source ones is that they have reliable, well-tested automatic failover ability.

  • If using Asterisk, you can create an emergency context in a secondary server's dial-plan, one that matches the active dial-plan of the primary server. This way, when the primary server goes down, you can "promote" the secondary just by including the emergency context in its dial-plan. If you wanted to get fancy, you could use the Asterisk Manager API (described in Chapter 17) to trigger the failover automatically and notify an administrator by doing a Dial() to his cell phone.

  • Use a distributed call-switching technology such as DUNDi (discussed later in this chapter) to minimize the effect of a single PBX server's downtime.

  • Use IP-based connections to the PSTN rather than PRIs or POTS lines. If a PBX crashes, it's easier to redirect IP-based connections than it is PRIs or POTS lines.

  • If you're using a PRI attached to a crashed PBX, you can automatically redirect it to a secondary PBX by way of a mechanical T1 failover switch, also called a trunk bypass switch.

  • If an H.323 gatekeeper crashes, it's easy to fail over to a backup. When using multicast locate requests from IP phones, you can configure a backup gatekeeper to listen for requests on the same multicast address as the primary gatekeeper but enable it to respond to those requests only when the primary server has failed.
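
Automatic failover of the kind mentioned in the Asterisk and gatekeeper items above needs some rule for deciding that the primary is really down rather than briefly unreachable. One common approach, sketched here in Python, is to promote the standby only after several consecutive failed health checks. The `probe` and `promote` callables are hypothetical placeholders; how you actually test the primary or activate the emergency context is not shown.

```python
from typing import Callable


class FailoverMonitor:
    """Promote a standby once the primary misses `threshold` checks in a row."""

    def __init__(self, probe: Callable[[], bool],
                 promote: Callable[[], None],
                 threshold: int = 3) -> None:
        self.probe = probe          # returns True while the primary answers
        self.promote = promote      # activates the standby (placeholder)
        self.threshold = threshold
        self.failures = 0
        self.promoted = False

    def tick(self) -> None:
        """Run one health check; call this on a fixed schedule."""
        if self.probe():
            self.failures = 0       # any success resets the count
            return
        self.failures += 1
        if self.failures >= self.threshold and not self.promoted:
            self.promote()
            self.promoted = True    # promote only once per outage


# Simulated probe results: one good check, then the primary goes silent.
results = iter([True, False, False, False, False])
events = []
monitor = FailoverMonitor(lambda: next(results), lambda: events.append("promoted"))
for _ in range(5):
    monitor.tick()
print(events)  # ['promoted']
```

This sketch deliberately never fails back on its own; returning control to the primary is left as a manual decision.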


Fortunately, many IP telephony manufacturers have stepped up with affordable application-layer solutions that guard effectively against the symptoms of link failures.

13.5.3.1 The survivability problem

At a remote office location that has eight IP phones linked together by Ethernet, a single WAN link might provide a connection to the PBX server at HQ for all eight phones. If that link were to fail, the phones would lose their centralized signaling, directory, and call-switching server. The IP phone users wouldn't be able to call other users in the organization, much less PSTN users. They wouldn't even be able to call each other!

To solve this problem, one could put a PBX at this site, but that would be an expensive and wasteful solution. It's also hard to justify putting in a redundant WAN link for such a small office. And what if there were an emergency at the site and public safety services were needed in a hurry? An emergency 911 call couldn't be originated from the remote site without access to the PBX at HQ.

13.5.3.2 The survivability solution

The solution to all of these problems is application layer survivability, or ALS, a generic term that describes a number of technologies that allow sites without full PBX services on-site to survive an unexpected disconnect from the PBX and remain functional until WAN access is restored. Usually, this means redirecting PSTN-bound calls to a locally connected POTS or Centrex trunk. It could also mean redirecting calls that are ordinarily transported over the WAN to the HQ's PBX over the PSTN instead. These on-the-fly changes are handled by a device often referred to as a remote site gateway. This device may be incorporated into the site's WAN access router or IAD, or it could be a standalone, VoIP-oriented Ethernet switch.

ALS isn't a concept that's exclusive to VoIP. Non-telephony systems have been programmed to survive network failures for a long time. Resumable FTP is one example.


Multitech, Avaya, Cisco, Zoom, and others provide integrated devices for small remote offices that can handle ALS. Zoom's X5 device, for example, provides a WAN router, a place to connect to the PSTN, and a place to connect both analog and IP phones. It can also bypass any VoIP trunks and send 911 and other local calls right to an attached POTS line.

Cisco's embedded media gateways can be programmed for just about any ALS scenario using IOS voice commands. If you use analog voice interface cards to hook up POTS lines to these boxes, they can do local 911 bypass, too. Plus, they integrate very easily with Cisco's softPBX, CallManager.

Each device handles ALS in a proprietary way, though all ALS solutions apply the distributed computing model to solve the remote survivability problem. Those that are integrated with a central PBX make a local copy of the dial-plan so that local users can still call one another when the link to HQ is broken. Those that have an FXO port (the kind that connects to a POTS line) allow it to be used for emergency 911 calls regardless of the state of the central PBX.
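
Since each vendor implements this differently, the following Python sketch shows only the general shape of the decision a remote site gateway makes for each call. The route names, the extension numbers, and the rule ordering are assumptions made for illustration, not any vendor's actual logic.

```python
from enum import Enum
from typing import Set


class Route(Enum):
    LOCAL_EXTENSION = "switch locally using the cached dial-plan"
    WAN_TO_HQ = "send to the HQ PBX over the IP path"
    LOCAL_PSTN = "dial out on the local POTS/Centrex trunk"


EMERGENCY_NUMBERS = {"911"}


def choose_route(dialed: str, wan_up: bool, local_extensions: Set[str]) -> Route:
    """Pick a path for one call at an ALS-equipped remote site."""
    if dialed in EMERGENCY_NUMBERS:
        return Route.LOCAL_PSTN        # always local, whatever the WAN is doing
    if wan_up:
        return Route.WAN_TO_HQ         # normal operation: HQ does the switching
    if dialed in local_extensions:
        return Route.LOCAL_EXTENSION   # site phones can still reach each other
    # Everything else goes out over the PSTN. Calls to HQ extensions would
    # also need translating to HQ's public number, which is not shown here.
    return Route.LOCAL_PSTN


site = {"101", "102", "103"}                # hypothetical extensions
print(choose_route("102", True, site))      # Route.WAN_TO_HQ
print(choose_route("102", False, site))     # Route.LOCAL_EXTENSION
print(choose_route("911", True, site))      # Route.LOCAL_PSTN
```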

Some vendors have special nicknames for ALS. Cisco calls its ALS technology SRST (Survivable Remote Site Telephony), and Avaya calls it LSP (Local Survivable Processor). Most ALS solutions include:

  • A local cache of the dial-plan so that phones at the ALS-equipped remote site can still call each other without the central PBX server.

  • Instructions on how to handle calls to the private voice network from the remote site during times when the IP path to HQ is down. These instructions might mean diverting the calls over the PSTN and into the HQ site's PSTN trunks.

  • Parameters on how long to wait after an outage occurs before attempting to give call control back to the central PBX at the HQ.

  • Parameters that describe when and how often to replicate the local cache of the dial-plan.

  • The ability to route 911 calls to a locally connected POTS line or PRI, even when the upstream link is working correctly.

  • The ability to act as the last private signaling point in an LCR call path. For example, if Cleveland users want to call PSTN destinations local to Miami, the WAN can trunk calls from a Cleveland PBX to the Miami remote office, and then the ALS-equipped gateway device can dial these calls on the local POTS line, so they can be local PSTN calls instead of expensive LD calls. (LCR features aren't necessarily tied to failures. Most ALS-equipped gateways also enable LCR with or without regard to survivability.)
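
The wait-before-handing-back parameter in the list above can be pictured as a hold-down timer. The Python sketch below assumes one plausible interpretation: the gateway keeps call control locally until the link to HQ has been continuously up for a configured number of seconds. The class and its names are hypothetical, and products such as SRST and LSP may behave differently.

```python
from typing import Optional


class CallControl:
    """Tracks whether the remote site or the central PBX controls calls."""

    def __init__(self, hold_down_s: float) -> None:
        self.hold_down_s = hold_down_s
        self.local_mode = False
        self.link_up_since: Optional[float] = None

    def update(self, link_up: bool, now: float) -> str:
        """Feed in the current link state; returns 'local' or 'central'."""
        if not link_up:
            self.local_mode = True
            self.link_up_since = None          # any drop restarts the timer
        elif self.local_mode:
            if self.link_up_since is None:
                self.link_up_since = now       # link just came back
            if now - self.link_up_since >= self.hold_down_s:
                self.local_mode = False        # stable long enough: hand back
                self.link_up_since = None
        return "local" if self.local_mode else "central"


ctl = CallControl(hold_down_s=60)
print(ctl.update(True, now=0))     # central
print(ctl.update(False, now=10))   # local
print(ctl.update(True, now=20))    # local (link is back, timer running)
print(ctl.update(True, now=80))    # central (up for 60 seconds)
```

Passing the time in explicitly keeps the example deterministic; a real device would read its own clock. The hold-down keeps a flapping WAN link from bouncing phones back and forth between the gateway and the central PBX.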



Switching to VoIP
ISBN: 0596008686
Year: 2005
Pages: 172