A.6 Redundancy and Wireless

Around Vineyard.NET's third anniversary, I started worrying about the possibility of a fire and what it would do to our company. At that time, Vineyard.NET was located completely within my house, a 150-year-old wood frame building. One careless night with wine and some candles, and the whole ISP could be history.

I had heard of something called business continuity insurance, so I called my insurance agent and asked what it was all about. He said that I would need to prepare a description of Vineyard.NET for the underwriters, covering the sort of problems that we could encounter, how much revenue we would lose each month, and how we would recover after a loss. I chuckled; he was asking me for the very sort of disaster recovery plan that we advocate in earlier chapters of this book.

Vineyard.NET's first disaster recovery plan was pretty pathetic. "Well, basically we would set up shop in a new building, buy all new equipment, and have the phone company pull in new circuits," I explained to my insurance agent.

"So you would be down for a month? How much money would you lose?" he asked.

"Well, we would probably be down for 45 days, because these high-speed circuits can take a long time to get installed," I said. "But by that time, all of our customers would have left and gone elsewhere. So it would basically wipe out the business."

I realized that we didn't have an insurable risk. Before I could expect an insurance company to stand behind my company, we needed to improve the company's disaster planning so that there was something to stand behind.

Given that my primary concern was the possibility of fire, and my secondary concern was the possibility of theft, the logical thing for Vineyard.NET to do was to set up a second machine room in a second building. We had a location: the basement of our largest reseller, Educomp. All we needed to do was to get the phone company to pull a second 100-pair network cable to that location, put some spare equipment there, and be all ready for the eventual fire that we hoped would never come.

A quick call to the phone company revealed that this plan was more complicated than it seemed. Bell Atlantic[B] said that it would not put facilities in a location without an order. Furthermore, once we placed the order, getting the facilities installed would require that a new conduit be installed between our new building and the manhole in the street; the operation would cost thousands of dollars and require shutting down the street again. And this time, we would need to pay for the work ourselves, as the lines were being installed in a commercial facility.

[B] Bell Atlantic and NYNEX had merged to become Bell Atlantic.

We thought of this expense as the first installment of our insurance policy.

There Are Disasters and Disasters

Three retired men in their 50s are all lounging on a beach in Miami, Florida. They start talking and discover they had all lost their businesses a few years ago and taken the insurance as an early retirement. "We had a big store with a big attached warehouse," says the first man. "Then one night there was a huge fire. It took out everything; there was no way to rebuild. Fortunately, nobody was hurt."

"Wow, that same thing happened to me," says the second man. "But it wasn't a warehouse fire; it was my restaurant that burned down. It was really nice: we had spent a fortune trying to achieve a certain ambiance. Thankfully, all of the art had been appraised the previous year."

Then the third man speaks. "I had a small factory that built custom furniture. We were down by the river. There was a big flood and it ruined all of our inventory."

The first guy looks at the third. "How did you arrange a flood?"

A.6.1 Linking Primary to Backup

While the phone company was working on providing facilities to our new location, we started working on the second part of the problem: figuring out a way to tie together the two machine rooms. A number of approaches presented themselves:

  • We could have a T1 installed between the two locations. This would certainly be the simplest approach, but it would cost at least $600 per month, indefinitely.

  • We could have a fiber optic cable strung between the two locations on the utility lines. I checked around and learned that pole rentals were cheap, typically $5/month for each pole passed. The problem with pulling our own fiber was the up-front cost: we would have to have an engineer create a wiring plan and get it approved by the electric company, the phone company, the town, or all three. And once we had the wires up, we would be responsible for the maintenance: if a repair was ever needed, we would have to arrange for it ourselves. The more we looked into this approach, the clearer it became that this was not the way to go.

  • We could set up a wireless link using unlicensed wireless hardware. After all, the two buildings were less than 1000 feet apart, and we could clearly see one from the other.

We decided to go with the wireless approach. The first equipment that we tried came from an Israeli company named Breezecom. This equipment operated at 2.4 GHz using the 802.11 frequency-hopping standard. After a few months of trials, we gave up on the Breezecom equipment: it simply was not reliable enough. Our next try was with hardware from a company called C-Spec. The hardware was basically a 486 PC with a Lucent 915 MHz frequency-hopping Wavelan card and special software that C-Spec had written. The C-Spec equipment cost more than the Breezecom, but it worked without problems.

A.6.2 Building the Backup Site

Vineyard.NET's largest reseller was extremely happy with its new high-speed wireless connection to the Internet; for the previous three years, Educomp's only connection to the Internet had been multiple dialup connections. But for Educomp, getting the wireless to work had been quite easy: the wireless system was an Ethernet bridge, so all we needed to do was to plug one wireless system into Vineyard.NET's Ethernet and plug the second one into Educomp's. Getting the wireless system to be usable for Vineyard.NET required considerably more work.

The first question that we were faced with, of course, was "What do we want to do with the backup site?" We knew that we wanted it to be our backup site, so we decided that we needed a backup computer system there. We took an old PC that we had upgraded, put some big SCSI hard drives on it, and put it in a rack in Educomp's basement. We added to the setup a rack of 16 modems and a Cisco 2516 router. Normally, the Cisco would simply be an access server. But if our main building ever burned, we would have the phone company jumper the T1 to the new location and we could use the 2516 as our upstream router as well.

Once we had the computer at the backup site operational, our next order of business was to make it truly functional. We set up a series of jobs on our primary computer that would automatically back up the hard drives to the backup system on a daily basis. Then we set up another job that would copy over our most critical files (the accounting files, people's email, and so on) on an hourly basis.
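The hourly job described above can be sketched in a few lines of Python. This is only an illustration, not the scripts Vineyard.NET actually ran: the function name `copy_changed_files` is hypothetical, and the sketch assumes that both directories are reachable as local paths (for example, over a network mount), which the text does not state. It copies a file only when the backup copy is missing or older than the original.

```python
import shutil
from pathlib import Path


def copy_changed_files(source_dir, backup_dir):
    """Copy files whose backup copy is missing or older than the source.

    Hypothetical helper: returns the relative paths that were copied.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    copied = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        # Skip files whose backup copy is at least as new as the source.
        if dst.exists() and dst.stat().st_mtime_ns >= src.stat().st_mtime_ns:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(rel.as_posix())
    return copied
```

A scheduler such as cron could run a function like this every hour against the critical directories and once a day against everything else; because unchanged files are skipped, the hourly run stays cheap.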

Although the backup system was designed to help us survive a fire, we quickly realized that having a secondary system would also make it possible for us to survive a server crash, something that was far more likely. In the event that our primary server crashed, we wanted the backup server to be able to take over from the primary. This meant that it needed to be able to serve web pages, accept mail, and generally pretend to be the primary system.

To make this illusion successful, we gave the backup computer the IP address of our secondary nameserver. We set up a copy of our web server so that the backup computer could serve the web pages for all of our customers. We further modified the system so that some of the scripts would notice if they were running on the backup system and, if so, not execute. We decided that it was simply easier to prevent users from changing their passwords or account options while they were running on the backup server, rather than try to figure out how to propagate the changes from the backup systems back to the primary system.
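A script that refuses to run on the backup system needs some way to tell where it is. The following Python sketch shows one possible check, based on the machine's hostname. Everything in it is an assumption made for illustration: the hostname `backup.example.net`, the function names, and the message shown to the user are all hypothetical, and the text does not say how the original scripts made this decision.

```python
import socket

# Hypothetical hostname of the backup machine; not taken from the text.
BACKUP_HOSTNAMES = {"backup.example.net"}


def running_on_backup(hostname=None):
    """Return True if this script appears to be running on the backup host."""
    name = (hostname or socket.gethostname()).lower()
    return name in BACKUP_HOSTNAMES


def change_password(user, new_password, hostname=None):
    """Refuse account changes on the backup host; otherwise accept them."""
    if running_on_backup(hostname):
        # Changes made here would never be propagated to the primary.
        return "Account changes are temporarily unavailable; please try again later."
    # The real update of the account database would go here.
    return f"Password updated for {user}."
```

Checking the hostname rather than the IP address has one convenient property in a setup like this: the check keeps working even if the backup machine temporarily takes over the primary's IP address.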

Finally, we waited.

A.6.3 Failover and Back!

Over the following two years we had very little use for the online backup server beyond restoring files: whenever we accidentally deleted a file, we could get a copy from the backup computer.

Then in the fall of 1998, our backup system got its first real test. One afternoon, everything on our primary computer started to go haywire. Our ls and du commands were dumping core. We thought that we were either under attack or had suffered a really serious hardware problem. But then we noticed that other, dramatically more complicated programs were working fine: we were able to log into the computer using ssh, and emacs still worked perfectly.

We tried to debug the problem with BSDI, but nobody had heard of the problems that we were having. We explained that we clearly had an operating system bug; we needed help. The best help that BSDI could give us, I said, was the source code to the ls program. I could then compile the program with debug symbols, see where the crash was, and figure out what we had done to trigger the problem.

But BSDI refused to give us the source code to the system that we had: "We just don't do that," I was told. Vineyard.NET could purchase the source code, but they would not give it to us, not even to help us find an operating system bug.

After an hour of screwing around with BSDI, we decided that we were on our own. For the first time since we had built the system, we switched over to run completely off the backup system. Doing the switchover was far easier than I thought it would be: we simply copied the current mail files from our primary system to the backup system, then we halted the primary system and gave its IP address to the backup. Suddenly Vineyard.NET was up and running again. Our customers never found out that we were running on a machine with a fraction of the capacity of the primary.

An hour later, an engineer at BSDI called me back. He said that he couldn't give me the source code, but he could give me a specially compiled version of the ls command with debug symbols left in. I ran the program, it crashed, and I examined the core dump. According to the core dump, the program had crashed while calling getgrent(), a function that reads through the /etc/group file. I examined the file and discovered that it had some trailing blank lines. I removed the blank lines and the problem went away. Apparently the extra blank lines had tickled a bug in the BSDI shared library. Programs like emacs and ssh were not affected because our copies of these programs had been compiled and linked before we had upgraded our system to the 3.0 release of the operating system, so they were using the 2.0 shared library, which did not have the bug.
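A stray blank line in a file like /etc/group is easy to miss by eye and easy to catch with a small check. The Python sketch below scans the text of a group file and reports blank or malformed lines. It is an illustration only: the function name `find_bad_group_lines` is hypothetical, the four-field `name:password:gid:members` layout is the conventional one, and skipping lines that begin with `#` is an assumption that holds on some systems but not all. It does not reproduce the library bug itself.

```python
def find_bad_group_lines(text):
    """Return (line_number, reason) pairs for suspicious group-file lines."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "blank line"))
        elif line.startswith("#"):
            continue  # assumption: comment lines are allowed on this system
        elif len(line.split(":")) != 4:
            problems.append((number, "expected 4 colon-separated fields"))
    return problems
```

Running a check like this after every hand edit of a system file, and before any operating system upgrade, would flag the kind of trailing blank lines described above.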

With the problem diagnosed, we now could switch back to our primary system. There was only one problem: we couldn't figure out a way to do this without interrupting service. At 4:00 a.m. that night, we turned off our SMTP server, copied everybody's mail files back to the primary system, and moved back the IP addresses.

Web Security, Privacy & Commerce
Web Security, Privacy and Commerce, 2nd Edition
ISBN: 0596000456
EAN: 2147483647
Year: 2000
Pages: 194
