Contingency Planning | The Concise Guide to DNS and BIND

A question often asked is "Why do I need redundant DNS servers? When my services fail they don't work anyway so DNS is of no use." The premise is false. RFC 2182 (also known as BCP 16) covers the selection, placement, and operation of redundant nameservers, and you should read it. Section 3.3 deals with this myth:

"An argument is occasionally made that there is no need for the domain name servers for a domain to be accessible if the hosts in the domain are unreachable. This argument is fallacious.

"Clients react differently to inability to resolve than inability to connect, and reactions to the former are not always as desirable.
"If the zone is resolvable yet the particular name is not, then a client can discard the transaction rather than retrying and creating undesirable load on the network.
"While positive DNS results are usually cached, the lack of a result is not cached. Thus, unnecessary inability to resolve creates an undesirable load on the Net.
"All names in the zone might not resolve to addresses within the detached network. This becomes more likely over time. Thus a basic assumption of the myth often becomes untrue.

"It is important that there be nameservers able to be queried, available always, for all forward zones."

The most notable reaction of clients is that mail will bounce if there are no MX records available. Lost emails are lost opportunities, although some might find the effective way it stops spam a relief.

Regarding the second item: if a DNS query fails at first, the DNS server will try and try again. It will retry every which way it can in a determined effort to succeed, and this causes extra network traffic. If a lot of clients are retrying in this manner, it can add up to a lot of extra load. And then perhaps the client decides to retry the query as well, which causes the DNS server to go through the whole retry dance again.

While newer versions of BIND do cache the lack of results, avoiding storms of retries, this caching is in itself disruptive when your service comes back up: because of negative caching, the service will appear unavailable even though it is in fact available.
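The effect of negative caching can be illustrated with a toy model. The sketch below is not how BIND implements it; the class name, the one-hour negative TTL, the name www.penguin.bv, and the address are all assumptions made for the illustration, and time is passed in explicitly so the behavior is easy to follow.

```python
class NegativeCachingResolver:
    """Toy resolver that remembers failed lookups for `negative_ttl` seconds."""

    def __init__(self, lookup, negative_ttl):
        self.lookup = lookup              # callable: name -> address or None
        self.negative_ttl = negative_ttl  # assumed value, in seconds
        self.failed_until = {}            # name -> time at which the failure expires
        self.upstream_queries = 0

    def resolve(self, name, now):
        if now < self.failed_until.get(name, 0):
            return None                   # answered from the negative cache
        self.upstream_queries += 1
        address = self.lookup(name)
        if address is None:
            self.failed_until[name] = now + self.negative_ttl
        return address


if __name__ == "__main__":
    zone = {}                             # stands in for unreachable nameservers
    resolver = NegativeCachingResolver(zone.get, negative_ttl=3600)

    print(resolver.resolve("www.penguin.bv", now=0))     # None, asked upstream
    zone["www.penguin.bv"] = "192.0.2.80"                # name service is back
    print(resolver.resolve("www.penguin.bv", now=60))    # None, still cached
    print(resolver.resolve("www.penguin.bv", now=3600))  # 192.0.2.80
    print(resolver.upstream_queries)                     # 2
```

The second lookup never reaches the restored servers, which is the disruption described above: the cached failure saves traffic but hides the recovery until it times out.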

The last point goes to the fact that over time most companies will place one or more services outside their own network: for example, a secondary mail server, a Web server at a Web hotel, or perhaps an application at an ASP. Thus those servers are in fact available even if your own network is cut off. In these cases your provider will likely be willing to be a DNS slave for your zones, so this is easily solved.

Internal Redundance

All in all, you should provide several nameservers. Internally in your company you might want to set up servers that are used as resolvers by the machines on the inside. Having at least two of them eliminates a single point of failure, which is simply good design. Internal name service is not a heavy task in most cases. You can give the job to almost any machine. Just list up to three of them in /etc/resolv.conf (or equivalents on other OSes) and your OS resolvers will try each one in turn until they find one that works.
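As a rough sanity check of the advice above, the following sketch counts the nameserver lines in resolv.conf-style text and warns when there is no redundancy. The sample addresses are hypothetical (taken from a documentation address range), the helper names are made up for this example, and the limit of three simply mirrors the "up to three" mentioned above.

```python
# Sample resolv.conf content; the addresses are hypothetical placeholders.
SAMPLE_RESOLV_CONF = """\
# internal resolvers, tried in the order listed
search penguin.bv
nameserver 192.0.2.10
nameserver 192.0.2.20
nameserver 192.0.2.30
"""


def list_nameservers(text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop '#' comments
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def check_resolver_redundancy(text, limit=3):
    """Return a list of warnings; an empty list means no problem was found."""
    servers = list_nameservers(text)
    warnings = []
    if len(servers) < 2:
        warnings.append("fewer than two nameservers: single point of failure")
    if len(servers) > limit:
        warnings.append("more than %d nameservers: the extras may be ignored" % limit)
    if len(set(servers)) != len(servers):
        warnings.append("duplicate nameserver entries add no redundancy")
    return warnings


if __name__ == "__main__":
    print(list_nameservers(SAMPLE_RESOLV_CONF))
    print(check_resolver_redundancy(SAMPLE_RESOLV_CONF) or "looks redundant")
```

To check a real machine, read the contents of /etc/resolv.conf and pass them to check_resolver_redundancy in place of the sample text.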

If you have a large network inside one large building or across several buildings, you might want to section your network and set up each section to be as autonomous as possible, providing as many as possible of the network services each section needs inside the section. Setting up just one nameserver in each section, and using the servers outside the section as secondary servers would make sense in such a case. But if you have a large, or huge, organization please see the section on practical uses of forwarding later in this chapter.

Having at least DNS work will allow people to get a lot of things done that they might not have been able to do if DNS did not work, even if other, independent things in your network are nonfunctional. As long as DNS and your external link are up, they can at least keep themselves entertained while the file servers are down.

External Redundance

Externally you should also have redundant name service. Your ISP is a likely candidate for hosting your secondaries, but an out-of-town or even foreign company that you work or partner with is better. The less infrastructure, public or Internet, the servers share, the better: they will be less likely to fail all at once.

Another important factor is that the Internet does in fact suffer from failures quite often, even though most of us don't notice them. Such a failure can, for example, cut a nameserver connected to Qwest in Norway off from the rest of the Internet for hours, or in bad cases for days. You should plan for bad cases such as backhoe operator error, irresponsible anchoring practices close to undersea cabling, or what insurance companies refer to as acts of God. These occurrences can keep things down for many days, or even weeks. Norway, like a large part of the world, is blessed with extremely stable power, few floods, and few earthquakes or volcanic eruptions. But you might live in, or be affected by, parts of the world where normalcy is disrupted for days or weeks by floods or storms, which can bring down a lot of things that normally never have problems. You should plan for this; your DNS should be available all through these kinds of catastrophes.

For these reasons it should be easy to understand that having your "redundant" nameservers on one box with two IP addresses defeats the purpose of having redundant servers. Sure, you fulfill the requirement, but only to the letter. Placing your external nameservers on one and the same LAN segment is also unwise, as is placing them in the same building, where power can go out all at once.
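One crude way to spot nameservers that sit too close together is to group their addresses by network. The sketch below assumes IPv4 and treats a shared /24 as a sign of shared infrastructure; that prefix length, the function name, and every name and address except ns.penguin.bv are assumptions for illustration. Different networks do not prove that servers are in different buildings, so this catches only the most obvious cases.

```python
import ipaddress
from collections import defaultdict


def shared_networks(nameservers, prefix_len=24):
    """Group nameserver names by the IPv4 network their address falls in.

    Returns only the networks that hold more than one nameserver.
    """
    groups = defaultdict(list)
    for name, address in nameservers.items():
        network = ipaddress.ip_network("%s/%d" % (address, prefix_len), strict=False)
        groups[str(network)].append(name)
    return {net: sorted(names) for net, names in groups.items() if len(names) > 1}


# Hypothetical nameservers with documentation-range addresses.
EXAMPLE = {
    "ns.penguin.bv": "192.0.2.53",
    "ns2.penguin.bv": "192.0.2.54",        # same /24 as ns: poor placement
    "ns.partner.example": "198.51.100.7",  # off-site secondary
}

if __name__ == "__main__":
    for network, names in shared_networks(EXAMPLE).items():
        print("%s is shared by: %s" % (network, ", ".join(names)))
```

With the example data, the first two servers are reported as sharing a network, while the off-site secondary is not.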

In closing I'd like to recommend, once again, RFC 2182. Read it. Live it. It's not even expensive! An old 486 junker running Linux or a BSD can handle large amounts of DNS traffic.

Extended Outages

Extended outages cause problems that are normally not encountered or considered. Consider the SOA record: it has one extremely important field, the expire field. When the time set by the expire field has passed without any contact with the master server, a slave server will consider the zone expired, null and void, and the zone will cease to exist on that slave. A good, high value for the expire field is therefore important. In Chapter 2 we saw this SOA record for the penguin.bv zone:

    @       3600    SOA     ns.penguin.bv.  hostmaster.penguin.bv. (
                    2000041300      ; serial
                    86400           ; refresh, 24h
                    7200            ; retry, 2h
                    3600000         ; expire, 1000h
                    172800          ; minimum, 2 days
                    )

A value of 1,000 hours is about 42 days. It has been usual practice to set the expire field to one week. A week is next to no time in the face of major breakdowns of civilization. Of course a company can go broke after 30 days without being able to restore itself, but at least you will not have contributed to it by setting the expire value too low.
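The arithmetic behind these figures is easy to check. The sketch below takes the timer values from the penguin.bv SOA record and converts them to hours and days; the helper names are made up for this example.

```python
# SOA timer values (in seconds) copied from the penguin.bv record.
SOA_TIMERS = {
    "refresh": 86400,
    "retry": 7200,
    "expire": 3600000,
    "minimum": 172800,
}

ONE_WEEK = 7 * 24 * 3600  # the customary expire value mentioned in the text


def describe(seconds):
    """Format a duration in seconds as hours and days."""
    return "%d s = %.1f h = %.1f days" % (seconds, seconds / 3600, seconds / 86400)


def survivable_days(expire_seconds):
    """Days a slave keeps serving the zone without reaching its master."""
    return expire_seconds / 86400


if __name__ == "__main__":
    for field, value in SOA_TIMERS.items():
        print("%-8s %s" % (field, describe(value)))
    print("one week expire: %.0f days of cover" % survivable_days(ONE_WEEK))
    print("penguin.bv expire: %.1f days of cover" % survivable_days(SOA_TIMERS["expire"]))
```

The expire value of 3,600,000 seconds works out to 1,000 hours, or roughly 41.7 days, against 7 days for the customary one-week setting.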