High-Level Design Goals


In a perfect world, network administrators would be allowed the luxury of designing a server-based computing infrastructure from the ground up. As this is seldom the case, some fundamental design goals are necessary. The design goals used to successfully baseline an infrastructure to support server-based computing are speed, scalability, resiliency, manageability, auditability, and cost-effectiveness.

Speed

Initially, the concept of "speed" as a critical design criterion may seem contrary to one of the thin-client mantras: reduced bandwidth. Speed in this case refers to speed within the network core, not to clients on the network edge. Server-to-server communications within the network core must be as fast (in terms of raw speed) and as clean (in terms of controlling broadcast and superfluous traffic) as economically possible.

In a load-balanced multiserver environment, users must still log on to the network, and roaming profiles are essential in providing users with a consistent, seamless experience, independent of which server actually services their application needs. These profiles must be retrieved from a central location at logon, before the user's application is available, so any delay in the initial logon and profile download is perceived as poor application performance. Calls from application servers to back-end servers (database, file, and mail servers) need the same rapid response for file opens, database queries, mail messages, and the like. Again, any delay in moving data from server to server is perceived by the user as an application performance problem, when in reality it may be a bandwidth bottleneck in the network core.

Finally, insulating the servers from superfluous network traffic (broadcasts, routing protocols, and so on) improves server performance by eliminating network-driven CPU interrupts. Every Layer 2 broadcast frame (destination ff-ff-ff-ff-ff-ff) forces a network-driven interrupt on every server that "hears" the frame. Common sources of this traffic are older AppleTalk protocols, Microsoft networking with improper NetBIOS name resolution, and Novell Service Advertisement Protocol (SAP) broadcasts.
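
To make the broadcast point concrete, the following is a minimal sketch, assuming a Linux server with raw-socket (root) access, that samples frames arriving at the host and counts how many are Layer 2 broadcasts that every server on the segment must interrupt to process. The sample size and usage are illustrative, not from the original text.

```python
import socket

ETH_P_ALL = 0x0003                        # capture every Ethernet protocol
BROADCAST = b"\xff\xff\xff\xff\xff\xff"   # Layer 2 broadcast destination MAC

def count_broadcasts(frames_to_sample: int = 1000) -> int:
    """Count how many of the next sampled frames are Layer 2 broadcasts."""
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    broadcasts = 0
    try:
        for _ in range(frames_to_sample):
            frame = sock.recv(65535)
            if frame[:6] == BROADCAST:    # first 6 bytes are the destination MAC
                broadcasts += 1
    finally:
        sock.close()
    return broadcasts

if __name__ == "__main__":
    sample = 1000
    hits = count_broadcasts(sample)
    print(f"{hits} of {sample} sampled frames were broadcasts")
```

A high ratio on a server segment suggests broadcast traffic (and the CPU interrupts it causes) is leaking into the network core and should be filtered or segmented away.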

Scalability

It is important not to underestimate network growth requirements. Capacity planning for the network infrastructure is often overlooked or sacrificed to budgetary constraints. Adding a server to an existing server farm is relatively simple (from a technical standpoint) and easy to justify (in terms of hardware and software budgets), because additional servers tie clearly to increases in company requirements or user demand (new applications, more offices, and more users). Justifying upgrades or purchases of infrastructure hardware (LAN switches, routers, and so on) often proves more difficult: from a budgetary view, infrastructure tends to be less visible and is perceived as a "one-time" cost. A decentralized environment migrating to a server-based environment will necessarily require increased resources in terms of servers, LAN capacity, and potentially WAN bandwidth. This chapter provides guidelines for estimating the various parts of the network, but every organization must gauge for itself how much its IT requirements will increase and how much corresponding capacity should be designed into the network. There are two financially equivalent methods for incorporating expandability into the network: a company can either purchase components that are scalable, or choose vendors that provide generous trade-in policies on old equipment.
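
As an illustration of the kind of capacity estimate involved, the following sketch sizes WAN bandwidth for a branch office of thin-client users and then adds headroom for growth. The per-session figure of 20 Kbps is a common planning rule of thumb for a text-oriented session, and the growth rate and planning horizon are assumptions; substitute measured values for your own environment.

```python
def wan_bandwidth_kbps(concurrent_users: int,
                       kbps_per_session: float = 20.0,
                       annual_growth: float = 0.25,
                       years: int = 3) -> float:
    """Bandwidth needed today, compounded by the assumed annual growth rate."""
    today = concurrent_users * kbps_per_session
    return today * (1 + annual_growth) ** years

# Example: 50 concurrent users, 25% growth per year, 3-year planning horizon.
print(f"{wan_bandwidth_kbps(50):.0f} Kbps")   # ~1953 Kbps, already beyond a single T1
```

Running the numbers this way makes the budget conversation easier: the link that comfortably carries today's users may be saturated well before the next planned refresh.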

Resiliency

Resiliency is the ability to easily recover from and adjust to misfortune or change, which is certainly a desirable end state for an enterprise network. Each component should have its own ability to recover from failure or should be part of a larger system of failure recovery. Network resiliency incorporates concepts of both outage mitigation and disaster recovery. Determining what level of resiliency must be incorporated into the network design requires carefully balancing three factors:

  • Level of cost: How much will it cost to build in resiliency versus how much could be lost without it?

  • Level of effort: How much effort is required to implement and manage the resilient network versus how much effort to recover from a failure in a non-resilient network?

  • Level of risk: What is the probability that a specific type of failure will occur, weighed against the cost and effort of including failure mitigation in the network design?

As a general rule, unacceptably high risks are usually addressed by outage mitigation (designed-in redundancy, survivability, or fault tolerance). When the risk does not warrant building in redundancy (for example, planning a hardware solution to mitigate damage from a 500-year flood), disaster recovery planning is usually required instead. Chapter 19 discusses disaster recovery and business continuity planning in detail.
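
The cost/effort/risk balance above can be reduced to simple arithmetic: compare the expected annual loss from an outage against what redundancy would cost each year. The sketch below illustrates this trade-off; all figures are invented for the example.

```python
def expected_annual_loss(outage_probability_per_year: float,
                         cost_per_outage: float) -> float:
    """Probability-weighted loss: how much an outage 'costs' in an average year."""
    return outage_probability_per_year * cost_per_outage

def redundancy_justified(outage_probability_per_year: float,
                         cost_per_outage: float,
                         annual_redundancy_cost: float) -> bool:
    """True when the expected loss exceeds the yearly cost of building in redundancy."""
    return expected_annual_loss(outage_probability_per_year,
                                cost_per_outage) > annual_redundancy_cost

# Example: a 10% yearly chance of a $200,000 outage vs. $15,000/year of redundancy.
print(redundancy_justified(0.10, 200_000, 15_000))   # True: mitigation pays for itself
```

When the comparison comes out the other way, disaster recovery planning, rather than designed-in redundancy, is usually the appropriate response.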

Outage Mitigation

Outage mitigation is really just a fancy term for fault tolerance. When specifying server hardware, system administrators usually take RAID for granted as the way to make hard drives fault tolerant; similar features can be designed into network hardware, connectivity, and services. The end goal is to eliminate failures that impact the production environment. For hardware, consider redundant power sources and supplies, redundant Layer 2 connectivity (dual network cards and switch ports), and redundant network hardware (Layer 2 and Layer 3 processors). For connectivity, consider redundant or self-healing WAN connectivity as well as redundant Layer 2 paths. For services, critical services such as directory services, name resolution, and authentication must be fault tolerant so that a single server failure does not cripple the production environment.
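
A brief availability calculation shows why the redundant components listed above pay off. Components in series multiply their availabilities, while redundant copies in parallel fail only if all copies fail. The availability figures below are illustrative assumptions, not vendor data.

```python
def series(*availabilities: float) -> float:
    """Availability of a chain of components that all must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(availability: float, copies: int) -> float:
    """Availability of N identical redundant components; fails only if all fail."""
    return 1.0 - (1.0 - availability) ** copies

single_switch = 0.995                              # one access switch, ~0.5% downtime
dual_switches = parallel(0.995, 2)                 # dual-homed server: 0.999975
full_path = series(dual_switches, parallel(0.99, 2))  # plus redundant WAN links at 99% each
print(f"single switch: {single_switch:.4%}  redundant path: {full_path:.4%}")
```

Adding a second switch port or WAN link looks expensive in isolation, but the compounded availability of the whole path is what the user actually experiences.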

Disaster Recovery

Disaster recovery comes into play after a catastrophe, or even a serious mishap, that results in losing access to the data center. In such cases, data that has been moved offsite is prepared and put into production at another site. Engineering this capability into the network design at an early stage saves time and prevents having to ask for a budget increase later. An example of this type of technology is offsite data replication: if the storage system replicates some or all of your corporate data to a recovery facility, the loss of your main data center is unlikely to be catastrophic for the company. You can use this data, along with spare hardware and software, to get users back online in a timely manner.
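
One early design question for offsite replication is whether the link to the recovery facility can keep up with the rate of data change; if it cannot, the replica falls behind and the recovery point slips. The back-of-the-envelope sketch below uses assumed numbers to estimate the sustained bandwidth required.

```python
def required_replication_mbps(changed_gb_per_day: float, overhead: float = 1.3) -> float:
    """Average Mbit/s needed to replicate a daily change rate, with protocol overhead."""
    bits_per_day = changed_gb_per_day * 8 * 1024 ** 3 * overhead
    return bits_per_day / (24 * 3600) / 1_000_000

# Example: 40 GB of changed data per day with ~30% protocol/retransmission overhead.
print(f"{required_replication_mbps(40):.1f} Mbps sustained")   # about 5.2 Mbps
```

Sizing the replication link this way, before the recovery facility is built, is far cheaper than discovering the shortfall during a disaster-recovery test.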

Manageability

The extensive work that network equipment vendors have done during the last few years to simplify their equipment's administration requirements makes this design goal almost a given, but it still bears mentioning. Can the IT staff easily access the component's settings? How does this work: through a Web-enabled GUI, or perhaps as a Microsoft Management Console snap-in? Is management of the component self-contained, or does it fit into an overall management architecture such as HP OpenView or CA Unicenter? The component should make it easy to do the following tasks:

  • Check and back up the current settings or configurations to disk.

  • Copy the current settings and make changes to the copy without altering the running configuration, then activate the changes later, either manually or on a schedule.

  • Provide real-time reporting on important system metrics, for example, bandwidth utilization and port statistics such as error rates, retransmissions, and packet loss. Ideally, this information is provided through SNMP, RMON, or some other well-known management protocol (see the polling sketch after this list).

  • If using multiple units of the same type, provide a method to create a standard configuration for each and a method to address and manage all of them centrally. For example, if using Windows terminals, they should allow the downloading of firmware images and settings from a central location.
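
As an example of the SNMP-based reporting mentioned above, the following is a minimal sketch that assumes the third-party pysnmp library, a switch at the placeholder address 10.0.0.2 with a read-only community of "public", and polls the standard ifInOctets counter for interface index 1. None of these particulars come from the original text.

```python
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

# IF-MIB::ifInOctets.1, written numerically to avoid needing compiled MIBs.
IF_IN_OCTETS_1 = "1.3.6.1.2.1.2.2.1.10.1"

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),        # SNMPv2c read-only community
        UdpTransportTarget(("10.0.0.2", 161)),     # placeholder switch address
        ContextData(),
        ObjectType(ObjectIdentity(IF_IN_OCTETS_1)),
    )
)

if error_indication or error_status:
    print("poll failed:", error_indication or error_status.prettyPrint())
else:
    for var_bind in var_binds:
        # Prints the OID and the raw octet counter for the polled interface.
        print(" = ".join(x.prettyPrint() for x in var_bind))
```

Polling counters like this on a schedule, and graphing the deltas, is the simplest way to watch utilization and error rates across a standard configuration of identical units.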

Auditability

Even components that are well designed for both resiliency and manageability are not impervious to occasional unexpected crashes. The components should provide enough detailed system and transaction information to make troubleshooting relatively simple. On many systems, such as routers and switches, troubleshooting is facilitated by detailed logging information (a brief logging sketch follows this list). The log should include

  • Security validations and violations (access denials)

  • Detailed error information

  • Detailed transaction information

  • Crash dump of the operating system kernel (or the equivalent) to aid in troubleshooting
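
The sketch below shows the kind of audit record the list describes, using Python's standard logging module to capture security validations and violations alongside detailed error information. The field names, user names, and log path are illustrative assumptions.

```python
import logging

audit = logging.getLogger("audit")
handler = logging.FileHandler("audit.log")
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s user=%(user)s event=%(event)s detail=%(message)s"))
audit.addHandler(handler)
audit.setLevel(logging.INFO)

# Security validations and violations carry who, what, and why, not just "failed".
audit.info("logon succeeded", extra={"user": "jsmith", "event": "logon"})
audit.warning("access denied to \\\\fileserver\\payroll",
              extra={"user": "jsmith", "event": "acl_violation"})

# Detailed error and transaction information: include enough context to reproduce.
try:
    raise ConnectionError("database server did not respond within 30s")
except ConnectionError:
    # .exception() records the full traceback along with the event fields.
    audit.exception("transaction aborted", extra={"user": "jsmith", "event": "db_timeout"})
```

Whatever the logging mechanism, the goal is the same: when a component crashes unexpectedly, the log should already contain the detail needed to reconstruct what happened.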

Cost-Effectiveness

An organization may decide it needs the latest, cutting-edge network technology to make its system "really fly." However, unless it has unusual or very business-specific needs for this technology, it may find that the added expense is not justified. Just a short while ago, Gigabit Ethernet switches were prohibitively expensive. Could every organization benefit from the extra speed? Possibly, but is the benefit worth the price? The average company that runs word processors and spreadsheets and accesses data from legacy databases would not realize the same benefit as a special-effects company that needs to move digital film files through the network. When comparing components that are similar in nearly every way, the decision may come down to answering the question, "Which gives the most bang for the buck?"
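
One way to put numbers on "bang for the buck" is to divide the price of a port by the throughput the organization will actually use, rather than the throughput printed on the box. The prices and utilization figures below are made-up examples.

```python
def cost_per_mbps(price_per_port: float, port_speed_mbps: float,
                  expected_utilization: float) -> float:
    """Price divided by the throughput the workload is expected to use."""
    return price_per_port / (port_speed_mbps * expected_utilization)

fast_ethernet = cost_per_mbps(price_per_port=50, port_speed_mbps=100,
                              expected_utilization=0.30)
gigabit = cost_per_mbps(price_per_port=300, port_speed_mbps=1000,
                        expected_utilization=0.05)
print(f"Fast Ethernet: ${fast_ethernet:.2f} per Mbps used, "
      f"Gigabit: ${gigabit:.2f} per Mbps used")
```

For the average office workload in the example, the slower, cheaper port delivers more usable throughput per dollar; the faster technology only wins when the workload can actually fill it.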



