Scaling Issues

Some of the most commonly asked questions about IOS Mobile IP relate directly to the scalability of Mobility Agents. As luck would have it, these questions are also the most difficult to answer. Seemingly simple questions such as "How many users can platform X support?" cannot be answered until numerous other questions are asked and answered. Scalability of Mobility Agents is best thought of like a balloon, as shown in Figure 8-7. An uninflated balloon can be stretched far in any one direction, but after you inflate the balloon, it can no longer stretch as far. The same goes for Mobility Agents. For example, you can max out the number of bindings on a Mobility Agent, or you can max out traffic forwarding, but when you put everything together, each piece impacts the others, and you never see the isolated maximums. Mobility Agent scaling in IOS depends on the following four main factors:

Number of Mobile Nodes
Frequency of movement of the Mobile Nodes
Number of tunnels
Amount of traffic that is being forwarded

Figure 8-7. Scalability of Balloons and Home Agents

Building a Call Model

The best approach to addressing scalability concerns is to build a call model and then apply it in lab testing. Call model is a term borrowed from the voice world that is similar to a traffic model, but more encompassing. A traffic model would simply identify how many bits each user would pass across the network in a given period of time. Although this is an important factor in the call model, you must also take other factors into account.

Call models are generally built toward the busy hour. The idea is that humans are creatures of habit, and in many cases, our habits are similar to those around us. The hour when all users are at their most active is referred to as the busy hour. From a network engineering perspective, if enough capacity exists in the busy hour, enough capacity is available all day long. A one-hour window is chosen because with a large number of users, it provides a good tradeoff between instantaneous and average use.

NOTE

The tradeoff between instantaneous and average use is a common problem in systems analysis. It is often seen in link utilization. For example, consider quantifying the percentage utilization of a specific link between two routers. If the measurement interval is instantaneous, the link is either in use (100%) or not in use (0%). These are not useful numbers. Because link speed is expressed as a quantity (bits) over time, the ideal measurement is to count bits that are sent in a specific interval and compare that to the maximum.

The problem is that if the interval is too long, information is lost in the average. However, as the interval gets shorter, the measurement becomes more binary, which is also difficult to interpret. Looking at a link over a 24-hour interval clearly highlights the problem: a link that is full for the entire eight-hour business day, but otherwise unutilized, would show up as 33% utilization. No hard-and-fast rule exists for selecting intervals.
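The effect of the measurement interval can be checked with a few lines of arithmetic. The following Python sketch assumes a hypothetical 1-Mbps link and hourly bit counts (neither comes from a real measurement) and compares the 24-hour average with the busiest one-hour interval.

```python
def utilization(bits_sent, link_bps, interval_s):
    """Fraction of link capacity used over one measurement interval."""
    return bits_sent / (link_bps * interval_s)

LINK_BPS = 1_000_000  # hypothetical 1-Mbps link
HOUR_S = 3600

# Hypothetical hourly bit counts for one day: the link is full for the
# eight business hours (09:00 to 17:00) and idle for the rest of the day.
hourly_bits = [LINK_BPS * HOUR_S if 9 <= hour < 17 else 0 for hour in range(24)]

# One long interval hides the busy period in the average.
daily = utilization(sum(hourly_bits), LINK_BPS, 24 * HOUR_S)

# One-hour intervals expose the busy hour.
busy_hour = max(utilization(bits, LINK_BPS, HOUR_S) for bits in hourly_bits)

print(f"24-hour average: {daily:.0%}")
print(f"Busy hour:       {busy_hour:.0%}")
```

The daily figure suggests plenty of headroom, whereas the one-hour figures show that the link has none during the business day.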

A busy-hour profile for one or more classes of users needs to encompass all four factors, as discussed in the sections that follow. You can generally classify users based on the application. For example, a mobile sales force and a service team likely have different busy-hour usage profiles, but within each class, the profile is similar. After the classes are established and the profiles are defined, the two can be merged based on the quantity of users in each profile.

Number of Nodes

The most frequently asked scalability question is "How many Mobile Nodes are supported by a specific platform?" The answer is a foundation for scalability, not the entire picture. This number is not only the root of the call model development discussed in the sections that follow, but also a scaling factor on its own. From a simple "maximums" perspective, this number is typically limited by the amount of available random-access memory (RAM) on the agent platform. Each Mobile IP visitor entry on the Foreign Agent (FA) and each mobility binding on the Home Agent requires a fixed amount of memory on the agent. Unfortunately, no single number quantifies memory usage; it varies with each release and depends on which features are used. Although memory is the resource most affected by the number of Mobile Nodes, it isn't the only one. The number of nodes also affects routing table lookup speed and other specific services. The more services in use, the more the total number of nodes affects the performance of the agent. It might be hard to imagine, but Home Agents have some of the largest routing tables because of their use of host routes.

Frequency of Mobility

Processing each Registration Request (RRQ) also requires a specific amount of resources, this time from the CPU. Routers, on which Mobility Agents run, are designed to get data traffic in and out of the box as fast as possible. Control-plane operations are not as speed critical. Thus, whereas custom hardware is put in the forwarding path, routing protocols are not hardware assisted. Mobile IP, although designed to handle routing updates extremely efficiently, is still limited by the CPU's capabilities.

The frequency of routing updates, or RRQs as they are specifically called in Mobile IP, is determined by several factors. Each Mobile Node sends an RRQ when it powers on and possibly when it powers off. It also sends an RRQ when the lifetime expires and each time it changes its access link. In most networks, the power-on and power-off registrations have negligible impact. Access-link changes are the hardest to quantify, especially when the roaming network is not completely known, as is the case with public network roaming. They are easiest to quantify with data from a real-world trial, but they can be derived if physical movement habits can be correlated to network deployment.

Mobility binding lifetime is a controllable factor, but one that is often difficult to set. What lifetime should be used? Generally, the longer the better, but failures must be detected and resources reclaimed in a timely fashion. One approach is to set the lifetime in relation to roam frequency. If roaming is frequent, the lifetime should be greater than the average time between roams plus one standard deviation. If the time between roams is roughly normally distributed, about 84% of the reregistrations then occur because of roaming and not lifetime expiration.
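As a rough illustration of relating lifetime to roam frequency, the following Python sketch derives a candidate lifetime from a set of measured intervals between roams and then checks what share of those intervals would end in a roam before the lifetime expires. The sample intervals and the function names are hypothetical, and the sketch makes the simplifying assumption that every roam restarts the lifetime timer.

```python
import statistics

def suggest_lifetime(roam_intervals_s):
    """Candidate lifetime: mean time between roams plus one standard deviation."""
    mean = statistics.mean(roam_intervals_s)
    stdev = statistics.stdev(roam_intervals_s)  # sample standard deviation
    return mean + stdev

def roam_driven_fraction(roam_intervals_s, lifetime_s):
    """Share of intervals in which the node roams before the lifetime expires."""
    roamed_first = sum(1 for t in roam_intervals_s if t < lifetime_s)
    return roamed_first / len(roam_intervals_s)

# Hypothetical seconds between access-link changes, as might be
# collected in a real-world trial.
samples = [600, 900, 1200, 1500, 1800, 2400, 3000, 3600]

lifetime = suggest_lifetime(samples)
fraction = roam_driven_fraction(samples, lifetime)

print(f"Suggested lifetime: {lifetime:.0f} s")
print(f"Reregistrations triggered by roaming: {fraction:.0%}")
```

The resulting share depends on the actual distribution of roam intervals, which is one more reason to collect trial data rather than rely on a rule of thumb.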

Amount of Data Traffic

The amount of data traffic the agent must forward is also an important factor, but not as important as in other routing applications. In a mobile environment, data traffic is often limited by the capabilities of the end device and the speed of the access links. Forwarding rate is becoming more of a concern as mobile solutions mature, but with the slow access networks of the past, it has not been a limiting factor. In terms of a call model, express this factor as the number of packets or bytes in the busy hour.
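A busy-hour byte count converts directly into an average forwarding rate, which is often the easier number to compare against platform figures. The following Python sketch uses the per-user volume and user count from the example call model later in this section (550 KB and 200 users); the function name is hypothetical, and 1 KB is taken as 1024 bytes.

```python
BUSY_HOUR_S = 3600

def busy_hour_avg_bps(total_bytes):
    """Average bit rate implied by a byte count spread over one busy hour."""
    return total_bytes * 8 / BUSY_HOUR_S

per_user_bytes = 550 * 1024  # 550 KB per user in the busy hour
users = 200

avg_bps = busy_hour_avg_bps(per_user_bytes * users)
print(f"Average busy-hour rate: {avg_bps / 1000:.0f} kbps")
```

This is an average over the hour; short bursts within the hour can be considerably higher, which is the same interval tradeoff described earlier.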

For devices in which forwarding occurs in software, tunnel encapsulation adds a significant amount of overhead compared to forwarding packets without encapsulation. Newer platforms support encapsulation in hardware, but be sure to verify that the tunnel type in use is explicitly supported.

Number of Tunnels

The final scaling factor, the number of tunnels used, might seem odd, but given that users sharing the same Care-of Address (CoA) share the same tunnel, it can have an impact on rollout. Each tunnel that the router builds contributes to the amount of memory used and the amount of processing required. The maximum number of tunnels supported in IOS is limited per platform and varies from release to release. Each tunnel requires the use of an interface description block (IDB), which is a key component of IOS. The maximum number of IDBs supported on a specific platform is documented on Cisco.com. Use care when Colocated Care-of Address (CCoA) mode is used because each Mobile Node requires its own tunnel. Also, remember that a Mobile Router uses two tunnels when attached to an FA. The number of free IDBs on the Home Agent typically limits the number of mobility bindings.
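The tunnel-sharing rules described above can be turned into a rough count of the tunnels (and therefore IDBs) that a Home Agent would need for a given set of bindings. The following Python sketch is illustrative only: the Binding structure, the addresses, and the counting function are hypothetical, and it assumes that a Mobile Router attached to an FA shares the FA tunnel and adds one tunnel of its own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binding:
    home_address: str
    care_of_address: str
    colocated: bool = False         # True when the node registers with a CCoA
    is_mobile_router: bool = False

def count_tunnels(bindings):
    """Estimate the number of tunnels the Home Agent needs for these bindings."""
    # One tunnel per distinct care-of address: nodes behind the same FA share
    # a tunnel, whereas each CCoA is unique to a single node.
    per_coa = {b.care_of_address for b in bindings}
    # Assumption: a Mobile Router attached to an FA needs one additional tunnel.
    extra = sum(1 for b in bindings if b.is_mobile_router and not b.colocated)
    return len(per_coa) + extra

bindings = [
    Binding("10.0.0.1", "192.0.2.1"),                         # behind FA
    Binding("10.0.0.2", "192.0.2.1"),                         # same FA, same tunnel
    Binding("10.0.0.3", "198.51.100.7", colocated=True),      # CCoA, own tunnel
    Binding("10.0.0.4", "198.51.100.8", colocated=True),      # CCoA, own tunnel
    Binding("10.0.0.5", "192.0.2.1", is_mobile_router=True),  # Mobile Router behind FA
]

tunnels = count_tunnels(bindings)
print(f"Tunnels (IDBs) required: {tunnels}")
```

Comparing this count against the number of free IDBs documented for the platform gives an early warning when a CCoA-heavy deployment is likely to exhaust them.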

NOTE

According to Cisco.com, an IDB is a special control structure internal to the Cisco IOS software that contains such information as the IP address, interface state, and packet statistics. Cisco IOS software maintains one IDB for each interface present on a platform and one IDB for each subinterface (http://www.cisco.com/en/US/partner/products/sw/iosswrel/ps1835/products_tech_note09186a0080094322.shtml).

An Example Call Model

Consider the two user classes identified previously: a mobile sales force and a mobile service team. Table 8-1 shows the profile for each class and indicates how the profiles are combined into a call model. The number of sales and service workers is equal, so each profile is weighted at 50% of the total call model.

Table 8-1. Example Call Model
	Sales Force	Service Team	Per User (Weighted)	Busy-Hour Total
Number of Users	100	100		200
Registrations	2	1	(2 * 50%) + (1 * 50%) = 1.5	300
Data Traffic	200 KB	900 KB	(200 * 50%) + (900 * 50%) = 550 KB	107 MB
Tunnels	1	1	(1 * 50%) + (1 * 50%) = 1	200

Next, the mobility and traffic profiles are quantified. Each member of the sales team visits eight to ten customers a day. At each customer site, the salesperson uses either public wireless local-area network (LAN) or public wireless wide-area network (WAN) connectivity to query the order-entry system. Enterprise applications such as e-mail and web browsing can also be used. These can all be quantified in terms of bandwidth and total data utilization.

Each service worker visits four customers a day and uses public wireless LAN and WAN connectivity. The service applications include work-order management, streaming video for just-in-time training, and the download of highly detailed service manuals. Although the work-order management system is used often, it has very low data usage. Conversely, the video and service manuals use large quantities of data but are used less frequently.

Assigning specific numbers to these profiles requires estimation, but after the application is understood, this should be relatively easy. For this example, you can assume that each user performs two registrations per customer site, one when entering the site and one when leaving.
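The arithmetic behind Table 8-1 can be reproduced with a short script, which makes it easy to try different class sizes or profiles. The following Python sketch uses the values from the table; the Profile structure and function name are hypothetical, the weights are derived from the number of users in each class, and 1 MB is taken as 1024 KB.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    users: int
    registrations: float  # per user in the busy hour
    data_kb: float        # per user in the busy hour
    tunnels: float        # per user

FACTORS = ("registrations", "data_kb", "tunnels")

def build_call_model(profiles):
    """Weight each profile by its share of users, then scale to busy-hour totals."""
    total_users = sum(p.users for p in profiles)
    per_user = {
        f: sum(getattr(p, f) * p.users / total_users for p in profiles)
        for f in FACTORS
    }
    totals = {f: value * total_users for f, value in per_user.items()}
    return total_users, per_user, totals

sales = Profile(users=100, registrations=2, data_kb=200, tunnels=1)
service = Profile(users=100, registrations=1, data_kb=900, tunnels=1)

total_users, per_user, totals = build_call_model([sales, service])

print(f"Users:         {total_users}")
print(f"Registrations: {per_user['registrations']} per user, {totals['registrations']:.0f} total")
print(f"Data traffic:  {per_user['data_kb']:.0f} KB per user, {totals['data_kb'] / 1024:.0f} MB total")
print(f"Tunnels:       {per_user['tunnels']:.0f} per user, {totals['tunnels']:.0f} total")
```

Changing the user counts to an uneven split, such as 150 sales and 50 service workers, changes the weights automatically, which is the merge step described for combining classes.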

This is an example call model that cannot be generalized to all applications. For example, the registration rate in this example is 1.5 registrations per user in the busy hour. If the example involved delivery people in a campus environment using only wireless LAN, the registration rate might be significantly higher. If a Layer 2 (and Layer 3) handover occurs on every floor of every building and the delivery person visits 12 floors an hour, the registration rate would be 12 events in the busy hour. This makes registration rate a much larger factor in the call model.