Clustering and HA Performance Tuning


If you have gotten this far after configuring and testing your cluster, you'll want to know what you can do to improve your cluster's performance. A great deal of firewall performance tuning depends on how well you know the type of traffic that passes through the firewall, so that you can tune the firewall to handle the most common traffic more efficiently. In a clustering environment, these tuning considerations extend all the way down to the hardware, depending on the clustering solution you have implemented. In this section, we discuss the main considerations for optimizing your cluster solution.

Data Throughput or Large Number of Connections

Firewall load-sharing clustering solutions are very good at increasing the overall data throughput of your firewall; the higher the throughput you require, the more members you add to your cluster. However, you will soon reach a stage where adding more members makes no performance difference, because the bottleneck moves somewhere else on the data path: the line speed of the connecting equipment, cabling, or routers. Furthermore, a two-member load-sharing cluster of fast machines with fast network cards will probably scale better than a cluster with more members built from slower machines with slower network cards. The price you pay for better hardware is probably significantly lower than the price of an extra FireWall-1 enterprise module license. However, if you are looking for higher resilience, more members in the cluster might be the way to go.

For large numbers of connections, clustering is less help than you might think. Suppose you have a two-member cluster with 50,000 connections going through each member. If one member fails, 100,000 connections will go through the surviving member. Moreover, even while both members are online, each member's connections table holds all 100,000 connections (assuming that connections are synchronized between members). The point here is that the more connections you push through the cluster, the larger the connections tables become, because every member should have its connections table synchronized with every other member in case of a member failure. A high rate of new connections makes the situation worse, because it strongly increases the amount of data synchronized between cluster members. Identify services with high rates of new connections wherever possible, because these are the prime candidates for not synchronizing across the cluster members.

Based on these two definitions of load on a firewall cluster, let's look at each type of load and what can be done to improve performance.

Improving Data Throughput

Improving data throughput is probably the easier of the two performance areas to address. It can be addressed in the following ways:

  • Use good fast networking cards—100Mbps Ethernet full duplex or gigabit Ethernet cards—in the cluster members. Make sure that surrounding hubs and routers from the origin of the data through to the destination of the data have fast physical networking hardware. These are the key areas that will give you high throughput.

  • Use fast single-processor members in the cluster, with lots of memory.

  • Use a load-sharing cluster as opposed to an HA cluster. Traffic can be shared across the members in the cluster, which will give higher data rates of throughput.

  • Keep your rule base short and compact. Larger numbers of rules will slow throughput. This applies to NAT rules and the security rule base.

You need good networking cards, and your hubs and routers—all the way from data source through the cluster to the data destination—need to be as good as you can get. This will define your maximum throughput, and it is this line speed that you will aim for.

Using fast single-processor members and plenty of memory is good practice. It enables the member in the cluster to deal with highly processor-intensive services, such as VPN connections, as quickly as possible. Different members in the load-sharing cluster will take different VPN connections between the cluster and the remote sites, so this means that one member will not be dealing with all the VPN traffic. If you just have one VPN set up between the cluster and the remote site, only one member in the cluster will take the load. If you have several VPNs set up, multiple members in the cluster will be dealing with the VPN connections. This will be based on the load-sharing algorithm used.

In addition, if you are using the security servers to handle traffic such as FTP, HTTP, or Telnet, this load is shared across the cluster as well, which helps because the security servers can also be CPU intensive. If you are using security servers, make sure that the DNS resolver on each member of the cluster points at a high-speed DNS server or servers (preferably with a very rich cache) so that DNS lookups do not hold up performance.
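As a quick sanity check, you can confirm where each member sends its DNS queries and how quickly they come back. This is a minimal sketch for a Unix-like member operating system; the hostname looked up is purely illustrative:

    # Confirm which DNS servers this member resolves against
    cat /etc/resolv.conf

    # Time a lookup; slow responses here will stall security server traffic
    time nslookup www.example.com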

Lots of memory will prevent your host from writing too much to swap space, although some operating systems use their swap space regardless of how much physical memory you install.
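If you want to check whether a member is actually swapping, most Unix-like operating systems ship with vmstat. This is a hedged sketch; the column names vary between operating systems (si/so on Linux, sr on Solaris):

    # Sample memory and swap activity every 5 seconds; sustained nonzero
    # swap-in/swap-out figures suggest the member needs more physical memory
    vmstat 5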

If you are going for high throughput, you have to use a load-sharing clustering solution. This gives you scalability and big benefits for VPN and security server connections, as well as for normal connections.

You can do many things with rule base tuning that will make a big difference to the throughput of a member. Tuning the rule base will also improve connections-based performance. The types of things you need to do to make a rule base more efficient are as follows:

  • Reduce the number of rules to a minimum.

  • Try not to have rules that use group objects as their source or destination, because these will multiply out into individual rules when the policy is compiled. Instead, use network objects subnetted appropriately.

  • Do not use group objects nested inside one another. Again, this causes the compiled rule base to have a large number of rules in it.

  • Reduce the number of NAT rules to a minimum.

  • Reduce the number of objects you reference in the rule base.

  • Don't use resource rules or user authentication unless you need to. The throughput of the security servers is not as fast as a straight stateful connection through the FireWall-1 kernel.

  • Place the most commonly accessed rules as close to the top of the rule base as you can get away with.

  • Avoid using domain objects.

  • Keep logging to a minimum on rules.

Tuning VPNs for throughput is a special case. You can always increase the overall performance of a VPN by making the member do less work to encrypt and decrypt packets, but this usually comes at the price of security. For example, using weaker encryption strengths reduces the security of encrypted packets, but it means that the firewall members have to do less work. Perfect forward secrecy also carries a significant performance overhead, but turning it off reduces security.

If no compromise of security versus throughput is possible, you have two other options open to you. One is to use the Check Point Performance Pack, which will give you VPN acceleration. The other possibility is to use a hardware accelerator in each member of the cluster, which will aid DES and 3DES calculations for VPNs.

To summarize, anything that you can do on a single firewall member to improve performance is also true of a FireWall-1 member in a clustered environment.

Improving for a Large Number of Connections

In many ways, improving for a large number of connections requires more thought than tweaking your cluster for maximum data throughput, because it is less dependent on hardware. The first thing to be aware of is that a high rate of new connections will reduce the performance of a cluster handling a large number of connections. Services with a very high connection rate are good candidates for not being synchronized between cluster members. On clusters, you need to reduce both the number of connections in the connections state table and the number of connections that are synchronized statefully.

For example, DNS lookups through a member happen very often, especially from HTTP clients, FTP clients, and the FireWall-1 management server itself if logging has been set to resolve host names. These are small packets that are usually answered very quickly, and most DNS resolvers are quite patient about waiting for a response.

DNS is a classic service for which you would turn off state table sync. It is a very transient UDP-based service, so synchronizing the state makes little sense. By default, the service is synchronized across the cluster members.

To do this, start the SmartDashboard GUI, log in, click Manage | Services, and select the service domain-udp, as shown in Figure 21.88. Click the Edit button, and then click the Advanced button. Uncheck the Synchronize on cluster check box, and then click OK and install the policy.


Figure 21.88: Turning Off State Synchronization for a Specific Service
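After the policy install, a hedged way to verify the change is to generate a DNS query through the cluster and then check the connections table on the member that did not handle the query; with synchronization turned off, the entry should no longer appear there. The grep pattern below is an assumption about how the service name is rendered in the table output; adjust it to match what you actually see:

    # On the member that did NOT handle the query, look for synchronized
    # DNS entries; with sync turned off, none should appear
    fw tab -t connections -f | grep domain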

There are a large number of services to which you might want to do this. The more you reduce the state synchronization required, the better your members in your cluster will perform for connections.

The other weapon you have for reducing the number of connections in the state table is reducing the virtual session timeout for each service. This especially applies to UDP services, but it can also apply to many TCP-based services, such as HTTP.

Most HTTP sessions are short and transient, so unless you are hosting a Web site where it is vital that each HTTP session opened is longer than 3600 seconds (or 1 hour), it is a good idea to reduce this in the service itself. This means that if the session did not finish normally, the timeout will clear more quickly than the default of one hour. You can do this by clicking Virtual Session Timeout in the Advanced area of each service definition, as shown in Figure 21.89.

Figure 21.89: Advanced Settings of the DNS UDP Service
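Once the policy is installed, you can confirm that the new timeout is in force by opening a session of that service through the member and inspecting its entry in the connections table (the Timeout: and Expires: fields, which are covered in more detail later in this section). The client address here is purely illustrative:

    # Open an HTTP session through the member, then look at its table entry;
    # Timeout: should reflect the reduced value rather than the 3600 default
    fw tab -t connections -f | grep 192.168.1.100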

Once you have done as much as you can to reduce the number of connections that each member will hold, and you have reduced the number of connections that are synchronized across the cluster, you need to tune each member in the cluster to accept more than the default limit of 25,000 connections, and tune the kernel memory and NAT table sizes to cater for the increase in connections.

Prior to FireWall-1 NG FP3, this was a manual process of hacking text files, but now it can all be done from the SmartDashboard GUI. Navigate to the Manage menu, choose Network Objects, locate the Cluster Gateway object of your cluster, and click Edit. On the left side of the pop-up window, select Capacity Optimization.

From Figure 21.90, you can see that you can modify all the parameters mentioned earlier. The automatic setting for memory pool size and connection hash table size is usually fine, but you might want to monitor these parameters (which we discuss next). If you need to manually tweak the hash table size and the memory pool size, you can also do this from this screen. Note that changes to the connections table size take effect only after a policy install.

Figure 21.90: Configuring Capacity Optimization of Your Cluster

You'll want to monitor the connections table sizes, the memory pool size, and the table hash sizes. How can you do this? The best way is to get a console connection to one of your modules and run the diagnostic commands to reveal this information.

Monitoring the Connections Table

The first thing you will want to do is examine the connections table of a module to determine the current maximum limit for the number of connections. This can be done with the fw tab -t connections command from one of the firewall modules in the cluster.

At the top of this command's output are the parameters of this table, which you need to take note of, including the limit parameter (the maximum number of connections):

-------- connections --------
dynamic, id 8158, attributes: keep, sync, expires 60, refresh,
limit 25000, hashsize 32768,
kbuf 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30,
free function 707138a0 0

Raising the limit to 50,000 connections and rerunning the command shows the new table limit and a new hash size:

-------- connections --------
dynamic, id 8158, attributes: keep, sync, expires 60, refresh,
limit 50000, hashsize 262144,
kbuf 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30,
free function 707138a0 0

Note that when you change the connections table size, the SmartView Tracker logs will also show that the connections table limit, the connections table hash size, and the memory pool size have changed.

If you want to monitor the number of connections going through a member at any one time, use the command fw tab -t connections -s. This will give you the current number of connections in the table (the #VALS column) and the peak number of connections (the #PEAK column):

fw1 # fw tab -t connections -s
HOST        NAME          ID    #VALS  #PEAK  #SLINKS
localhost   connections   8158      5     20        8
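If you want to track these figures over time rather than taking one-off readings, a simple shell loop on the member will do. This is just a sketch, and the 60-second interval is an arbitrary choice:

    # Sample the connections table statistics once a minute
    while true; do
        date
        fw tab -t connections -s | grep connections
        sleep 60
    done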

You could get to the stage where you would like to identify a specific connection on a module and check that you can see that connection synchronized to another module in the cluster. To look at the connections table in a readable format, use the command fw tab -t connections -f:

10:49:12  192.168.11.131 > -----------------------------------
(+); Direction: 0; Source: 192.168.1.100; SPort: 4990; Dest: 192.168.1.130;
DPort: telnet; Protocol: tcp; CPTFMT_sep: ;; Type: 114689; Flags: 8405120;
Rule: 2; Timeout: 3600; Handler: 0; Uuid: 3e37b13c0c3a610837b6;
Ifncin: 4; Ifncout: 4; Ifnsin: -1; Ifnsout: -1; Bits: 0000000002000000;
NAT_VM_Dest: 192.168.1.131; NAT_VM_Flags: 100; NAT_Client_Dest: 192.168.1.130;
NAT_Client_Flags: 100; NAT_Server_Flags: 0; NAT_Xlate_Flags: 32836;
SeqVerifier_Kbuf_ID: 1076676608; Expires: 3495/3600; product: VPN-1 & FireWall-1;

10:49:12  192.168.11.131 > -----------------------------------
(+); Direction: 1; Source: 192.168.1.131; SPort: telnet; Dest: 192.168.1.100;
DPort: 4990; Protocol: tcp; CPTFMT_sep_1: ->; Direction_1: 0;
Source_1: 192.168.1.100; SPort_1: 4990; Dest_1: 192.168.1.130; DPort_1: telnet;
Protocol_1: tcp; FW_symval: 5; product: VPN-1 & FireWall-1;

Normally, the fw tab -t connections -f command shows all connections, but you can filter the output by piping it into the grep command (such as fw tab -t connections -f | grep telnet, which was done in the preceding example).

The connection we are interested in is the one with an Expires: parameter. This shows the TCP timeout of the connection, so it is a good way to prove that your changes to a service's virtual session timeout are working (see Figure 21.86). The other entry is the reply from the cluster IP address (the session was a Telnet from host 192.168.1.100 to the VIP address of 192.168.1.130).

The Telnet service is state synchronized, so we should see exactly the same connection in the connections table of fw2 in the cluster. State table synchronization sends updates to all members in the cluster at least every 100ms.
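A hedged way to confirm this on your own cluster is to capture the matching entries on both members and compare them. The member names and test host address below are just those from the example, and the temporary filenames are hypothetical:

    # On fw1, and again on fw2, extract the test connection's entry
    fw tab -t connections -f | grep 192.168.1.100 > /tmp/conns.txt

    # Copy one file across (or to a workstation) and compare; a synchronized
    # connection should appear on both members
    diff /tmp/conns.txt /tmp/conns-other-member.txt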

Monitoring Pool Memory

Pool memory is fairly easy to monitor in FireWall-1 NG FP3. You need to make sure that kernel memory for the firewall kernel is not exhausted, or you could end up with halloc memory allocation error messages in your operating system's logs. This can lead to the host becoming unresponsive and intermittently locking up, including locking up console access to the member.

You can monitor the kernel memory situation using the command fw ctl pstat on the firewall module:

fw2 # fw ctl pstat

Hash kernel memory (hmem) statistics:
  Total memory allocated: 20971520 bytes in 5118 4KB blocks using 2 pools
  Initial memory allocated: 6291456 bytes (Hash memory extended by 14680064 bytes)
  Memory allocation limit: 83886080 bytes using 10 pools
  Total memory bytes  used:   348308  unused: 20623212 (98.34%)  peak: 369584
  Total memory blocks used:      114  unused:     5004 (97%)     peak: 126
  Allocations: 71973 alloc, 0 failed alloc, 66671 free

System kernel memory (smem) statistics:
  System  physical   memory: 255074304 bytes
  Available physical memory: 59908096 bytes
  Total memory bytes used: 31724112  peak: 31869120
  Blocking     memory bytes used:  1531912  peak:  1636904
  Non-Blocking memory bytes used: 30192200  peak: 30232216
  Allocations: 3645229 alloc, 0 failed alloc, 3644952 free, 0 failed free

Kernel memory (kmem) statistics:
  Total memory bytes used: 11088212  peak: 11826720
  Allocations: 81792 alloc, 0 failed alloc, 76215 free, 0 failed free

Kernel stacks:
  262144 bytes total, 16384 bytes stack size, 16 stacks,
  2 peak used, 4124 max stack bytes used, 1028 min stack bytes used,
  0 failed stack calls

INSPECT:
  13746 packets, 2698521 operations, 43174 lookups,
  0 record, 702731 extract

Cookies:
  2309961 total, 0 alloc, 0 free,
  21 dup, 863658 get, 1243 put,
  1458553 len, 0 cached len, 0 chain alloc,
  0 chain free

Connections:
  4019 total, 436 TCP, 3381 UDP, 201 ICMP,
  1 other, 5 anticipated, 7 recovered, 10 concurrent,
  26 peak concurrent, 861843 lookups

Fragments:
  0 fragments, 0 packets, 0 expired, 0 short,
  0 large, 0 duplicates, 0 failures

NAT:
  215/0 forw, 1021/0 bckw, 1214 tcpudp,
  22 icmp, 1268-1410 alloc

sync new ver working
sync out: on  sync in: on
sync packets sent:
  total: 9302  retransmitted: 0  retrans reqs: 0  acks: 49
sync packets received:
  total 4911 of which 0 queued and 0 dropped by net
  also received 0 retrans reqs and 38 acks to 17 cb requests
callback average delay 1 max delay 6

The kernel memory figures to keep an eye on are the total memory bytes used, unused, and the peak usage. The peak figure tells you whether there has been a kernel memory shortfall in the past. A nonzero count in the failed alloc field of the hash kernel memory or system kernel memory statistics indicates a memory allocation problem under connection load.
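Rather than reading the full output every time, you can pull out just the allocation counters. A minimal sketch:

    # A nonzero "failed alloc" count is the figure that warrants attention
    fw ctl pstat | grep -i "failed alloc"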

The fw ctl pstat output also gives you connections statistics, fragment statistics, and NAT statistics, as well as the state synchronization statistics.

Final Tweaks to Get the Last Drop of Performance

We have by no means covered everything you can do to the members in your cluster to maximize their performance. One particular area of note is optimizing the operating system that the members use. How far you can take this varies considerably from one operating system to another, but it is well worth doing.
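As a starting point on a Unix-like member, it is worth seeing what else is competing for CPU and listening on the network; anything not needed for the firewall is a candidate for removal. A minimal sketch:

    # Review running processes for unnecessary daemons
    ps -ef | more

    # Review listening services that could be disabled
    netstat -an | grep LISTEN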



