If there are server problems or failures, you will want to restart the server as quickly as possible, but you will also want to collect the necessary documentation in order to resolve the problem. Setting fault recovery and running Notes System Diagnostic (NSD) will help you to accomplish these objectives.
Notes System Diagnostic (NSD) is a diagnostic script that gathers information. It is used to troubleshoot problems and verify that the server is configured correctly. NSD log files can be used as a tool to determine the cause of a server crash. NSD and memcheck are now bundled with the core Lotus Domino 6 product, and NSD is now the default debugger. The output of the NSD tool can be sent to Lotus Support to help diagnose server problems.
NSD output is in plain text and can be viewed with any text file viewer. It contains some basic configuration information and current processes running on the system (ps output). It also contains notes.ini and general system information (Linux version, local disks, and so on), as well as a memcheck portion.
A number of options can be used with the NSD tool, depending on the level of detail required. Following are some of the options available:
Displays the nsd help list
-info: Report system info
-memcheck: Run the Notes memory checker only
-kill: Stop all Notes processes and clean up related IPC resources
The command nsd -info will skip attaching to the processes with a debugger and obtaining a trace. This is useful when you are gathering only system information and do not need any process-level information for diagnosis.
Issue nsd -memcheck to run memcheck. Memcheck is a utility that obtains information on the current state of the Domino memory pools. It is installed by default in Domino 6 and is called by the NSD script, but it can also be run manually. For details about the memcheck command options, run memcheck -h
If Memcheck information is not needed, use the -nomemcheck option. This can reduce the total running time of the NSD script.
If you cannot shut down the server with a quit from the console, then nsd -kill should be run to ensure that the environment is clean for a server restart. Issuing nsd -kill will cancel all Notes processes and clean up IPC resources related to those processes.
The NSD report contains the following:
System info section: Linux version, swap info, local disks, VM stats
When you encounter a hang condition on your server, run NSD so that you can send the output to Lotus Support. You must be in the Domino data directory to run NSD, and you should run it as the notes account.
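The options described in this section can be wrapped in a small helper. The following Python sketch is illustrative only: the names `NSD_OPTIONS`, `build_nsd_command`, and `run_nsd` are hypothetical, the sketch assumes that `nsd` is on the PATH of the notes account, and it assumes that running `nsd` with no option performs a default run.

```python
import subprocess

# Options named in this section. The dictionary keys are hypothetical labels.
NSD_OPTIONS = {
    "default": [],                 # assumption: no option means a default run
    "info": ["-info"],             # system information only, no debugger attach
    "memcheck": ["-memcheck"],     # run the memory checker
    "nomemcheck": ["-nomemcheck"], # skip memcheck to shorten the run
    "kill": ["-kill"],             # stop Notes processes and clean up IPC resources
}

def build_nsd_command(mode="default"):
    """Return the argument list for the requested NSD mode."""
    if mode not in NSD_OPTIONS:
        raise ValueError(f"unknown NSD mode: {mode!r}")
    return ["nsd"] + NSD_OPTIONS[mode]

def run_nsd(data_dir, mode="default"):
    """Run NSD from the Domino data directory and return its exit code.

    The caller is expected to be logged in as the notes account.
    """
    result = subprocess.run(build_nsd_command(mode), cwd=data_dir, check=False)
    return result.returncode
```

For example, `run_nsd(data_dir, "info")` would gather system information only, where `data_dir` is the path to your own Domino data directory.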
The nsd file has information on the tasks which were running when the server crashed, as well as general system information. By default, nsd files are created in the IBM_TECHNICAL_SUPPORT directory located beneath the Notes/Domino data directory.
To change the directory where the nsd is created, set the following option in the notes.ini:
To add which programs nsd should attach to and kill, set the following option in the notes.ini:
We recommend that you enable fault recovery to automatically restart the server after a Domino server crash. The server will shut down, release all associated resources, and then restart automatically, without any administrator intervention. If you are using multiple partitions, only the partition which has the error is terminated and restarted. You can enable fault recovery in the server document under Basics tab - Fault Recovery, as shown in Figure 8-24.
Figure 8-24: Enabling fault recovery
Run This Script After Server Fault/Crash
The name of an optional script that runs after a crash and before any other cleanup takes place. Enter the complete path and script name, including file extension.
Run NSD To Collect Diagnostic Information
Specifies whether to run NSD.
Automatically Restart Server After Server Fault/Crash
Specifies whether the server automatically restarts following a crash.
Cleanup Script/NSD Maximum Execution Time
Specifies the time, in seconds, that the cleanup script is allowed to run. If the script does not complete within the specified interval, it is stopped. The default execution time is 300 seconds (5 minutes). The maximum is 1800 seconds.
Maximum Fault Limits
The number of times the server is allowed to restart during a specified time period, in minutes (for example, 3 crashes within 5 minutes). If the number of crashes exceeds the number of allowed restarts for the interval, the server exits without restarting.
Mail Fault Notification to
The name of a user or group that Domino sends mail to after fault recovery restarts the server.
In summary, because NSD is invoked automatically and collects the necessary problem determination (PD) information, it is not necessary to run an additional script. When automatic restart is enabled, the server terminates all tasks and restarts without administrator intervention.
The default time of 300 seconds may not be enough time for NSD to complete on a larger server, so consider increasing this value. It is also a good practice to specify a group that gets notified when the server crashes.
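The Maximum Fault Limits rule can be modeled in a few lines. The following Python sketch is an illustration of the rule as described in this section, not Domino's actual implementation; the function name `should_restart` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

def should_restart(crash_times, max_faults, window_minutes):
    """Decide whether the server would restart after the latest crash.

    crash_times: datetimes of recorded crashes, including the latest one.
    The server restarts unless the number of crashes inside the window
    ending at the latest crash exceeds max_faults.
    """
    if not crash_times:
        return True
    latest = max(crash_times)
    window_start = latest - timedelta(minutes=window_minutes)
    recent = [t for t in crash_times if t > window_start]
    return len(recent) <= max_faults

# Illustrative timestamps only: four crashes one minute apart.
base = datetime(2003, 8, 28, 10, 0, 0)
four_crashes = [base + timedelta(minutes=i) for i in range(4)]
print(should_restart(four_crashes, max_faults=3, window_minutes=5))  # False
```

With a limit of 3 crashes within 5 minutes, the fourth crash inside the window exceeds the limit, so the modeled server exits without restarting.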
Core dumps provide additional problem determination data to help resolve Domino problems. Enable core dumps (core.xxx) with the following notes.ini variable:
The default location for core dumps is the Domino notesdata directory. Because core dumps can be very large, it is recommended that they be directed to another directory with more space. This can be done with the following notes.ini variable:
You can put core dumps and NSDs in the same directory to better manage these files.
Every Domino server records information about server activities in the log database (log.nsf). This is helpful when you are doing problem determination on your system. Some of the information it includes:
Database usage by user
Mail routing, replication and other events
Usage of the system by user, including:
Number of documents read and written, by database
Amount of data transferred across the network
Number of transactions run
You can choose to collect replication and client session event records when you initially set up the server. You can change these settings later with the Log_Replication and Log_Sessions parameters in notes.ini.
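To check how these parameters are currently set, you can scan notes.ini for them. The following Python sketch is a minimal example under stated assumptions: the function name `read_ini_settings` is hypothetical, the sample values are purely illustrative, and parameter names are matched case-insensitively as a convenience.

```python
def read_ini_settings(text, names):
    """Return the values of the requested parameters found in notes.ini text."""
    wanted = {name.lower(): name for name in names}
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, section headers such as [Notes], and non-assignments.
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().lower()
        if key in wanted:
            found[wanted[key]] = value.strip()
    return found

# Illustrative content only; the values shown are not recommendations.
sample = "[Notes]\nLog_Replication=1\nLog_Sessions=0\n"
settings = read_ini_settings(sample, ["Log_Replication", "Log_Sessions"])
print(settings)  # {'Log_Replication': '1', 'Log_Sessions': '0'}
```

In practice you would pass the contents of the real file, for example by reading notes.ini from the Domino data directory.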
Let us assume there is a problem with a backlog of mail on the mail server, DomServA/ITSO. The Domino Administrator would issue tell router show from the console. We show the response from this command, divided into three sections, and include an explanation on how to interpret the output of each.
The first section of the output of tell router show is shown in Example 8-5.
Example 8-5: tell router show command -partial output
Msgs  State      Via   Destination
2     Retry(16)  NRPC  [$LocalDelivery] mail/cb123lmt
      Last error: File does not exist
      Next retry: 08/28/2003 10:39:04
This example provides details on messages that are pending Local Delivery using NRPC. These messages are currently showing a RETRY message state, which means the messages cannot be delivered. Note the next line: File does not exist. This could indicate that the file is not physically there, or that there may be a problem with that file.
The [$LocalDelivery] destination indicates that these messages are for a local user on this server. The protocol being used to deliver them is NRPC, and there are two messages in Retry state attempting to route with NRPC to [$LocalDelivery]. The number immediately following the state, as in Retry(16), indicates the number of threads available for delivering these messages. If messages are backed up and cannot be delivered locally, the Domino administrator can ask the Linux administrator to check iostat for the response time of the DASD, which may reveal a bottleneck.
The second section of the output of the tell router show is shown in Example 8-6.
Example 8-6: Continuation of tell router show command
0     Retry(21)  NRPC  CN=DOMSERVB/O=ITSO (Pull/Push)
      Last error: The server is not responding. The server may be down or you may be experiencing network problems. Contact your system administrator if this problem persists.
      Next retry: 08/28/2003 11:04:40
0     Wait       NRPC  CN=DOMSERVC/O=ITSO (Pull/Push)
0     Retry(21)  NRPC  CN=DOMSERVD/O=ITSO (Pull/Push)
      Last error: The server is not responding. The server may be down or you may be experiencing network problems. Contact your system administrator if this problem persists.
      Next retry: 08/28/2003 10:44:22
Immediately following the message status information, you will often see the reason why messages are in the queue. If a message is in Retry state, you will see the Next retry date and time along with the Last error message. In this example, the reason is that the destination server is not responding. This type of event would alert the Domino administrator to check with the Linux administrator to see whether a network problem exists, or to find out what is happening on the server that is not responding.
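When the queue is long, it can help to extract the destinations in Retry state programmatically. The following Python sketch assumes the console prints one queue entry per line, with Last error and Next retry on their own lines; that layout is inferred from the examples in this section, and the names `parse_router_queue` and `stuck_destinations` are hypothetical.

```python
import re

# One queue entry: message count, state with optional number, protocol, destination.
ENTRY = re.compile(r"^\s*(\d+)\s+(\w+)(?:\((\d+)\))?\s+(\S+)\s+(.*)$")

def parse_router_queue(text):
    """Parse queue entries from captured 'tell router show' output."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        match = ENTRY.match(line)
        if match:
            entries.append({
                "msgs": int(match.group(1)),
                "state": match.group(2),
                "state_number": int(match.group(3)) if match.group(3) else None,
                "via": match.group(4),
                "destination": match.group(5).strip(),
                "last_error": None,
                "next_retry": None,
            })
        elif entries and line.startswith("Last error:"):
            entries[-1]["last_error"] = line[len("Last error:"):].strip()
        elif entries and line.startswith("Next retry:"):
            entries[-1]["next_retry"] = line[len("Next retry:"):].strip()
    return entries

def stuck_destinations(entries):
    """Return the destinations of all entries currently in Retry state."""
    return [e["destination"] for e in entries if e["state"] == "Retry"]

# Sample assembled from the examples in this section.
SAMPLE = """\
Msgs  State      Via   Destination
2     Retry(16)  NRPC  [$LocalDelivery] mail/cb123lmt
      Last error: File does not exist
      Next retry: 08/28/2003 10:39:04
0     Wait       NRPC  CN=DOMSERVC/O=ITSO (Pull/Push)
"""
print(stuck_destinations(parse_router_queue(SAMPLE)))
```

The header line and the error lines do not start with a digit, so only real queue entries match the pattern.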
The third section of the output of tell router show is shown in Example 8-7.
Example 8-7: Final extract - tell router show
Transfer Threads: Max = 4; Total = 3; Inactive = 0; Max Concurrent = 2
Delivery Threads: Max = 4; Total = 1; Inactive = 0
The maximum number of transfer threads is currently 4, and 3 of them have already been spawned. One more will be started if additional threads are required. The maximum number of concurrent transfer threads is 2, which is half the configured maximum number of transfer threads.
The router sets a default maximum number of transfer and delivery threads based on server memory. So here, if mail does appear slow, the statistics can aid in determining if the memory is sufficient to sustain the mail volume.
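The thread statistics are easy to turn into numbers for monitoring. The following Python sketch assumes each statistics line is printed on its own line in the format shown in Example 8-7; the function names `parse_thread_stats` and `spawn_headroom` are hypothetical.

```python
import re

def parse_thread_stats(text):
    """Parse 'Transfer Threads' and 'Delivery Threads' lines into dictionaries."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or "Threads" not in name:
            continue
        pairs = re.findall(r"([A-Za-z ]+?)\s*=\s*(\d+)", rest)
        stats[name.strip()] = {key.strip(): int(value) for key, value in pairs}
    return stats

def spawn_headroom(values):
    """Number of threads that can still be started before Max is reached."""
    return values["Max"] - values["Total"]

sample = (
    "Transfer Threads: Max = 4; Total = 3; Inactive = 0; Max Concurrent = 2\n"
    "Delivery Threads: Max = 4; Total = 1; Inactive = 0\n"
)
stats = parse_thread_stats(sample)
print(spawn_headroom(stats["Transfer Threads"]))  # 1
print(spawn_headroom(stats["Delivery Threads"]))  # 3
```

A headroom of 0 on a server with a growing queue would suggest that the thread maximum, and therefore possibly server memory, is the limiting factor.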