



The Natural Progression of Troubleshooting

When troubleshooting, always start with the most obvious possibilities, such as a bad network cable, a stopped daemon, or a filesystem problem (unless, of course, you have already ruled out the obvious during your identification process). The most obvious place to start is usually also the least time-consuming, which matters especially if you have service level agreements with your customers. Plan your troubleshooting strategy, then work that plan through to completion.

Because it seemed appropriate, we started by looking at the error logs to determine what was happening internally to the backup application. This is an important exercise. If you do not understand how to read the error logs for your backup application, we strongly suggest that you either take the time for some self-education or enroll in the next available training class, and be sure to inform the instructor that deciphering the error logs is one of your objectives.

Since our expertise comes from VERITAS NetBackup, we will continue to use it for our examples where necessary. As we look through the logs, we need to have an idea of what we are looking for. Remember, this is a cursory review of the most obvious place to start; therefore, we look for obvious errors, such as media errors from the bperror log or any physical errors logged against the drives themselves in /var/adm/messages.
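
For reference, here is a minimal sketch of the kind of cursory review we mean on a NetBackup master server running on Solaris. The bperror options and paths vary by NetBackup version, so verify them against the command's usage output before relying on this:

 # Summarize recent problem and media entries from the NetBackup error catalog.
 /usr/openv/netbackup/bin/admincmd/bperror -U -problems -hoursago 24
 /usr/openv/netbackup/bin/admincmd/bperror -U -media -hoursago 24

 # Scan the Solaris system log for tape driver (st) and SCSI errors.
 egrep -i 'st[0-9]|scsi|tape' /var/adm/messages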

Nothing glaring was uncovered during this discovery, so we completed this cursory review of the logs and began confirming the configuration. Since the problem the client described was not necessarily one pertaining to the configuration of the backup policies, we knew we could avoid that and focus squarely on the physical aspects of the configuration. NetBackup is actually composed of two components: NetBackup and Media Manager. The Media Manager component deals with physical devices and media, so this is a natural place to begin after reviewing the logs. The best way to display the configuration from the command line with NetBackup is to use the following:

 /usr/openv/volmgr/bin/tpconfig -d 

or

 /usr/openv/volmgr/bin/tpconfig -data 

This displays how the physical drives and robots are configured and is a good first step in your troubleshooting. Since this problem hinged on writing data to tape, it was reasonable to ensure that the correct device files were in use, and since this was a UNIX installation, we wanted to make sure the compression device files were being used. Once we confirmed that the drives were, in fact, configured properly, the next step, since this was a Sun server, was to check the st.conf file. The st.conf file, found in /kernel/drv, is the configuration file for all SCSI tape devices; it controls how each device file interacts with the physical drive, including how compression is handled.
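
To make that concrete, the sketch below shows what we are checking. The commands list the drive paths Media Manager is using and confirm that compression-capable device nodes exist; on Solaris, a density suffix of c (or u), as in /dev/rmt/0cbn, normally selects the compressed density.

 # List the configured drive paths, then confirm the compression
 # device nodes (the "c" density suffix) are present.
 /usr/openv/volmgr/bin/tpconfig -d
 ls -l /dev/rmt/*cbn

The st.conf entry below is illustrative only; the vendor/product strings, density codes, and option flags are placeholders following the documented format, and the correct values must come from the tape drive vendor.

 # Illustrative /kernel/drv/st.conf entry (placeholder values):
 #   <data-property> = <version>,<type>,<bsize>,<options>,
 #                     <no. of densities>,<density 0..3>,<default density>;
 tape-config-list =
     "QUANTUM DLT7000", "Quantum DLT7000", "DLT7k-data";
 DLT7k-data = 1,0x38,0,0xD639,4,0x82,0x83,0x84,0x85,2;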

When we reviewed the entries for the client's tape devices, we found that the compression settings were at their defaults: no compression. The entire file appeared to be at its defaults; none of the client's custom settings were present. At that point, we asked about the 'fairly static' state the server had supposedly been in for the past several months. It turned out that a Sun jumbo patch had been installed on this server about 45 days earlier. What the client didn't realize is that the jumbo patch overwrites the st.conf file during the patching process. The solution was to restore a copy of the st.conf file from before the patch and reboot the server. After a few weeks, it was apparent to the client that this had indeed been the problem.
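
In hindsight, a small safeguard protects against exactly this scenario: keep a copy of the tuned file, diff it after any patching, and restore it if the patch replaced it. The sketch below uses a reconfiguration reboot, one common way to make the st driver reread its configuration; the exact procedure can vary by Solaris release:

 # Before patching: preserve the tuned tape configuration.
 cp -p /kernel/drv/st.conf /kernel/drv/st.conf.pre-patch

 # After patching: did the patch replace the file?
 diff /kernel/drv/st.conf.pre-patch /kernel/drv/st.conf

 # If the custom entries are gone, restore them and perform a
 # reconfiguration reboot so the st driver rereads the file.
 cp -p /kernel/drv/st.conf.pre-patch /kernel/drv/st.conf
 touch /reconfigure && init 6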

When you are troubleshooting, you really take on the role of investigator, especially if you are not the primary administrator of the server. When you are asked to help out, you need to be prepared to ask questions that help you determine the root cause. We recommend starting as though you know nothing about the application, within reason, and asking questions. Here is a sample of the questions that were asked to begin troubleshooting the previous example:

  1. What exactly is (are) the problem(s) you are seeing?

  2. When did you notice the change or the errors?

  3. Have there been any changes to the main backup server? Media servers? Backup clients?

  4. What, if anything, have you done already to troubleshoot this problem?

  5. Do you have any site documentation to look at? Architecture topologies?

  6. What are your expectations once the problem has been rectified?

This should give you enough information to begin the troubleshooting.

Once you are done with your initial, cursory review of the site, you should have another set of questions based on what you find. It is always a good idea to document your plan in case the work takes longer than you anticipated, so you can hand that documentation to technical support or to a consultant you may bring in to assist in the troubleshooting phase. Another reason to document is that, should this problem recur, the document may allow another team member to resolve it much more quickly than before.

start sidebar
BACKUP OF 19 GB IN 10 TO 11 HOURS? SOMETHING'S WRONG

Another client example took 18 months to solve, mainly because the consultant on the project was called off to architect another backup solution, only to return to the original project to find that the problem had never been remedied. This didn't involve failed backup jobs, crashing servers, or anything else so drastic. It was a simple case of backup speed. The client assured the consultant that there was a full T3 (45 Mb/sec) between their buildings, so backups ought to run at blazing speeds. (We won't debate why they chose to back up across this T3 rather than put another backup server in the other facility. Let's just say budget was the primary reason.) So it was on the consultant's shoulders to make this work, and work well.

The numbers say that a full T3 should push 5.625 MB (that's megabytes) per second, 337.5 MB per minute, or roughly 20 GB per hour. The network the client had in place for the servers was 100BT, which should push 12.5 MB per second, 750 MB per minute, or 45 GB per hour. As you can see, it definitely looked like the T3 was going to be the bottleneck for our backup jobs. However, as we began testing and tweaking, we found that backups of the remote servers were taking exceptionally long. We made sure that the servers were configured properly; specifically, that their NICs were set to 100 FULL and not AUTO, since we know that some switches and a certain server manufacturer's onboard Ethernet interface do not play well together when set to AUTO-NEGOTIATE.
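
On the Solaris servers in this environment, one way to check and pin those NIC settings is with ndd. The sketch below assumes the hme driver; the parameter names differ for other drivers (qfe, eri, and so on), the settings do not persist across a reboot unless they are also placed in a startup script, and the switch port must be forced to match or you simply move the mismatch to the other end of the wire.

 # Verify the current link: link_speed 1 = 100 Mb/sec, 0 = 10 Mb/sec;
 # link_mode 1 = full duplex, 0 = half duplex.
 ndd -get /dev/hme link_speed
 ndd -get /dev/hme link_mode
 ndd -get /dev/hme adv_autoneg_cap

 # Force 100 FULL: advertise only 100fdx, then disable auto-negotiation.
 ndd -set /dev/hme adv_100fdx_cap 1
 ndd -set /dev/hme adv_100hdx_cap 0
 ndd -set /dev/hme adv_10fdx_cap 0
 ndd -set /dev/hme adv_10hdx_cap 0
 ndd -set /dev/hme adv_autoneg_cap 0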

All of the network issues were worked out on our end, and we asked the administrators at the other location to verify all of their network points as well, including server, switch, and router. They assured us everything was A-OK. After running the second test, we saw the same performance numbers. Having made sure all of our own network connection points were properly configured, we felt it was necessary to take the issue to the network group and have them look at the link between the buildings; maybe the backup T1 was actually primary and the T3 was secondary.

One thing we ought to mention, though, is that the IT team in the remote facility didn't get along very well with their counterparts in the primary facility. This is never a good thing. Teamwork and intercommunication are key to any successful IT organization.

After the network department assured us that the T3 was primary and wasn't even close to being fully utilized, we turned our attention back to the network configurations between the buildings. Naturally, the primary building, where our consultant was sitting, was reviewed a second time to make absolutely sure it was configured properly. The IT team at the remote site, however, didn't feel it was necessary to recheck their work, since they had already told us it was done.

About this time, our consultant was pulled off to work on the architecture job; he had told the customer that the problem appeared to be at the remote site, but without physical access it was difficult to prove. During our consultant's absence, the test backup policies went into production, and the backup speeds continued to be very poor. Unfortunately, the customer didn't pursue the speed issue either, mainly because of limited personnel resources.

Upon the consultant's return to complete the project, he found that the problem still existed; the client had simply grown accustomed to the performance and figured that was just the way it was going to be. Not our consultant, though; he insisted that the network group put a sniffer (network analyzer) on the wire to see exactly what was happening between the main backup server, the switch, the router, and the backup client. At 10:00 P.M. the network administrator found the problem. Everything was connecting perfectly at the primary site at 100 FULL DUPLEX, until the traffic crossed the T3. The consultant noted that the link dropped to 10 Mb/sec after the router, which would explain the performance numbers.
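
If you cannot get a dedicated analyzer right away, Solaris can give you a first look from the backup server itself with snoop. This is only a sketch, not what the network group used here; hme0 and the client name are placeholders for your own interface and host.

 # Capture a sample of traffic between the backup server and one
 # remote client for the network team to review.
 snoop -d hme0 -c 200 host remote-client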

With hard information in hand, we approached the remote site again and asked them to walk the wire, because somewhere on their end the link was dropping to 10 Mbps. Fortunately, we found an administrator at the remote site who was more than willing to work with us. He tracked the problem down to the switch, which was set not to 100 FULL but to AUTO-NEGOTIATE. Once that was changed, our backup speeds went from 19 GB in 10 to 11 hours to 19 GB in just a little over 90 minutes.
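
A quick back-of-the-envelope check, sketched below with bc, shows why those numbers hang together: the raw wire time for 19 GB at 10 Mb/sec is over four hours before any protocol or backup overhead, while the same 19 GB across an uncongested T3 fits comfortably within the roughly 90 minutes observed.

 # Raw wire time for 19 GB (decimal GB, as in the figures above),
 # ignoring protocol and backup overhead.
 echo "scale=1; 19*1000*8/10/3600" | bc    # at 10 Mb/sec: about 4.2 hours
 echo "scale=1; 19*1000*8/45/3600" | bc    # at 45 Mb/sec: about 0.9 hours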

Sometimes it is not the software configuration, hardware configuration, or the backup administration; it is just an oversight in the network architecture somewhere. This problem was different from the previous one in that the consultant was the de facto backup administrator for the group, so he didn't necessarily have to ask any questions, but he did have to understand where the breakdown was occurring. Having an intimate understanding of how the backup product works is a considerable help when trying to troubleshoot these problems.

end sidebar


