Things to Check before Calling Your Dedicated Support Professional




NT/2000, 300GB, RAID5, and Home Directories

The next real-life example involves an NT/2000 server backup client and a UNIX backup server. The customer indicated that this particular NT machine was having problems completing its backup job, which was failing with a Network Timeout error. The information our consultant received was that it was an NT/2000 machine serving as a home directory server, that it used RAID 5, and that it had 300 GB of disk, of which 80 percent was in use. The backup infrastructure for this customer was quite impressive: a large STK9310 library with 9840A, 9840B, and 9940A drives; ACS/LS library software; Sun servers for the backup servers; and a Gigabit Ethernet server network. It was a very nice environment to work in. However, this NT client had a return performance number of ~3 MB/second. Now before you jump to conclusions, this NT server was no slouch; it was nicely configured as well:

  • Windows 2000 SP2

  • 2-GB RAM

  • Two 10/100 Compaq NC3131

  • One Gigabit Ethernet Compaq NC6134

  • One Compaq Smart Array 431 Controller

  • Two Compaq StorageWorks HBA

  • Three Intel P3 550 MHz

As mentioned earlier, the storage on this server had a volume of 300 GB with 80 percent utilized, or approximately 240 GB of home directory data.

Before we get into the troubleshooting, let's talk about what we should expect to see out of this client. Gigabit Ethernet has a theoretical line rate of 125 MB/second and in practice can give you nearly 100 MB/second transfer speeds. The 9840A, the slowest of the three, is rated at 10 MB/second native and 35 MB/second compressed. So as you can see, if we can push data at even 40 percent of that 100 MB/second, we should still be able to keep that drive streaming (the operative word being should). However, we were seeing only 3 to 5 MB/second to the tape drive for this particular client. Now, there are several components that we can review here:

  • Network

  • Client

  • Backup server

  • Tape hardware

  • Tape hardware connectivity to backup server

Any one of these items could be the culprit at our customer site. As the troubleshooter, it is your job to narrow your scope in order to bring closure to this issue quickly. One of the ways we do that is by logical deduction.
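To put the expected and observed rates in perspective, the short sketch below converts them into backup windows for the roughly 240 GB on this client. It is back-of-the-envelope arithmetic only: the function name is illustrative, and it assumes 1 GB = 1024 MB and a perfectly sustained rate.

```python
# Back-of-the-envelope backup-window arithmetic using the figures quoted in the text.
# Assumptions: 1 GB = 1024 MB, and the rate is sustained for the whole job.

def backup_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Return the hours needed to move data_gb at a sustained rate_mb_per_s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return data_gb * 1024 / rate_mb_per_s / 3600


DATA_GB = 240.0  # 80 percent of the 300 GB volume

RATES = [
    ("observed (low)", 3),
    ("observed (high)", 5),
    ("9840A native", 10),
    ("9840A compressed", 35),
]

for label, rate in RATES:
    print(f"{label:>17}: {backup_hours(DATA_GB, rate):5.1f} hours")
```

At 3 MB/second the full backup needs roughly 23 hours, against about 2 hours if the drive could be fed at its 35 MB/second compressed rating. That gap is why a full backup on this client was so exposed to timeouts.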

Here's what we know about the problem:

  1. It is not pervasive; in other words, it is not affecting all clients across the board.

  2. Other clients are completing successfully to the same tape hardware as the failed client with relatively good performance numbers.

  3. Our cursory review did not conclusively pinpoint a particular tape drive as the potential culprit.

  4. It appears only to fail on full backup jobs; incremental jobs seem to finish, although they are beginning to fail as well.

  5. The NT/2000 servers are on a separate network from the UNIX servers, and while other NT/2000 servers are completing successfully, their performance could be better than what was observed.

So we can eliminate some of the items from the list initially while we plan our troubleshooting procedure. Based on what we know, the backup server seems to be the least likely cause of the problem because other client backup jobs are completing successfully and with relatively good performance numbers. The tape hardware and its connectivity can be set aside as well, based on the success rate of the majority of the backup jobs and the fact that there was no conclusive evidence that a particular drive was having problems, not to mention the fact that the failing backup client has yet to complete a FULL backup successfully. This leads us to focus our troubleshooting efforts on the network and the NT/2000 client machine. Since the NT/2000 servers are on a separate network from the UNIX servers, which are not having this type of problem, it was reasonable to include the network in our troubleshooting test plan. And, naturally, since we are looking at network timeout issues, the NT/2000 client machine is a candidate. So, of those two categories, what components will we begin troubleshooting?

  • Network

    • Speed from backup server to NT/2000 client machine

    • Speed from NT/2000 client machine to backup server

  • Client

    • NIC configuration: GIG/FULL DUPLEX

    • Disk speed

      • Copy

      • Use component of backup software to read disk (e.g., bpbkar with NetBackup)

    • Disk fragmentation

      • ScanDisk

    • Number of directories

    • Number of files

As we review this list, we first need to prioritize what we want to accomplish and make sure that as we move through the troubleshooting process, the previous task leads to the next, when possible. So as we begin, let's prioritize and document what our tasks will involve. Remember, a lot of troubleshooting is based on gut feel, so as you apply this in your environment, keep this in mind: There is no perfect method for troubleshooting, simply various styles. Choose your style and run with it.

Our plan was to focus on the client first and not the network, especially since we knew RAID5 was involved. For those of you who don't know, RAID5 pays a parity penalty on every write, and while its reads are normally quick, small random reads across a heavily used array on an older controller can be far slower than the raw disks suggest. With backup, we do lots of reads and very few writes, so the disk was the first thing we wanted to measure. Knowing this made our troubleshooting job much easier. Picture, for example, a pebble in a pond. The initial break in the water by the pebble is our NT/2000 client, and each subsequent ripple is another one of our tests, ultimately leading us out to our backup server. A natural progression is a good practice to adhere to when possible.

From the client, our test plan looked something like this:

  1. Run ScanDisk, if we can get approval from the admin team for this box.

  2. Copy data from disk in question to a separate disk, note time started and time finished, size (minimum 1 GB of data).

  3. Use utility from backup software if available to read data from the disk in question. Note time started and time finished. Size should be a minimum of 1 GB of data.

  4. Investigate number of files/directories.

  5. View properties of several directories, note time.
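Steps 4 and 5 can be approximated with a few lines of script when clicking through directory properties is too slow. The following is a minimal sketch, not part of any backup product: the function name is made up for illustration, and it simply walks a tree, counts what it finds, and times the walk.

```python
import os
import time
from typing import Tuple


def count_tree(root: str) -> Tuple[int, int, float]:
    """Walk root and return (directories, files, seconds taken).

    The root itself is not counted. Directories that cannot be listed
    are silently skipped, which is os.walk's default behaviour.
    """
    dirs = files = 0
    start = time.perf_counter()
    for _path, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return dirs, files, time.perf_counter() - start


# Example call (the path is hypothetical):
#   dirs, files, secs = count_tree(r"D:\home")
#   print(f"{dirs} directories, {files} files, walked in {secs:.1f} s")
```

A walk that takes minutes before a single byte of file data has been read is a strong hint that the sheer number of files and directories, rather than raw disk speed, is what the backup is waiting on.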

We weren't able to run ScanDisk on the drive because the administration team thought it would take too long and didn't think that was going to be the problem anyway. We skipped to Step 2. Copying data from the command prompt to an entirely separate disk didn't reveal any serious slowdown; of course, we were transferring only 1 GB of data. Next (Step 3) we tested the speed of the disk with one of the programs that come with the backup software, in this case NetBackup. The program, bpbkar32.exe on Windows, is the one actually responsible for the file collection process during a backup. We started a process that would copy the files to an infinitely fast device, the 'bit bucket,' eliminating the network and isolating the test to the NT/2000 client machine:

 c:\Veritas\Netbackup\bin\bpbkar32 -nocont c:\ > NUL 2> e:\temp.f 

temp.f will contain a list of all of the files that bpbkar has collected. It will grow considerably if the filesystem or directory structure you are testing is large.

Warning 

The text file collecting all of the files is going to a separate disk. Do not send it to the same disk you are testing with; otherwise, your testing numbers may be skewed. Be sure to time this as well.
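If the backup software's own reader is not available, the same idea (read everything, throw it away, time it) can be sketched in a few lines. The code below is a rough stand-in under stated assumptions, not a reimplementation of bpbkar: the function name and chunk size are illustrative, and it measures only sequential reads through the operating system's file cache.

```python
import os
import time


def read_tree_to_bit_bucket(root, chunk_size=1024 * 1024):
    """Read every file under root, discard the data, and time it.

    Returns (files_read, bytes_read, seconds). Files that cannot be
    opened or read (locked, permission denied) are skipped and not
    counted as files_read.
    """
    files_read = bytes_read = 0
    start = time.perf_counter()
    for path, _dirs, names in os.walk(root):
        for name in names:
            try:
                with open(os.path.join(path, name), "rb") as handle:
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)  # data is dropped here
            except OSError:
                continue
            files_read += 1
    return files_read, bytes_read, time.perf_counter() - start


# Example call (the path is hypothetical):
#   n, size, secs = read_tree_to_bit_bucket(r"D:\home")
#   print(f"{n} files, {size / 2**20 / secs:.1f} MB/second")
```

Run it against a cold cache where possible; a second pass over the same data may be served from memory and report a rate the disks cannot actually sustain.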

When the bpbkar test was run at the customer site, the results were staggering. It took literally hours simply to read the files and directories off of this server. We finally canceled the process and delivered the news to the administration team. However, this didn't mean our work was finished. We still had two other steps in our client plan, as well as our network plan, which we haven't even outlined for you yet. When the administration team saw the numbers, they reluctantly ran ScanDisk on the drive. Have you ever seen a ScanDisk report with all red? We did. Defragmentation ran over the weekend and finished Monday morning. We all breathed a sigh of relief when that happened. But our troubleshooting wasn't over yet. While we did see some improvement in performance, it didn't meet our expectations, so we then followed our network test plan:

  1. NIC configuration

  2. FTP speed from backup server to backup client

  3. FTP speed from backup client to backup server

  4. Backup speeds from the backup client to backup server to bit bucket

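During our review of the NIC configuration, we found that the NIC was set to AUTO-NEGOTIATE. Apparently, when the admin team applied a Compaq NIC patch, it reset all of their NICs to AUTO-NEGOTIATE. Auto-negotiation is not a problem by itself, but when the switch port at the other end is hard-set to a fixed speed and duplex, the NIC typically falls back to half duplex, and a mismatch of that kind cripples throughput. So not only did this server suffer because of the reset, but several others did as well. After we changed the NIC back to FULL DUPLEX, we tested our backup speeds. As we anticipated, the backup speeds were meeting and in some cases exceeding our expectations.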

Now that the problem had been fixed, we really didn't need to complete our network test plan, but in order to maintain consistency, we did so. We performed FTP tests between the servers and found the speeds to be quite acceptable. We even set up a NULL device on the backup server to test the network speed from client to server, isolating it from the tape devices and the server backplane. Our goals were met, and these test plans helped us not only fix the problem but also document it for future administrators who may run into a similar situation.
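The NULL-device test can be imitated with a small TCP sink that discards whatever it receives. The sketch below is only an illustration of the idea: the names are invented, and for the sake of a self-contained example it runs the sink and the sender on the same host over loopback. In a real test the sink would run on the backup server and the sender on the client.

```python
import socket
import threading
import time


def _sink(server_sock, result):
    """Accept one connection, discard everything received, record the byte count."""
    conn, _addr = server_sock.accept()
    received = 0
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:  # sender closed the connection
                break
            received += len(data)
    result["bytes"] = received


def measure_throughput(total_bytes, host="127.0.0.1", chunk_size=65536):
    """Send total_bytes to a discard sink on host; return (bytes_received, MB/second)."""
    result = {}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=_sink, args=(server, result), daemon=True)
        worker.start()

        payload = b"\0" * chunk_size
        start = time.perf_counter()
        with socket.create_connection((host, port)) as client:
            sent = 0
            while sent < total_bytes:
                size = min(chunk_size, total_bytes - sent)
                client.sendall(payload[:size])
                sent += size
        worker.join()  # wait until the sink has drained the connection
        elapsed = max(time.perf_counter() - start, 1e-9)
    return result["bytes"], total_bytes / (1024 * 1024) / elapsed
```

Calling `measure_throughput(1024 ** 3)` pushes 1 GB through the connection. Because nothing touches disk or tape on either side, a poor number here points at the NIC, the switch port, or the path between them.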

Incidentally, we did recommend that they address the RAID5 issue and consider some other RAID, such as RAID0 or RAID0+1, but since the customer's expectations were met, they didn't have a compelling reason to make such a drastic change.






Implementing Backup and Recovery: The Readiness Guide for the Enterprise
ISBN: 0471227145
EAN: 2147483647
Year: 2005
Pages: 176
