There are a variety of tests one can perform to identify where the 'problems' exist within one's backup environment. This document covers only the few that I have found most useful in my capacity as a Consultant working with NetBackup, and will briefly expose you to the ideas and concepts behind them.
So you have identified a performance problem with some of your backup clients. Perhaps you noticed it because of errors you have been receiving, or you have been using my clnt_thruput.sh script and started to see a decline in performance for a particular client. Whatever the case may be, you are now here looking at ways to troubleshoot the issue. There are many connection points that could be bottlenecks for a backup client. On the client we have CPU, MEMORY, DISK and NIC, just to name a few, with a NETWORK in between the client and the server. The server could be suffering from similar issues as well. So simply saying that we have performance problems is a very broad statement, and it requires us to 'drill down' and narrow that statement a bit. The way we do this is to divide these areas up as much as possible so that each one can be tested on its own, with as high a level of consistency as possible. Network performance is probably blamed for 90% of backup performance problems. Why? Because the network is the easiest target. Well, here's to the defense of the network groups out there: it may be your backup server, or the client's disk. It may also be the network, but this document will help us better discern where the problem lies.
Now there are a couple of things that can be done to eliminate the many variables that exist between the client and the server. For our first example, we assume that the client and server are two different machines, so we can take the network and the server out of the equation and focus solely on the client. What we want to test is how fast the client can 'read' data off of the disk. We do this by having the data sent to an infinitely fast device, the bit-bucket (/dev/null on Unix, NUL on Windows).
To see whether the client itself is the bottleneck, perform the following using bpbkar, which is in the bin directory where your NetBackup software is installed. For information's sake, bpbkar is the backup/archive process or daemon that is responsible for file collection and for creating the image that will eventually be put on the tape or disk being used for backup. In our example, however, we will not be sending data across the network; we will redirect STDOUT to NUL and STDERR to a file to be reviewed later. This file will give you the number of files that were processed and the size of the 'backup image' that was created, but ultimately sent to NUL. When you begin this process, select a sizeable amount of data in order to get a fairly accurate representation, and track the amount of time it takes to process the command. Do not write the STDERR file to the same device you are reading from, because this will skew your results.
NT/2000:
This test uses just bpbkar writing to the bit bucket (NUL on Windows, /dev/null on Unix), which eliminates the network portion of the equation.
c:\Veritas\Netbackup\bin\bpbkar32.exe -nocont > NUL 2> (for NT)
i.e. c:\Veritas\Netbackup\bin\bpbkar32.exe -nocont c:\ > NUL 2> temp.f
temp.f will contain all of the files that bpbkar has collected. This will grow considerably if there is a large file system or directory structure you are testing with. BE WARNED.
From Unix:
/usr/openv/netbackup/bin/bpbkar -nocont / > /dev/null 2> /tmp/files.out
Same idea.
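To turn the client-side run into a throughput number, the timing can be scripted. The following is only a minimal sketch for a Unix client and is not part of NetBackup: the function names measure_read and kb_per_sec are made up for illustration, and STDOUT is piped into wc -c rather than /dev/null so that the image size is counted directly, on the assumption that counting bytes costs little compared with reading the disk. It also assumes your date command supports +%s.

```shell
#!/bin/sh
# Hypothetical helpers, not NetBackup commands.

# Print throughput in KB/s from a byte count and elapsed seconds.
# Integer arithmetic; anything under one second is treated as one second.
kb_per_sec() {
    bytes=$1
    secs=$2
    if [ "$secs" -lt 1 ]; then
        secs=1
    fi
    echo $(( bytes / 1024 / secs ))
}

# Run a command, count the bytes it writes to STDOUT, keep STDERR in a log.
# Usage: measure_read LOGFILE COMMAND [ARGS...]
measure_read() {
    log=$1
    shift
    start=$(date +%s)                 # assumes date supports +%s
    bytes=$("$@" 2> "$log" | wc -c | tr -d ' ')
    end=$(date +%s)
    elapsed=$(( end - start ))
    echo "bytes=$bytes seconds=$elapsed KB/s=$(kb_per_sec "$bytes" "$elapsed")"
}
```

For example, measure_read /tmp/files.out /usr/openv/netbackup/bin/bpbkar -nocont / would print the byte count, the elapsed seconds and the resulting KB/s for the whole run, with the file list still landing in /tmp/files.out.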
So now we take the testing from the client to the server. Both of these tests leave the network completely out of their respective equations. If your testing of the client has left you with reasonable results, then it is time to test the server. This may be moot, especially if your overall backup performance is good with the exception of a few clients. But as good-natured System Administrators we want to make sure that we have adequate test documentation before we address our network group. Besides, it's always good to perform these self-assessments every now and then.
There is an undocumented feature in NetBackup that allows you to eliminate the tape subsystems, disk and drive controllers from the equation and simply write your backups to, yes, the bit-bucket. As I mentioned before, this is an infinitely fast device. If we could only write to NUL for all of our backups, our performance issues would go away, but alas we would have no data to show for it, so we can't. If you would like to do this, you must understand that there are several CAVEATS and WARNINGS that, if not heeded, will have negative effects on your environment. Chief among them is that you will be unable to restore any data from disk storage units, because the data will simply not be there.
So here's the reason we want to run this type of test. First of all, it will let us know how NetBackup is processing the data. Could there be anything inherently wrong with our NetBackup configuration? Hopefully part one of this test will help us see that a bit more clearly. The second reason is that once our NetBackup server configuration has been ruled out as the culprit, and our client is performing well in its local test, we can test the backup from the client to the server across the network.
Here's how we begin:
Create a disk storage unit; even if you already have some created I recommend that you create a new one and call it DISK_STU_TEST or something similar.
Now touch /usr/openv/netbackup/bpdm_dev_null or, for WINTEL systems, create an empty file called bpdm_dev_null in the NetBackup directory.
Run your test backup.
I usually create a backup policy called special that I use for ad hoc backups or testing. You may want to do something similar to avoid any modification of your production backup policies. Whatever you decide to do, make sure you select the DISK_STU_TEST storage unit to be used by the test backup policy. When this backup job runs, the fragments will be created but the file length will remain 0, and bpdm will write the image to the bit-bucket (i.e. /dev/null).
When you back up to a disk storage unit, the bptm (backup tape manager) daemon or process is not invoked; bpdm (backup disk manager) is used instead. Therefore, if this test performs well while your tape backups remain slow, we may be looking at a tuning issue with regard to shared memory and/or tape buffers.
Since we are sending all of the image data to NUL, we will not be able to restore any of this data that we backed up during our test. This is true for ALL disk storage units created on this particular media server. You must remember to DELETE bpdm_dev_null after your testing is complete.
If you touch the file, you should see the following in the bpdm debug log for backups:
> really writing to /dev/null
This tells us that we have created the storage unit correctly for this test and that any other subsequent backups to a disk storage unit on this media server will really write to /dev/null.
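The setup and the log check can be wrapped in a few small functions so they are done the same way every time. This is a sketch only: the function names are hypothetical, NB_DIR defaults to the Unix install path used above, and the log file is passed in by the caller because where your bpdm debug log lives depends on how logging is set up on your media server.

```shell
#!/bin/sh
# Hypothetical helpers; NB_DIR may be overridden for a non-default install.
NB_DIR=${NB_DIR:-/usr/openv/netbackup}

# Create the empty trigger file that sends disk storage unit writes to /dev/null.
enable_null_writes() {
    touch "$NB_DIR/bpdm_dev_null"
}

# Succeeds while the trigger file exists, i.e. while restores are impossible.
null_writes_enabled() {
    [ -e "$NB_DIR/bpdm_dev_null" ]
}

# Succeeds if the given bpdm debug log contains the confirmation message.
# Usage: confirmed_in_log /path/to/bpdm/logfile
confirmed_in_log() {
    grep -q "really writing to /dev/null" "$1"
}
```

After the test backup has run, confirmed_in_log with the current bpdm log file as its argument returns success only if the message above was written, which is a quick way to make sure the test really bypassed the disk.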
Caveats:
Don't try restores.
Doing this affects ALL disk-based backups on the server (all go to null).
I have been told that while it has yet to be proven, it may cause problems with tape based backups as well. I personally haven't had this experience.
Doing this leaves extra information lying around in NB databases.
Expire any images that you create with this test after the test is over using the bpexpdate command.
When you are done, remove the /usr/openv/netbackup/bpdm_dev_null file.
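Because forgetting the cleanup is the dangerous part of this test, it may help to script it. The sketch below is an illustration, not a NetBackup tool: the function names are made up, NB_DIR is assumed to be the Unix install path, and the bpexpdate options shown (-backupid with -d 0 to expire immediately) are my assumption of the usual form and should be checked against the command reference for your release. The commands are only printed so you can review them before running anything.

```shell
#!/bin/sh
# Hypothetical cleanup helpers; NB_DIR may be overridden.
NB_DIR=${NB_DIR:-/usr/openv/netbackup}

# Remove the trigger file and report whether it is really gone.
remove_null_trigger() {
    rm -f "$NB_DIR/bpdm_dev_null"
    if [ -e "$NB_DIR/bpdm_dev_null" ]; then
        echo "bpdm_dev_null still present"
        return 1
    fi
    echo "bpdm_dev_null removed"
}

# Print (do not run) one expire command per backup ID given as an argument.
# The option form is an assumption; verify it for your NetBackup version.
print_expire_cmds() {
    for id in "$@"; do
        echo "bpexpdate -backupid $id -d 0"
    done
}
```

Printing the commands first keeps the script from expiring anything other than the test images by accident; the backup IDs themselves have to come from your own record of the test jobs.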
The final test simply uses FTP. We have now tested and timed the backup from the client to the bit bucket, and from the client through the server to the bit bucket; now we want to try the network itself, from the client to the server. This will tell us whether we are looking at a client issue, a server issue or a network issue. If nothing else, it will give us good information on how to proceed in resolving our performance problem.
Run this test from the client, then from the server and evaluate both findings to determine the next step.
FTP from the client to the server, then server to client.
FTP something large enough to measure (1 GB preferably).
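If no suitable file is lying around, one can be generated, and the FTP timing turned into a rate that can be set beside the backup figures. This is a rough sketch with made-up function names; it assumes dd and awk are available and that a file of zeros is acceptable test data for a plain FTP transfer.

```shell
#!/bin/sh
# Hypothetical helpers for the FTP comparison.

# Create a test file of the given size in megabytes, filled with zeros.
# Usage: make_test_file PATH SIZE_MB
make_test_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Print MB/s to one decimal place from a size in MB and elapsed seconds.
# Fails without printing if the elapsed time is not positive.
mb_per_sec() {
    awk -v m="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", m / s }'
}
```

make_test_file /tmp/ftp_test.dat 1024 creates the 1 GB file. Once the transfer has been timed in each direction, mb_per_sec 1024 80 (a 1 GB transfer that took 80 seconds) prints 12.8, which can then be compared with the rate the backup itself achieves.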
While this may not seem like a real test, it does accomplish something for us. We are able to transfer data OUTSIDE of the backup application, allowing us to compare the times with FTP and the times with backup. With all of this information and testing complete, the profile of the client backup should be relatively clear and we should be able to at least determine where the root cause of the problem exists. If this is not the case, open a support call with VERITAS. Be prepared to submit all of your documentation to them, so you can avoid any lost time resolving your issue.
