Using Database Performance Data


Possible causes of database performance problems include poor data model design or a poorly written database application program. Programs should first be tested heavily in a development environment before going into production. However, resolving design and programming problems is beyond the scope of this book. You may want to consult documentation from the database vendors, such as Oracle's Oracle Server Concepts, for more information on these topics.

Additional causes of performance problems can be an inappropriate system configuration or an ineffective use of database resources. To identify these types of situations, you need to collect performance statistics, using the metrics and tools discussed earlier in this chapter. With the database in production, performance monitoring tools should be used to monitor loads and collect performance data. By collecting performance statistics, you can establish a baseline to compare against when a major performance problem exists. Because application workloads change and usage varies, it is important to collect data over a period of time rather than rely on a single snapshot of data.

Avoiding Performance Issues

Proactive monitoring against established baselines can enable you to identify resource usage trends that could lead to performance problems. By doing so, you can correct a situation before end users experience poor response times.

Other types of performance problems involve an inability to prioritize different workloads and allocate each a designated amount of system resources. Two products can help prioritize database workloads: the Process Resource Manager and Oracle Database Resource Manager.

The Process Resource Manager (PRM) can be used to control system resource allocations for users and applications. Both minimum and maximum entitlements can be specified. CPU, real memory, and disk I/O bandwidth can be controlled by PRM. PRM, however, is available only on HP-UX.

You can use PRM on HP-UX to isolate database applications from other applications. The database management system can be given resource entitlements for CPU, memory, and I/O to help reduce interference from other applications. If the system is busy, PRM ensures that the database applications get their configured share of system resources. If an application doesn't need all the resources it is entitled to, PRM can be configured to distribute its resources to other applications. PRM can also be configured to limit an application's maximum resource usage for capacity planning purposes.

In addition to isolating the database management system from other applications on the system, PRM has special Oracle support to allow resources to be allocated to specific databases. Oracle embeds the database name in the processes that it creates to handle database accesses. By configuring processes for each database instance to be in separate PRM groups, you can manage each database separately. This is especially useful if some database applications have a higher priority than others. For example, if long-running database jobs are of a lower priority, but need access to the same disks accessed by interactive users, then PRM I/O bandwidth controls can be used to ensure that the jobs make continual progress without seriously impacting other users. A less sophisticated approach may be to have policies in place to ensure that long-running jobs are not executed during periods of heavy database or system usage. Another approach is simply to run those jobs at a lower priority than the other applications on the system.

Oracle also provides resource management capabilities with its Oracle8i release. The Oracle Database Resource Manager can be used to prioritize database workloads between different users. For example, users running OLTP applications could be given more of the CPU than decision-support users. Only the CPU usage associated with Oracle is being manipulated; the CPU entitlements for other applications running on the server should not be impacted. However, the Database Resource Manager behaves optimally if Oracle is the only application running on the server.

Checking for System Contention

If users are complaining of a performance problem, you may want to check the user call rate or database transaction rate first to get an idea of how much useful work, if any, is being done. This information is available in the RDBMS MIB for standard databases. In addition, Oracle provides metrics for the number of user calls and the number of user commits in its Private Database MIB. By monitoring this data over a period of time, you can calculate rates for these measures. Transaction rates can be compared to a known baseline to give you some idea of whether a performance problem exists. Transaction rates can be unreliable metrics, however, because low numbers could also indicate less end-user activity. Ideally, you would check the database response time, but no good way exists to track this metric directly.
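Turning these cumulative MIB counters into rates is a matter of differencing two samples over the collection interval. The following minimal sketch shows the calculation; the function name and the sample values are illustrative, not part of the RDBMS MIB:

```python
def counter_rate(prev_count, curr_count, interval_seconds):
    """Rate of change of a monotonically increasing counter (e.g., user
    calls or user commits) between two samples taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("sample interval must be positive")
    return (curr_count - prev_count) / interval_seconds

# Example: 1600 - 1000 = 600 user commits over a 60-second interval
commit_rate = counter_rate(1000, 1600, 60)   # 10 commits per second
```

The same calculation applies to any cumulative counter in the MIB; the resulting rate is what you compare against your baseline.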

If a performance problem is reported, you need to determine whether the cause is a system bottleneck, a disk bottleneck, or misconfigured database parameters. Potential causes of system resource bottlenecks are an untuned operating system, a system that is too small to accommodate the database workloads, or other applications running on the system. You may need to increase the capacity of the system by adding more memory or processors. As previously mentioned, PRM can help you prioritize resources among applications.

To check for real memory bottlenecks, you can first check the amount of free memory. It should not drop below 5 percent of the total available. If the system can't keep up with the demands for memory, it will start paging and swapping. Excessive paging and swapping may be a sign of a memory bottleneck. GlancePlus, MeasureWare, and BMC PATROL are some of the tools available to show you paging rates.
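The 5 percent free-memory rule of thumb can be expressed as a simple threshold check. This is only a sketch; the function name is hypothetical, and the actual free and total memory figures would come from a tool such as GlancePlus or MeasureWare:

```python
def memory_is_low(free_kb, total_kb, threshold_pct=5.0):
    """True if free real memory has dropped below threshold_pct of total,
    suggesting the system may soon start paging and swapping."""
    return 100.0 * free_kb / total_kb < threshold_pct

# Example: 30 MB free out of 1 GB is 3 percent -- below the 5 percent rule
low = memory_is_low(30000, 1000000)   # True
```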

CPU resource contention can be identified by high utilization and a long run queue. This should be sustained over a period of time to indicate a serious CPU bottleneck. In a database environment, check for memory contention first, because much of the CPU's time is spent paging and swapping data between disk and main memory. High CPU usage can be an indication of an application problem, such as a missing database index. You should also look for runaway processes dominating the CPU usage.

Networking bottlenecks may be identified by high utilization and many collisions on shared media. They can be addressed by redesigning an application to reduce its network traffic. Other options are to use higher-capacity network links, or to move client and server systems closer together.

Checking for Disk Bottlenecks

Disk-related metrics can be used to identify unbalanced I/O activity. If this is the problem, you may need to restructure the database and relocate data files on different disks to avoid disk and controller bottlenecks. You should not have the database files on the disk used for swapping. Database tables and their indexes should be on different disks, because they are accessed concurrently. If the database is updated frequently, the redo logs should also be put on a different disk, because the logs are frequently updated as well. You should also try to place tables that are commonly used together onto separate disks. Ensure that OS disk striping and Oracle striping are being used appropriately. You should also see whether a system memory bottleneck exists, to determine whether adding more real memory to the system may fix the problem.

Database tables, indexes, and rollback segments are each created with an initial disk allocation. If these allocations are exceeded, then more space must be allocated dynamically, potentially all over the disk. This disk fragmentation slows down I/O operations and can lead to disk bottlenecks. An administrator may need to restructure the database periodically to repair problems, such as a significant amount of database fragmentation. PLATINUM's TSReorg tool can show database fragmentation graphically, so that you can quickly assess the size of the problem. TSReorg can be used together with DBVision so that actions such as a database reorganization can be taken in response to a database problem event.

Checking Database Buffer and Pool Sizes

For Oracle databases, some parameters in the init.ora configuration file can be adjusted to reduce system resource contention and improve memory performance. Another option is for the database administrator to limit or control user logins. The number of concurrent users can be controlled to help ensure adequate response times. Policies can be enforced to disconnect inactive sessions.

Sorting can be a very time-consuming part of a database query. Oracle provides several configurable parameters in the init.ora file, such as SORT_AREA_SIZE, SORT_WRITE_BUFFERS, and SORT_WRITE_BUFFER_SIZE. These parameters determine how much of a sort is done in main memory. A large sort area can make database queries faster, but because the resource parameters are applied to each user session, it also increases contention for system memory. A small sort area may force writes to disk during the sort.
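A hypothetical init.ora fragment illustrating these sort parameters follows; the values shown are placeholders for discussion, not tuning recommendations:

```
# Sort parameters (values are illustrative only)
SORT_AREA_SIZE = 1048576          # bytes of memory per session for in-memory sorts
SORT_WRITE_BUFFERS = 4            # buffers used when a sort spills to disk
SORT_WRITE_BUFFER_SIZE = 65536    # size of each sort write buffer, in bytes
```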

If batch jobs run during periods when few users are logged in to the system, perhaps during evenings, you may want to switch between two sets of configuration parameters. After business hours, you can take advantage of more free memory to increase the sort area size, and then switch to a smaller size during peak usage to account for the larger number of users. Oracle keeps track of the number of sorts that go to disk in its Private Database MIB. Comparing this to the total number of sorts yields a sort overflow ratio, which should be zero for OLTP applications.
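The sort overflow ratio described above can be sketched as follows; the function and argument names are illustrative, with the counts taken from the disk-sort and memory-sort counters:

```python
def sort_overflow_ratio(disk_sorts, memory_sorts):
    """Fraction of all sorts that spilled to disk.
    Ideally 0 for OLTP applications."""
    total = disk_sorts + memory_sorts
    if total == 0:
        return 0.0
    return disk_sorts / total

ratio = sort_overflow_ratio(5, 95)   # 5 of 100 sorts went to disk: 0.05
```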

As mentioned earlier, a number of resources are allocated on a per-user basis. If a performance problem exists, you may want to check who is currently logged in, to determine if unauthorized users are logged in or if some users have logged in multiple times. You can use who or check the number of inbound associations using the RDBMS MIB to get an idea of system activity. Earlier in the chapter, Listing 8-2 showed how an SQL SELECT statement could be used to display the current database user processes. Monitoring the number of concurrent sessions over time can be useful in identifying any spikes in usage. Oracle7 and Oracle8 have the ability to establish resource limits for a user's CPU time and amount of disk I/O. These parameters are defined in the user's database profile.

Oracle has a single buffer cache and shared pool area that is shared among all database users. The DB_BLOCK_BUFFERS and SHARED_POOL_SIZE parameters in the init.ora file control the size of these resources. Data from database tables, indexes, clusters, and rollback segments may be stored in the buffer cache. The buffer cache hit ratio can be calculated for Oracle by using the number of consistent gets, number of block gets, and number of physical reads metrics stored in Oracle's Private Database MIB. Other tools, such as BMC PATROL's KM for Oracle, also keep track of the buffer cache hit ratio. By checking the buffer cache hit ratio, you can see whether the size of the buffer cache is appropriate. The hit ratio should be 70 to 85 percent. If the hit rate is low and free memory exists, consider increasing the size of the buffer cache, because this is likely to improve performance. You can increase the cache size until the hit ratio stops increasing, as long as free memory is available and no CPU bottleneck exists. In cases where a CPU bottleneck exists and no memory is available, reducing the size of the buffer cache may be appropriate, to transfer some of the work from the CPU to disk. Balancing system resources this way can lead to better overall response times for database users.
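The buffer cache hit ratio can be computed from the three metrics named above using the classic formula 1 - (physical reads / (consistent gets + block gets)). This sketch assumes the counters have already been sampled from the MIB; the function name is illustrative:

```python
def buffer_cache_hit_ratio(consistent_gets, db_block_gets, physical_reads):
    """Fraction of block requests satisfied from the buffer cache.
    Logical reads = consistent gets + block gets; a miss goes to disk."""
    logical_reads = consistent_gets + db_block_gets
    if logical_reads == 0:
        return 0.0
    return 1.0 - physical_reads / logical_reads

# 2000 physical reads against 10000 logical reads: 0.80, within the
# 70-85 percent target range
ratio = buffer_cache_hit_ratio(8000, 2000, 2000)
```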

The shared pool area is used for dictionary information and shared SQL and PL/SQL code. A lot of SQL statement reloading may indicate that the shared pool area is too small. For Oracle, the metrics needed to calculate the library cache miss ratio are in the library cache table of the Oracle Private Database MIB.
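Assuming the library cache table provides cumulative reload and pin counts, the miss ratio calculation can be sketched as follows; the function name and sample values are illustrative:

```python
def library_cache_miss_ratio(reloads, pins):
    """SQL statement reloads per library cache pin.
    A high value suggests the shared pool area is too small."""
    if pins == 0:
        return 0.0
    return reloads / pins

miss_ratio = library_cache_miss_ratio(10, 10000)   # 0.001
```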

Database checkpointing affects disk performance because it increases the I/O activity to the disk. The LOG_CHECKPOINT_INTERVAL parameter in Oracle's init.ora file determines how often changed data in the buffer cache is written to the database. Make sure the LOG_CHECKPOINT_INTERVAL is equal to or larger than the size of the redo logs.
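A hypothetical init.ora fragment for this parameter follows; the value is a placeholder, and in practice it should be sized against your actual redo log files:

```
# Checkpoint interval in OS blocks; set equal to or larger than the
# redo log size so checkpoints occur no more often than log switches
LOG_CHECKPOINT_INTERVAL = 10000
```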

Oracle's LOG_BUFFER parameter is used to determine the amount of modified data that should be buffered before writes are made to the redo log files. Increasing the amount buffered reduces the number of disk writes needed, by writing data in larger blocks. This may be useful if many changes are being made to the database. If system memory is a bottleneck and not many database updates are taking place, it could indicate that the log buffer is too large. You can observe the log buffer's effectiveness by monitoring the I/O rate of the disk containing the redo log and the number of times a user has waited for space in the redo log buffer, which ideally should be zero. In BMC PATROL, the RedoSpaceRequests metric will tell you this. From Oracle's Private Database MIB, you can use the redo log space requests and redo log entries to calculate a redo log space wait ratio, which should stay below 1/5000.
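The redo log space wait ratio from the two MIB counters can be sketched as follows; the function name is illustrative, and the 1/5000 threshold is the one given above:

```python
def redo_log_space_wait_ratio(space_requests, redo_entries):
    """Waits for redo log buffer space per redo entry.
    Should stay below 1/5000 (0.0002)."""
    if redo_entries == 0:
        return 0.0
    return space_requests / redo_entries

ratio = redo_log_space_wait_ratio(1, 10000)
acceptable = ratio < 1 / 5000   # 0.0001 is below the 0.0002 threshold
```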

Contention may occur if multiple users try to access the redo log buffer latches. BMC PATROL monitors this metric as RedoAllocationLatch. If contention is occurring, decrease the LOG_SMALL_ENTRY_MAX_SIZE parameter, which indicates the largest log entry that can be written on a redo allocation latch without obtaining the redo buffer copy latch. This forces the larger copy latch to be used. You should also increase LOG_SIMULTANEOUS_COPIES to twice the number of CPUs on the system.

Database chaining occurs if a row is updated and becomes too large to fit in a single physical database block. Multiple block reads are necessary to read data from a chained row, causing performance to degrade. To determine whether this is a problem, you need to check periodically the percentage of chained row fetches being made. If this number is large, you may need to increase the database block size, which requires re-creating the database.
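The chained-row check above amounts to computing what percentage of row fetches required a continued-row read. This is a sketch with illustrative names; the underlying counts would come from the database's fetch statistics:

```python
def chained_fetch_percentage(continued_row_fetches, total_row_fetches):
    """Percentage of row fetches that required reading an additional
    block because the row is chained across blocks."""
    if total_row_fetches == 0:
        return 0.0
    return 100.0 * continued_row_fetches / total_row_fetches

pct = chained_fetch_percentage(50, 10000)   # 0.5 percent of fetches chained
```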


UNIX Fault Management: A Guide for System Administrators
ISBN: 013026525X
Year: 1999