26.1.1 DASD subsystem
Linux disk support for the traditional zSeries disks is provided by the DASD driver code, dasd.c. It provides support for Count-Key-Data (CKD) and Fixed Block Address (FBA) disk devices, as well as VM minidisks. The DASD driver uses channel I/O to perform read and write operations.
There are currently three ways to format disks used by Linux:
The DASD driver in Linux for zSeries and S/390 comes with the dasdfmt utility to format the disks. It formats all tracks on the disk with a fixed block size. There is no support for this particular format in existing S/390 software.
The FORMAT program in CMS also formats the disk with fixed block size, but adds a special eye catcher in R3. This format is recognized by CMS and by CP.
With the CMS RESERVE command, a file is created to fill the entire (CMS-formatted) minidisk. The Linux file system is then built into this file so that the original CMS formatting of the disk is retained.
If you use the IBM RAMAC Virtual Array (RVA), you can benefit from the "Instant format" function. Instant format is part of the VM SnapShot function: using the SnapShot feature of the RVA, an instant copy of a formatted (empty) disk is made on the extent to be formatted. This copy is instantaneous and does not occupy back-end storage in the RVA.
For DASD (disk) storage, there are options to read-only share some amount of disk between servers. Using VM's minidisk cache to cache shared data once is significantly more effective than having Linux cache the same data.
For VM minidisks using the CMS RESERVE format, VM Diagnose I/O can be used to provide better performance. See "VM Diagnose I/O."
Linux can also use a disk that was only formatted by CMS. In this case, Linux will use all the blocks on the disk, so that CMS ACCESS will fail on this disk afterwards. Even when you do not need the ability to access the blocks from CMS, there may still be a reason to prefer this format over Linux dasdfmt. VM directory management products such as DirMaint can format the disk before making it available to the user ID.
There is no reason to use the dasdfmt command for Linux images on VM, except for the situation where you have a virtual machine running Linux, and you forgot to format the disks. Since Linux does not yet tolerate DETACH and LINK of minidisks very well, you would have no other option than to shut down the Linux system and get back to CMS to format it.
This would not occur if you do an automatic format with DirMaint.
XPRAM (see 20.4.7, "Expanded memory support") enables Linux to use expanded memory as disk space. For data that are used often in a session, using the XPRAM driver may help to enhance your performance. Another way to enhance performance is to use LVM for the DASD containing your Web content. LVM enables you to combine several DASDs into one logical volume. Striping the logical volume can improve DASD access times considerably.
Minidisk caching (MDC)
VM provides a feature that can provide a good performance benefit for physical disk access. VM minidisk caching allocates memory in VM to cache guest minidisks to accelerate disk access.
While Linux provides its own buffer cache, it is still advantageous to provide a "second-level" disk cache, because the Linux cache has lower priority than real user processes in Linux's memory allocation. This means that there may not always be enough free memory available for Linux to provide an effective cache internally. The memory for the minidisk cache is pre-allocated, so there is always at least one level of caching available to the Linux guests.
Minidisk caching has been shown to dramatically improve the performance of Linux guests with file systems on minidisk. There is a trade-off, however, because allocating memory to minidisk cache means that there is less memory available to be allocated to guests. Adding more memory to the processor may be justified in order to be able to take advantage of the available performance gain.
Minidisk caching effectiveness is greater for minidisks that are shared between Linux guests. In those cases, a single copy of the data can reside in the MDC instead of multiple copies of data, a different one in each guest. Minidisk caching can be done at a track level or at a block level.
Minidisk cache guidelines:
Configure some real storage for MDC.
You need some real storage allocated to MDC to keep it from thrashing in real storage. The idea is that MDC must use main storage anyway unless the reads work out perfectly block-aligned. If you do not allocate real storage for MDC, you lose the benefit of those intermediate real storage buffers. MDC picks the amount of real storage for its needs in such a way that the average lifetime of an MDC page equals the average lifetime of a user page in the dynamic paging area (DPA).
In general, enable MDC for everything.
Disable MDC for:
For large storage environments, it is not advisable to use MDC.
In large storage environments (greater than 2GB) CP tends to overcalculate MDC pages needed. Use SET MDCACHE to apply a scale-down fudge factor to CP's calculation.
MDC is a better performer than vdisks for read I/Os.
"MDC faster than vdisk reads" is a path length statement. vdisk uses system utility address spaces which tend to be accessed via general-purpose (read "slower") macros.
VM Diagnose I/O
The biggest advantage of the CMS RESERVE format is that it is the only disk format for which the current Linux for S/390 DASD driver can use VM Diagnose I/O. Diagnose I/O is a high-level protocol that allows the user ID to access blocks on its minidisks with less overhead than pure S/390 channel programs with Start Subchannel (SSCH). To enable VM Diagnose I/O in the DASD driver, you must configure the kernel with the "Support for DIAG access to CMS formatted Disks" option, which is not enabled in the default SuSE kernel. Before you can enable that option, you first need to disable the "Support for VM minidisk (VM only)" configuration option (also known as the old mdisk driver).
DASD response time
When evaluating DASD response time, the overall response time value is usually the most significant; it shows how long an average I/O to the device takes. If this value is large, the individual components of response time are evaluated. These DASD components are as follows:
Pend time: This is the amount of time it takes to start I/O on the channel; normally less than one ms.
Disconnect time: This is the amount of time it takes for the control unit to access the data. This includes rotational delays, seek delays, and processing time inside the control unit. Disc (for disconnect) time is normally less than 2 to 3 ms on cache controllers.
Connect time: This is the amount of time it takes to transfer the data on the channel, normally less than 2 ms for a 4 KB block of data.
Service time: This is the sum of pend, disconnect, and connect times. In some environments, installations may choose to use only the sum of disconnect and connect, but this is not typical.
Queue time: This is the result of many users accessing the same device. If the device is already servicing another user when an I/O is started, the I/O sits in queue. The length of time in queue is queue time. This is the component of response time that, under load, is the most variable and the most serious.
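The way these components add up can be sketched in a few lines of Python. This is only an illustration, not output from any monitor product: the class name `DasdSample`, the helper `components_over_typical`, and the sample values are hypothetical, and the thresholds are simply the "normally less than" figures quoted above. The sketch assumes that response time is service time (pend + disconnect + connect) plus queue time.

```python
from dataclasses import dataclass

# Upper bounds in milliseconds, taken from the "normally less than" figures
# in the component descriptions above (assumption: treated as hard limits).
TYPICAL_MS = {"pend": 1.0, "disconnect": 3.0, "connect": 2.0}


@dataclass
class DasdSample:
    """Average per-I/O times for one device, all in milliseconds (hypothetical type)."""
    pend: float        # time to start the I/O on the channel
    disconnect: float  # time for the control unit to access the data
    connect: float     # time to transfer the data on the channel
    queue: float       # time spent waiting behind other users' I/O

    @property
    def service(self) -> float:
        # Service time is the sum of pend, disconnect, and connect.
        return self.pend + self.disconnect + self.connect

    @property
    def response(self) -> float:
        # Assumption: response time is service time plus queue time.
        return self.service + self.queue


def components_over_typical(sample: DasdSample) -> list:
    """Return the names of the components that exceed their typical value."""
    return [name for name, limit in TYPICAL_MS.items()
            if getattr(sample, name) > limit]


sample = DasdSample(pend=0.5, disconnect=6.5, connect=1.75, queue=3.0)
print("service:", sample.service, "ms")
print("response:", sample.response, "ms")
print("over typical:", components_over_typical(sample))
```

With the sample values shown, only the disconnect component stands out, which would point toward seek, rotational, or control unit delays rather than channel contention.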
26.1.2 Processor subsystem
Knowing how much processor is used, and by which servers, is essential for efficient performance monitoring and problem analysis. You control the rate at which your servers access the processor by setting Shares. Share settings have a minimum value and a maximum value, with several options and variations of each. Using the CP monitor, you can capture over 99.9% of the processing power used. Building a processor map showing how much processor is used by LPAR, VM, Linux servers, VM servers, and CMS allows you to project future requirements based on new users or customers.
A virtual machine should have either all dedicated or all shared processors.
Dedicating is a bad idea if you are CPU-constrained. However, a dedicated processor gets the wait state assist and the minor time slice becomes 500 milliseconds. Do not mix dedicated and shared processors because it results in VM CPUs of drastically different speeds, which is problematic. Usually a higher absolute share is as good as dedicating a processor.
Use absolute if you can judge the percentage of resources required.
Use relative if it is difficult to judge the percent of resources required and if a lower share is acceptable when the system load increases.
Do not use LIMITHARD settings unnecessarily.
The downside of LIMITHARD is more scheduler overhead. Also, it masks looping users, which means you cannot discern them from legitimate work and debit their accounts.
Use a short minor time slice as it keeps CP reactive.
A long minor time slice blocks master-only work and increases the trivial response time. Additionally, if the minor time slice is long, managing absolute shares is problematic.
Set QUICKDSP ON to avoid the eligible list. QUICKDSP ON allows us to overcommit storage. However, you must make sure the paging system can handle it. Maybe the 50% rule should be changed to 33%, and you should ensure that the paging system is ready for concurrency.
A higher SHARE setting. Maybe even absolute is appropriate.
SET RESERVED in directory option.
NOMDCFS in directory option. NOMDCFS suppresses CP's usual attempt to hold the VM back to "its share" of MDC pages. Since we are "eating for two" (or N) here, this is OK. You must be careful, though, as you are giving up your only throttle on MDC for this.
Exploit DASD Fast Write for servers that do synchronous writes. With DASD Fast Write, the control unit gives CE and DE as soon as the write is safe in NVS. The 3990-3 and early 3990-6 needed this. Knobs are: SET DASDFW ON, SET NVS ON, and SET CACHE ON. Later control units do this entirely under the covers, and you no longer have any control, not even to turn it off.
Virtual Machine (VM) guidelines
Do not define more virtual CPUs than necessary. Too many virtual CPUs erodes share and increases path length through the dispatcher.
CP command guidelines
Using the CP INDICATE command, you can interrogate many items. However, if you are trying to discern trends by issuing these commands repeatedly, you are probably better off looking at reports from VMPRF or some other tool for detailed analysis over long periods.
LOAD: shows total system load.
IND LOAD shows processor utilization, XSTOR, PAGING, MDC summaries, dispatcher queue lengths, and potential load on storage (sum of working set size of virtual machines in the dispatch list divided by size of the DPA). The STORAGE value is unreliable as working set size often includes scheduler fudge factors as well as the QUICKDSP value (which causes storage to become overcommitted anyway).
IND USER EXP: more useful than IND USER.
This command provides a good sketch of one user's consumption history: pages resident, use of paging resources and processor time consumed. It is more useful than the USER command, because it shows every address space (data space) and the fields will not overflow.
IND QUEUES EXP: great for scheduler problems and quick state sampling.
This command shows the contents of the dispatch list and the eligible list, in priority order. It helps to show if an eligible list is forming. (If an eligible list forms, it means some resource is constrained.)
IND PAGING: lists users in page wait.
IND IO: lists users in I/O wait and the device they are waiting on.
IND ACTIVE: displays the number of active users over a given interval.
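The storage load figure described under IND LOAD above (the sum of the working set sizes of the virtual machines in the dispatch list, divided by the size of the DPA) is simple arithmetic. The following Python sketch shows the calculation only; the function name `storage_load` and the page counts are hypothetical, and it does not model the scheduler fudge factors or QUICKDSP effects that make the real STORAGE value unreliable.

```python
def storage_load(working_set_pages, dpa_pages):
    """Ratio of summed working set sizes to the size of the DPA.

    working_set_pages: working set size, in pages, of each virtual
                       machine currently in the dispatch list
    dpa_pages:         size of the dynamic paging area, in pages
    """
    if dpa_pages <= 0:
        raise ValueError("dpa_pages must be positive")
    return sum(working_set_pages) / dpa_pages


# Hypothetical numbers: three guests in the dispatch list.
dispatch_list_wss = [12000, 30000, 8000]
ratio = storage_load(dispatch_list_wss, dpa_pages=100000)
print(f"storage load: {ratio:.0%}")
```

A ratio above 1.0 in this simplified model would mean that the dispatched guests want more pages than the DPA holds.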
CP QUERY command:
USERS: number and type of users on system.
SRM: scheduler/dispatcher settings.
Using this command, you can look up the settings for the scheduler (DSPBUF, LDUBUF, STORBUF, and so forth). For a description of scheduler knobs, see HELP CPSET SRM.
SHARE: type and intensity of system share. Shows relative/absolute and maximum (if applicable).
FRAMES: real storage allocation.
Shows use of real storage frames: V=R, res NUC, DPA, pageable NUC, trace table. However, it does not show contention information; only how the frames are spread among these key areas.
PATHS: physical paths to device and status.
ALLOC MAP: shows DASD allocation (what is where on your DASD). Make sure to check whether you left paging on your respack.
XSTORE: assignment of expanded storage.
MONITOR: current monitor settings.
MDC: MDC usage.
VDISK: virtual disk in storage use.
Using state sampling, you can record the state of some object (user, device) periodically and then observe the sequence of samples to try to discern a trend. You can take a snap view, a low-frequency view, or a high-frequency view.
INDICATE QUEUES tells you about the whole system's scheduler behavior, and RTM Display User can tell you about a particular user.
INDICATE QUEUES gives a snapshot view of the scheduling system. If we see an E-list over a long period, for example, we can suspect a resource contention issue (storage or paging, probably). We can then go to other sources to learn more.
RTM Display User: a snapshot of recent consumption history for the specified user. Viewed repeatedly over time, we can discern trends.
Low frequency view:
RTM Display SRC:
Over a very long period (hours), we can witness the percentage of time the whole set spent being active, in page wait, I/O wait, running, in E-list, and so forth.
High frequency view:
HF data are very useful for discerning short-lived phenomena. The counters and state variables are sampled every few seconds and are rolled up into the monitor data each time a sample is written. The roll up is a histogram of the sample period or the sum of the samples over the period along with the number of samples taken.
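The roll-up described above can be sketched as follows. This is an illustration of the general idea of state sampling, not of the actual CP monitor record layout: the function name `roll_up` and the state names are hypothetical. Each sample is the state observed for one user at one sampling instant, and the roll-up is a histogram of those states together with the number of samples taken.

```python
from collections import Counter


def roll_up(samples):
    """Summarize a sequence of sampled states.

    Returns (histogram, sample_count, percentages), where histogram maps
    each state to the number of samples in which it was observed.
    """
    histogram = Counter(samples)
    total = len(samples)
    if total == 0:
        return histogram, 0, {}
    percentages = {state: 100.0 * n / total for state, n in histogram.items()}
    return histogram, total, percentages


# Hypothetical high-frequency samples for one user over one interval.
samples = ["running", "io_wait", "running", "page_wait",
           "e_list", "running", "io_wait", "running"]
histogram, total, pct = roll_up(samples)
for state, share in sorted(pct.items()):
    print(f"{state:10s} {share:5.1f}% of {total} samples")
```

Keeping the sample count alongside the histogram allows intervals of different lengths to be combined later without losing the weighting.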
26.1.3 Storage subsystem
Lack of storage to meet the requirement results in paging, and paging causes delays in service. Monitoring the storage requirements and the impacts of applications provides necessary feedback for capacity planning and performance problem analysis. There are many ways to reduce storage requirements, and on VM there are different types of storage. For storage capacity planning purposes, you should maintain a map of your storage to understand the requirements for VM, minidisk cache, Linux user storage (by customer), VM Servers (TCP/IP, management service machines), and CMS users, if any. This map should be maintained for both Expanded Storage and Real Storage.
For storage (memory), there are several options, each with different impacts on Linux, applications, and your global resources. Coming from a minicomputer or microcomputer environment, administrators have been taught that swapping is undesirable. When swapping to slow SCSI devices, this may be true, but on zSeries and S/390, there are many other options and these options can reduce your overall (global) resource requirements. For swapping, the alternative options are:
Use Virtual disk as a swap device. The benefit is a much smaller page space requirement, as well as a smaller requirement for real storage.
Use RAMdisk as a swap device. The benefit is a smaller requirement for real storage. When sharing storage between many servers, this is important.
Use SET RESERVE instead of LOCK to keep the user's pages in storage. LOCK is ill advised unless you have a very clear view of what is going on inside the virtual machine. Even locking page 0 of a VM guest is not recommended.
SET RESERVE holds a count whereas LOCK hits specific pages. Guidelines for SET RESERVE:
Define some processor storage as expanded storage.
The purpose is to provide a paging hierarchy, even when running a 64-bit CPU. Unless you have absolutely no storage constraints (paging=0), leave some expanded storage because it helps even out paging waits. See the article at http://www.vm.ibm.com/perf/tips/
Exploit shared segments and SAVEFD where possible.
SFS use of VM data spaces saves storage because it shares FSTs.
DB2 use of VM data spaces requires storage.
DB2 uses MAPMDISK to improve its DASD performance and creates private spaces for its own purposes (for example, sorting). This mostly trades storage for I/O.
Keep DASD paging allocations less than or equal to 50%.
Allocate enough space so that your paging space is never more than 50% full. You can use QUERY ALLOC PAGE to see how full your paging space is. One of the VMPRF reports also shows this.
Watch blocks read per paging request (keep > 10).
If blocks read per page-in is less than 10, you do not have enough paging space allocated. The idea is to make it easy for CP to find large runs of unused slots so that it can do block paging easily.
With multiple volumes, remember the rule: One I/O per subchannel at a time.
With multiple paths, remember the rule: All paging devices on same string, one channel for whole string is a bad idea.
Do not mix with other data types.
Mixing makes VM stop the seldom-ending channel program it runs for paging. However, the system is delivered this way; therefore, at installation time, use stand-alone ICKDSF to get rid of the paging on the SYSRES. At the same time, move the spool, if possible.
In a RAID environment, enable cache to mitigate write penalty.
The consensus used to be "don't enable cache for paging devices" because it thrashed the caches for non-paging devices behind the controller. However, in RAID, one write can result in two reads (parity and data) and then two writes (parity and data). Enabling cache helps the controller deal with this. Also, recent controllers (3990-6) are much smarter about caching, and they have enormous (GBs) cache, so cache thrashing is less of an issue.
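The two numeric paging guidelines given earlier in this section (keep paging space no more than 50% full, and keep blocks read per page-in at 10 or more) lend themselves to a simple automated check. The Python sketch below is an assumption-laden illustration: the function name `paging_warnings`, its parameters, and the warning strings are hypothetical, and the input numbers would have to be taken by hand from QUERY ALLOC PAGE output or a performance report, because no parsing of real command output is attempted here.

```python
def paging_warnings(slots_in_use, slots_allocated, blocks_read, page_in_requests):
    """Check paging figures against the 50% and 10-block guidelines.

    slots_in_use / slots_allocated: paging slots used and allocated on DASD
    blocks_read / page_in_requests: totals over the same measurement interval
    Returns a list of warning strings (empty if both guidelines are met).
    """
    if slots_allocated <= 0:
        raise ValueError("slots_allocated must be positive")

    warnings = []

    # Guideline: paging space should never be more than 50% full.
    pct_full = 100.0 * slots_in_use / slots_allocated
    if pct_full > 50.0:
        warnings.append(f"paging space {pct_full:.0f}% full (guideline: 50% or less)")

    # Guideline: fewer than 10 blocks per page-in suggests too little
    # paging space for CP to find long runs of unused slots.
    if page_in_requests > 0:
        blocks_per_read = blocks_read / page_in_requests
        if blocks_per_read < 10:
            warnings.append(
                f"{blocks_per_read:.1f} blocks per page-in (guideline: 10 or more)")

    return warnings


# Hypothetical figures for one measurement interval.
for line in paging_warnings(slots_in_use=60000, slots_allocated=100000,
                            blocks_read=8000, page_in_requests=1000):
    print("WARNING:", line)
```

Both warnings point to the same remedy in this section: allocate more paging space so that CP can do block paging easily.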