The difficulty in planning for storage (memory) when moving a workload from Intel or other environments is that "common knowledge" from those platforms does not apply well to Linux on z/VM environments. Performance guidance for those platforms rightly recommends large amounts of storage to avoid swapping, which results in very large storage sizes.
One of the options provided by z/VM is the ability for Linux servers to swap at a very high rate with no degradation of service. Thus there is no need to increase storage to a size that meets the largest requirement. CPU utilization should be monitored, using both z/VM and Linux measurements, and alleviated with more storage as it approaches high levels.
VM memory is usually called "storage". Sometimes it is called "central processor storage" or "real storage". It can be called "RAM" by those using Linux terms. All virtual machines, virtual disks, minidisk cache (MDC), and other address spaces exist in VM storage. When storage in a virtual machine is not referenced for a period of time, there is no reason to maintain that storage in VM storage. This storage can be paged out by z/VM when memory is overcommitted and there is a need for other pages to be resident. Paging is either to DASD or to expanded storage (a section of processor memory).
One benefit of z/VM is the ability to define memory size at the level that each virtual machine requires. Some physical environments limit this to increments of 256 MB. So if 300 MB would do under VM, you would have to define 512 MB in some non-VM environments.
There is a storage hierarchy that z/VM has implemented for performance reasons. When there is expanded storage, z/VM will steal pages from virtual machines, virtual disks, and other address spaces and move those pages to expanded storage. When that becomes full and even more storage is required, z/VM will then migrate pages from expanded storage to the paging devices.
This three-level storage hierarchy is very efficient and provides a very effective caching architecture for the paging devices. In addition, these paging devices may have more levels of cache inside the storage processor. See Figure 11-6 on page 279 for an overall view of this hierarchy.
Figure 11-6: Storage hierarchy
All virtual machine pages and virtual disk pages are first loaded into main storage. When these pages become idle, they become candidates for stealing and then for page migration, depending on current storage requirements. So the pages in main storage are meant to be the actively used pages—the pages currently in use. Because not all of the pages are active pages, and because some servers or virtual disks are idle at times, those pages can be paged out to make room for more work.
The term "overcommitment of storage" describes how much storage has been defined compared to how much is actually available. There is no good rule of thumb for the amount of overcommitment that provides good performance. As Linux and Linux applications become better players in this shared storage configuration, the level of overcommitment that can be supported rises. It also depends heavily on how many servers go idle, and for how long, during the periods when they do not need main storage.
There is also a concept of "working set" as compared to required pages. The working set refers to the active pages used by a server. Some of the pages might be used for initialization only or for error recovery. There is no need to leave pages in main storage that are never referenced.
Overcommitted is good; it is the only way to enforce sharing of resources. The paging algorithms in VM have been developed over many years, and very high page rates can be supported. However, you should not overcommit resources for a single Linux guest. The sum of virtual machine size and virtual disk space should not exceed what is available to z/VM.
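As a sketch, the sizing rule above can be expressed as a quick calculation. The guest sizes and real storage figure below are hypothetical, chosen only to illustrate the arithmetic:

```python
# The single-guest rule from the text: one guest's virtual machine size
# plus its virtual disk space should not exceed what z/VM has available.
# All sizes are in MB and are hypothetical.

def overcommit_ratio(guest_sizes_mb, real_storage_mb):
    """Total defined virtual storage divided by real storage."""
    return sum(guest_sizes_mb) / real_storage_mb

def single_guest_fits(vm_size_mb, vdisk_mb, real_storage_mb):
    """True if one guest's total definition fits in real storage."""
    return vm_size_mb + vdisk_mb <= real_storage_mb

guests = [256, 196, 512, 256]            # four hypothetical Linux guests
real_storage = 512                       # hypothetical real storage for z/VM

print(overcommit_ratio(guests, real_storage))      # ~2.38: overcommitted
print(single_guest_fits(256, 100, real_storage))   # True: 356 MB <= 512 MB
print(single_guest_fits(512, 100, real_storage))   # False: violates the rule
```

Overcommitment across all guests is normal and desirable; the rule applies only to any single guest's definition.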
In a virtualized environment where there is significant benefit in sharing resource, reducing storage requirements for each server benefits the total system capacity. When too much storage is provided to Linux, a Linux server will populate that storage with cached data. This can be a large misuse of valuable storage resource. To reduce this storage, Linux memory can be reduced, and Linux will in turn reduce the size of its cache.
One of the experiments that we ran during the project was to start with a smaller virtual machine. Figure 11-7 shows the storage analysis for a run with a large number of users.
Figure 11-7: ESAUCD2 Linux memory analysis, 256 MB run
The important numbers from this report are:

Total real storage
The virtual machine size was 256 MB; Linux reports 249.2 MB of storage, reserving about 5% for the kernel, of which 246.2 MB was in use. This number alone is not very meaningful, because Linux will use all available storage for cache when it is available.

Swap storage in use
Of a total of 242 MB available, 147 MB was in use.

Cache storage in use
192.4 MB. This is the main use of storage in this example.
Recognizing that the cache size was much larger than required, we reduced the virtual machine size to 196 MB. Figure 11-8 on page 281 shows that swap was used slightly more, going from 147 MB to 165 MB, but more importantly, the cache size dropped significantly. The overall z/VM storage requirement dropped from a total of about 400 MB (256 + 147) to 360 MB (196 + 165).
Figure 11-8: ESAUCD2 Linux Memory Analysis 196MB run
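The storage arithmetic behind that comparison, using the measured figures quoted in the text:

```python
# z/VM storage requirement for the guest, approximated in the text as
# virtual machine size plus swap in use (figures from the measured runs).

def zvm_requirement_mb(vm_size_mb, swap_in_use_mb):
    return vm_size_mb + swap_in_use_mb

before = zvm_requirement_mb(256, 147)   # 256 MB machine, 147 MB swap in use
after = zvm_requirement_mb(196, 165)    # 196 MB machine, 165 MB swap in use

print(before, after, before - after)    # 403 361 42: roughly 400 vs 360 MB
```

The 60 MB cut in machine size cost only 18 MB of extra swap, a net saving of about 42 MB of z/VM storage per guest.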
For this workload, no degradation in response time was measured when storage was reduced. CPU usage of both Linux and z/VM should be monitored to ensure that the cost of swapping stays low.
One more test was run, with storage reduced to 128 MB. That turned out to be too small to support this workload.
The standard Linux installation documentation often refers to a swap disk and places that device on real disk. It is important to understand why this can lead to poor performance when storage requirements are larger than available memory. As Linux fills its RAM, it begins moving pages to the swap device. In times of heavy swapping, Linux could be directing hundreds of I/Os at a single device, which becomes a significant bottleneck. The only real caching hierarchy in this configuration is inside the storage processor (DASD controller).
Figure 11-9: Linux swap to real disk
If there were 10 Linux servers and each of those servers had a real disk assigned as a swap device, there would now be 10 potential disks for serious performance problems. The load across these volumes would not be balanced, but would depend on each Linux server's current requirements.
If those same 10 disks are instead turned into z/VM paging devices, and Linux uses VM virtual disk for swapping, there is now a three-level caching hierarchy that is backed up by 10 disks over which z/VM will balance any paging activity. Figure 11-10 shows a simplified illustration of this concept. In actuality, paging is not done through expanded storage, but through VM main storage.
Figure 11-10: Linux swap to virtual disk
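A toy model of why pooling the disks helps; the per-guest swap rates below are invented for illustration. Dedicated swap disks pin each guest's load to one device, while z/VM balances paging activity across all volumes:

```python
# Hypothetical per-guest swap I/O rates (per second) for 10 Linux guests.
rates = [5, 0, 300, 10, 0, 0, 450, 20, 0, 15]

# Dedicated swap disks: each guest is pinned to one device, so the
# busiest device carries the busiest guest's entire load.
dedicated_worst = max(rates)

# Shared z/VM paging pool: activity is balanced across all 10 volumes.
balanced_per_volume = sum(rates) / len(rates)

print(dedicated_worst, balanced_per_volume)   # 450 80.0
```

With dedicated devices, one volume would absorb 450 I/Os per second while others sit idle; the shared pool spreads the same total load evenly.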
There are several advantages to using virtual disk. Most Linux and UNIX administrators are taught that swapping is bad and greatly impacts performance. But in the virtual disk case, running many tests with swapping at 1000 swap pages per second, very good performance was maintained. The reason for the difference is that virtual disks provide a storage hierarchy that has not been used before. Virtual disks really do avoid performance issues associated with swapping. However, there is a trade-off of increased CPU usage; see 11.9.4.
Another benefit of using virtual disk is that the storage is not allocated until referenced by the Linux server. The cost of having 10 virtual disks for swap of 100 MB each is very small. The overall storage/disk requirements are greatly reduced—more swap can be defined than needed, and it will magically appear as an allocated resource only when called for.
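A sketch of the allocate-on-reference benefit; the per-guest usage numbers are hypothetical:

```python
# Sizes in MB; all numbers hypothetical. Ten guests each define a
# 100 MB virtual disk for swap, but pages are only backed by real
# resources once the guest actually references them.

VDISK_DEFINED_MB = 100
referenced_mb = [0, 12, 3, 0, 40, 0, 0, 8, 0, 5]   # pages actually touched

defined_total = VDISK_DEFINED_MB * len(referenced_mb)
backed_total = sum(referenced_mb)

print(defined_total, backed_total)   # 1000 MB defined, only 68 MB backed
```

This is why generous virtual disk swap definitions cost so little: the gap between defined and backed storage is never materialized.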
As virtual machine size and overall storage requirements were reduced, swapping to the virtual disk imposed a CPU cost. In the following measurement of the 256 MB virtual machine experiment with a 1500-user load, swapping to virtual disk ran at about 1000 swap pages per second. The number of virtual disk I/Os is the measurement for swapping to virtual disk. For the four 15-minute intervals reported, there was an average of 850 K virtual disk I/Os per interval, or almost 1000 swap pages per second over each 900-second reporting interval.
Figure 11-11: ESAUSR3 Report showing virtual disk I/O for VM guest for 196 MB run
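The swap rate quoted in the text can be derived directly from the reported counts:

```python
# About 850 K virtual disk I/Os were reported per 15-minute (900-second)
# interval; the per-second rate follows directly.

vdisk_ios_per_interval = 850_000
interval_seconds = 900

swap_rate = vdisk_ios_per_interval / interval_seconds
print(round(swap_rate))   # 944, i.e. "almost 1000 swap pages per second"
```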
The cost of performing this activity is charged to the kswapd daemon. In looking at the Host Application report in Figure 11-12 on page 284, the cost is about 9% of one of our processors to perform 1000 swap I/Os per second.
Figure 11-12: ESAHSTA report for 196 MB run showing kswapd CPU requirement
This report shows the Linux guest CPU usage. The z/VM CPU usage must also be monitored in order to get the total picture.
When using virtual disk, monitor your system and consider these trade-offs; see 12.5 for additional discussion on this subject.
Prior to z/VM 4.4.0, when a page fault occurs on a page of a virtual disk in storage, that page (and any other associated page for the fault) is brought into real memory below the 2 GB address. This may add to the contention for storage below 2 GB on those systems. The page fault behavior was changed in z/VM 4.4.0, so that the page now is brought in above 2 GB where possible (as is done with general virtual server pages).
While the pages backing the virtual disk blocks themselves are not created until referenced, the architected control blocks for dynamic address translation (page tables, segment tables, and so on) are created at the time the virtual disk is created. Further, these control blocks are not pageable and must reside below the 2 GB address in real memory. This is another factor that can increase contention for storage below 2 GB.
The VM storage management steal processing has a hierarchy of pages that it uses in trying to select the most appropriate pages. In that hierarchy, normal idle guest pages are lower on the hierarchy than a virtual disk in storage page. (A virtual disk in storage page is really a system utility space and therefore is given preferential treatment.) Linux guests often appear to stay in the dispatch list, never becoming dormant. A side effect of this is that guest pages are given close to equal priority with virtual disk in storage pages.
However, over time as this is addressed and guests start to drop from the dispatch list, an undesirable effect may occur: Linux will determine an unused page and move it out to swap disk (a virtual disk in storage). If this guest goes idle, VM storage management will steal more aggressively from the guest pages (pages Linux decided it needed) and less aggressively from the virtual disk in storage (pages Linux decided it was safe to page out).
The optimal case is when your Linux server has so much memory that you hardly ever swap. This means you do not waste response time and CPU cycles on swapping. Unfortunately that does not scale well, because in real life you cannot afford to give z/VM so much memory that all of these big Linux virtual machines stay fully resident. Therefore, z/VM needs to page out portions of these Linux machines. That is unpleasant because:
- A page fault on the first level may prevent Linux from doing anything useful (despite asynchronous page fault handling).
- Linux and z/VM both implement a type of LRU algorithm to allocate pages, and these do not combine well.
In conclusion, the sum of virtual machine size and Linux swap space depends on the application requirements. The virtual machine size depends on what you can afford (or are willing to give to) the Linux server. You cannot choose an optimal size without knowing the resources available on z/VM. Some tuning philosophies would just add more memory, but the complex interaction of Linux and z/VM could at times cause a high-cache, high-memory situation to result in more paging than if you limited each system to what it needed.
Our follow-on experiments were to reduce the machine size first to 196 MB, then to 128 MB. The run with 196 MB provided equivalent response time, with the swap rate averaging about 10% higher, and the kswap daemon also about 10% higher.
The 128 MB experiment proved the case that the swap rate is linearly proportional to the CPU required by the kswap daemon. At the one complete interval shown, swapping was 4 million for the interval, or over 4000 per second. The processor utilization was close to 200% across the two processors, as compared to about 130% across two processors for the previous run; see Figures 11-13 and 11-14.
Figure 11-13: ESAUSR3 showing virtual disk I/O 128 MB run
Figure 11-14: ESAHST1 process analysis showing kswapd CPU for 128 MB run
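The same derivation for the 128 MB run, plus a linear extrapolation of kswapd cost from the 9%-per-1000-swaps figure measured earlier. The extrapolation is an assumption for illustration, not a measurement:

```python
# 4 million swap I/Os in one 900-second interval for the 128 MB run.
swap_rate_128 = 4_000_000 / 900
print(round(swap_rate_128))           # 4444, i.e. "over 4000 per second"

# Linear extrapolation (an assumption, not a measurement) from the
# earlier figure of ~9% of one processor per 1000 swaps per second:
kswapd_fraction = 0.09 * swap_rate_128 / 1000
print(round(kswapd_fraction * 100))   # ~40% of one processor for kswapd
```

Note that the measured ~200% utilization across two processors includes all work, not just kswapd; the extrapolation covers only the swap daemon's share.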
If your virtual machine size is too small, you will begin to swap at a high rate. To reduce swapping, you can either allocate a larger virtual machine size or allocate virtual disks as your swap devices. Too much swapping increases CPU (charged to the kswap daemon), but too large of a virtual machine will increase the overall real storage requirement.
This will be a constant area for analysis and planning to utilize your current resources most effectively. When storage is an issue, review the size of the cache. If the cache size is excessive, then reducing the virtual machine size will reduce the size of the cache, providing some storage back for performing other work.
Another trade-off is the amount of storage allocated to minidisk cache, the benefit received, and the amount of storage left for the virtual storage of the virtual Linux servers. The benefits increase as more of the referenced data is shared across multiple Linux guests.
In the report shown in Figure 11-15 from the run supporting three servers, the minidisk cache hit ratio was 47 to 48% of all read I/O being performed by the servers. The relationship between the percentage of minidisk cache hits and the amount of storage allocated to minidisk cache is very workload-dependent. No conclusions should be drawn from this specific data, other than to understand the benefit being provided and the cost of that benefit. In this case, the minidisk cache was much larger than one would see in most environments. In main storage, the average size of the minidisk cache was 380 MB, and in expanded storage, an additional 2 GB was used. Typical production environments should see a total of less than 500 MB.
Figure 11-15: ESAMDC analysis for three-server run
The size of the minidisk cache should be controlled, as z/VM will dynamically size the minidisk cache based on defaults that will not apply to these workloads. If paging becomes an issue, then reducing the size of the MDC from the default is normally a quick solution. The SET MDC command is used for this function.
Using the diagnose I/O driver will also impact the MDC storage requirement. When the volumes are formatted to allow the use of diagnose I/O, MDC can be tuned to use a record cache instead of a track cache. This is done at the minidisk level. If the MDC size requirement is too large, using the diagnose driver with record cache is an option worth evaluating.
The technology for supporting Linux under z/VM is improving rapidly. From a Linux view, ensure you are taking advantage of the timer patch. The normal Linux timer pops every 10 ms by default, or 100 times a second. The impact of this is twofold: it uses processor time, and it keeps the virtual machine active and therefore less eligible to be paged out. The patch changes this so that the timer pops only when necessary.
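The scale of the default timer, as a quick calculation:

```python
# The default 100 Hz timer fires every 10 ms whether or not the guest
# has work to do; per idle guest, per day, that is:

HZ = 100
pops_per_day = HZ * 60 * 60 * 24
print(pops_per_day)   # 8640000 wakeups keeping an idle guest "active"
```

Multiplied across dozens of mostly idle guests, these wakeups both burn processor time and defeat z/VM's ability to recognize the guests as idle.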
When you share resources with other virtual machines, it is not wise to install applications or daemons that you do not need. See 10.3 for more information on installing the timer patch. Also see Linux on IBM zSeries and S/390: Performance Measurement and Tuning, SG24-6926 for information on how to analyze the timer ticks when the virtual machine does not drop from the queue when idle.
The next requirement is a more recent queue-drop APAR, VM63282, which applies to z/VM 4.3 and z/VM 4.4. This APAR allows servers using QDIO, HiperSockets, and virtual CTCA for communications to drop from queue, so that those servers can be paged out and their page working sets calculated effectively. The APAR was installed as part of the project but was not evaluated to determine its impact; expectations are that overall storage requirements will drop significantly.
To summarize, capacity planning for storage is a non-trivial task, and assumptions carried over from the originating platform are not valid in this environment. An Intel-based server running with 1 GB of RAM will likely run very well in a much smaller virtual machine.
In planning for storage, you should minimize the amount of storage used for cache. This storage is dedicated to Linux and will be active storage. With current DASD technology that can provide close to 1ms response time, there is less need for an internal data cache.
In addition, under z/VM, the minidisk cache also provides a dynamic cache of which Linux will take advantage. A benefit of the minidisk cache is that when workloads on one server are at a peak, more of the minidisk cache will be used for that server. Thus you have a shared resource that will be utilized by each server as they become heavily used, and then will be made available to other servers as heavily used servers become idle.