11.8 Performance Optimized Page Sizes (POPS) | HP-UX CSE(c) Official Study Guide and Desk Reference

Performance Optimized Page Sizes (POPS) is a technique we can employ to reduce the number of Virtual-to-Physical address translations the processor makes while executing a thread. You may have heard of POPS by various other names, including superpages and variable page size. This feature applies only to machines using a PA-8X00 processor or later. This processor range implements the PA-RISC 2.0 architecture, which supports POPS. Our first job is to establish whether we are using a processor with the PA-RISC 2.0 architecture. There are two ways we can do this:

Is the kernel variable cpu_arch_is_2_0 set to 1? If so, the processor is from the PA-8X00 range.

    root@hpeos003[] echo "cpu_arch_is_2_0/D" | adb /stand/vmunix /dev/kmem
    cpu_arch_is_2_0:
    cpu_arch_is_2_0:                1
    root@hpeos003[]

Does the processor support a 64-bit address space? If so, this indicates that it is at least a PA-8000 processor.
```
root@hpeos003[] getconf HW_CPU_SUPP_BITS
64
root@hpeos003[]
```
Some early PA-8000 processors return 32/64 from the above getconf command. This means that the processor will support either a 64-bit or a 32-bit operating system.
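The possible getconf results described above can be folded into a small helper. This is only a minimal sketch: the function name supports_64bit is hypothetical, and it merely interprets a string passed to it (on a real system that string would come from getconf HW_CPU_SUPP_BITS). The value 32 for older processors is an assumption used for illustration.

```shell
# Hypothetical helper: interpret the value printed by
# "getconf HW_CPU_SUPP_BITS" (passed in as $1).
# Prints "yes" when 64-bit is among the supported widths, else "no".
supports_64bit() {
    case "$1" in
        64|32/64) echo yes ;;
        *)        echo no ;;
    esac
}

# Values mentioned in the text, plus an assumed 32-bit-only result:
supports_64bit 64       # prints "yes"
supports_64bit 32/64    # prints "yes"
supports_64bit 32       # prints "no"
```

On an HP-UX system, it could be called as supports_64bit "$(getconf HW_CPU_SUPP_BITS)".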

Every time a memory address is referenced, the hardware must know where to locate that memory. In Chapter 9, Swap and Dump Space, we discussed how the system translates a Virtual address into a Physical address. The hardware solution for quickly resolving a Virtual page number to a Physical page number is to look in a high-speed cache called the Translation Lookaside Buffer (TLB). Before the processor can access the page, it must be able to perform the translation as well as check that the Physical page is located in the instruction/data cache.

The problem we are looking at here concerns the TLB. Up until the PA-8X00 processor, a TLB entry referred to a 4KB page of memory. When we have large memory configurations and threads with large data sets, this can result in lots of activity in the TLB. With the initial release of the PA-8X00 processor, the number of entries in the TLB was smaller than the number of entries in the later PA-7X00 processors. When we cannot resolve the Virtual to Physical page number in the TLB (a TLB miss), we must ask the kernel to load the TLB with the appropriate entry from the Page Directory. With fewer TLB entries and larger data sets, applications were spending a higher proportion of their time stalled on TLB misses. This can cause a considerable delay in the processor pipeline and have a direct, detrimental impact on overall performance.

Figure 11-15. Translating a Virtual address.

NOTE: To speed up the process of establishing whether a page is referenced in the TLB and in the cache, both the TLB and the cache are searched simultaneously.

The solution to the problem of fewer TLB entries is to utilize POPS. Each TLB entry on the PA-8X00 processor has an additional field not found on previous processors. As well as the translation from the Virtual to the Physical page number, an entry now records the size of the page to be found starting at that address. In this way, a single TLB translation can reference large chunks of data held in memory. This is the essence of the variable page size implementation on HP-UX. POPS can be used to reference any type of data, i.e., user text, user data, shared memory, shared libraries, memory mapped files, and so on. There are two ways that the operating system can make use of this feature. One is a kernel-driven solution; the other is a user-driven solution.
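To see why a larger page size helps, consider how much memory a full TLB can map at once (sometimes called the TLB reach): the number of entries multiplied by the page size each entry covers. The sketch below is illustrative only; the function name tlb_reach_kb is hypothetical, and the entry count of 96 is an assumed figure, not a specification of any particular PA-8X00 processor.

```shell
# TLB reach = number of TLB entries x page size mapped by each entry.
# $1 = number of TLB entries, $2 = page size in KB; prints the reach in KB.
tlb_reach_kb() {
    echo $(( $1 * $2 ))
}

# The entry count (96) is an assumption chosen purely for illustration.
tlb_reach_kb 96 4       # 4KB pages:  prints 384    (KB)
tlb_reach_kb 96 64      # 64KB pages: prints 6144   (KB, i.e., 6MB)
tlb_reach_kb 96 4096    # 4MB pages:  prints 393216 (KB, i.e., 384MB)
```

With the same number of entries, moving from 4KB to 64KB pages multiplies the memory that can be referenced without a TLB miss by 16.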

11.8.1 POPS using vps_ceiling and vps_pagesize

The kernel-driven solution is known as transparent page size selection. This is where the kernel will decide on a suitable page size depending on system configuration, current memory usage, and the size of the object in use. Two kernel parameters control transparent selection: vps_pagesize and vps_ceiling:

vps_pagesize: This kernel parameter specifies the default page size the kernel will use for objects such as data and text segments. This page size is used when a user has not specified the page size for individual programs using the chatr command. This parameter is specified in kilobytes and has a default value of 4 (KB).
vps_ceiling: This parameter is the maximum page size the kernel will use during transparent selection. The kernel will monitor the use of different pages of memory and modify the page size as it sees fit, up to the maximum specified by vps_ceiling. This parameter is specified in kilobytes and has a default value of 16 (KB).

With transparent selection, the kernel will select what it thinks is a good page size, up to the maximum specified by vps_ceiling. In doing so, it is trying to reduce the number of translations the processor needs to make when dereferencing Virtual page numbers on behalf of a thread. This should alleviate any issues regarding the processor having fewer TLB entries than other processors. The default setting of vps_ceiling has been seen to be appropriate with a varied mix of applications. Where an administrator knows that a specific application uses large data sets, specifying a page size for individual programs using the chatr command can result in even greater performance gains.

11.8.2 POPS using chatr

Selecting a page size for individual programs is not easy. First, you have to be sure that a particular program will use large data sets. If you specify a page size that is not suitable for a program, the kernel may allocate pages to a thread that are underutilized, i.e., allocating 64KB pages of memory when the application is reading in 16KB chunks of data. This is a waste of memory. In many situations, you will need to work with your application suppliers to try to determine the most suitable page size for a particular program. If we wanted to try to determine this value ourselves, we would have to know at least two pieces of information:

What is the size of segments of memory being requested by a process/thread?
How many TLB misses are we experiencing? This is an important question. If you are not experiencing any TLB misses, the overall gain in performance may be smaller than you first anticipated. If an application is spending only 1 percent of its time experiencing TLB misses, then we can improve application performance by only 1 percent. On the other hand, if an application is spending 20 percent of its time experiencing TLB misses, then we can improve performance by up to 20 percent. That's quite a difference.
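The arithmetic behind that last point can be sketched in a few lines. This is a simplified model, not a measurement technique: the function name remaining_time_pct is hypothetical, and it assumes that time spent in TLB misses is the only thing a larger page size changes.

```shell
# Remaining run time, as a percentage of the original, when a program
# spends $1 percent of its time in TLB misses and a larger page size
# removes $2 percent of that miss time. Uses integer arithmetic, so the
# saving is rounded down to a whole percent.
remaining_time_pct() {
    echo $(( 100 - ($1 * $2) / 100 ))
}

remaining_time_pct 1 100    # prints 99: at best a 1 percent gain
remaining_time_pct 20 100   # prints 80: at best a 20 percent gain
remaining_time_pct 20 50    # prints 90: half the misses removed
```

The second argument makes the upper bound explicit: even a perfect page size choice cannot recover more than the time that was being lost to TLB misses in the first place.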

There are few supported tools publicly available that measure the number of TLB misses experienced by a processor. The only one I know of is a tool used mainly by application developers to monitor the usage of specific functions inside a program. This tool is known as cxperf. If you had such a tool, you could monitor the rate of TLB misses every time you selected a particular page size. Successive tuning exercises would result in an optimal page size for your particular program. Unfortunately, these tools are few and far between. This means that we will have to experiment with various page sizes or work with our application supplier, who has (I hope) benchmarked this aspect of performance for HP PA-8X00 processors.

Before we look at using the chatr command to implement POPS on a program-by-program basis, let me say a word about the likelihood of the kernel honoring the page size you specify. The page size we specify is known as a chatr hint. By this, we mean that we are suggesting what we think is a good page size. If a page fault results in a page number that is not aligned on a boundary that is a multiple of the requested page size, the kernel will select a page size smaller than the chatr hint. Topics such as page alignment are not something we can go into here, but it's enough to say that application programmers need to be aware of it when designing data structures in a program. All we can do is hope that the application supplier has a good development organization and that they have maximized the efficiency in data design to match the underlying hardware architecture. Now we can consider using the chatr command on a program-by-program basis.
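The fallback from a chatr hint to a smaller page size can be modelled roughly as follows. This is a deliberately simplified sketch and not the kernel's actual algorithm: the function name usable_page_size_kb is hypothetical, addresses are expressed in KB for readability, and only alignment is considered (the real kernel also weighs factors such as memory availability).

```shell
# Page sizes supported by chatr, in KB, largest first (256MB down to 4KB).
SIZES_KB="262144 65536 16384 4096 1024 256 64 16 4"

# $1 = address of the faulting page in KB, $2 = chatr hint in KB.
# Prints the largest supported size that does not exceed the hint and
# on whose boundary the address is aligned.
usable_page_size_kb() {
    for size in $SIZES_KB; do
        if [ "$size" -le "$2" ] && [ $(( $1 % size )) -eq 0 ]; then
            echo "$size"
            return 0
        fi
    done
    echo 4   # fall back to the 4KB base page size
}

usable_page_size_kb 128 64   # prints 64: 128KB is a multiple of 64KB
usable_page_size_kb 80 64    # prints 16: not 64KB aligned, but 16KB aligned
usable_page_size_kb 68 64    # prints 4:  only 4KB aligned
```

The second and third calls show why data layout matters: the same 64KB hint yields very different page sizes depending on where the data happens to start.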

The page sizes we can specify with chatr are 4KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB, 64MB, and 256MB. Any other sizes will be rounded down to the next supported value. We can also specify a size of "L", which will use the largest page size available. Once we have decided on a page size, we can apply it to specific programs. In this example, I am specifying a chatr hint of 64KB for the main program of our finance application:
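The rounding rule for unsupported sizes can be sketched as a small function. The name round_down_page_size_kb is hypothetical, sizes are given in KB, and the handling of requests smaller than 4KB (mapped to 4KB here) is an assumption, since the text does not cover that case.

```shell
# $1 = requested page size in KB. Prints the largest supported size
# (4KB, 16KB, ... 256MB, expressed in KB) that does not exceed it.
# Requests below 4KB yield 4 (an assumption for illustration).
round_down_page_size_kb() {
    result=4
    for size in 4 16 64 256 1024 4096 16384 65536 262144; do
        if [ "$size" -le "$1" ]; then
            result=$size
        fi
    done
    echo "$result"
}

round_down_page_size_kb 64      # prints 64:   already a supported size
round_down_page_size_kb 100     # prints 64:   rounded down from 100KB
round_down_page_size_kb 2048    # prints 1024: 2MB is rounded down to 1MB
```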

    root@hpeos003[] chatr +pd 64K /finance/bin/finDB
    /finance/bin/finDB:
       current values:
             shared executable
             shared library dynamic path search:
                 SHLIB_PATH     disabled  second
                 embedded path  disabled  first  Not Defined
             shared library list:
                 dynamic   /usr/lib/libc.2
             shared library binding:
                 deferred
             global hash table disabled
             plabel caching disabled
             global hash array size:1103
             global hash array nbuckets:3
             shared vtable support disabled
             static branch prediction disabled
             executable from stack: D (default)
             kernel assisted branch prediction enabled
             lazy swap allocation disabled
             text segment locking disabled
             data segment locking disabled
             third quadrant private data space disabled
             fourth quadrant private data space disabled
             third quadrant global data space disabled
             data page size: D (default)
             instruction page size: D (default)
             nulptr references disabled
             shared library private mapping disabled
             shared library text merging disabled
       new values:
             shared executable
             shared library dynamic path search:
                 SHLIB_PATH     disabled  second
                 embedded path  disabled  first  Not Defined
             shared library list:
                 dynamic   /usr/lib/libc.2
             shared library binding:
                 deferred
             global hash table disabled
             plabel caching disabled
             global hash array size:1103
             global hash array nbuckets:3
             shared vtable support disabled
             static branch prediction disabled
             executable from stack: D (default)
             kernel assisted branch prediction enabled
             lazy swap allocation disabled
             text segment locking disabled
             data segment locking disabled
             third quadrant private data space disabled
             fourth quadrant private data space disabled
             third quadrant global data space disabled
             data page size: 64K
             instruction page size: D (default)
             nulptr references disabled
             shared library private mapping disabled
             shared library text merging disabled
    root@hpeos003[]

The +pd option specifies a page size for data segments, while the +pi option specifies a page size for code/text segments. Note that if an application makes use of a large number of shared libraries, you will need to run the chatr command on those libraries as well, because they do not inherit the page size of the calling program.
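When several files need the same hint, it can help to generate the commands first and review them before running anything. The sketch below is a dry run only: it prints chatr command lines rather than executing them. The function name print_chatr_commands and the library path /finance/lib/libfin.sl are hypothetical, used purely for illustration.

```shell
# Dry run: print (do not execute) the chatr commands that would apply a
# data page size hint to a program and the shared libraries it uses.
# $1 = page size hint (e.g. 64K); remaining arguments = file paths.
print_chatr_commands() {
    hint=$1
    shift
    for file in "$@"; do
        echo "chatr +pd $hint $file"
    done
}

# /finance/lib/libfin.sl is a made-up library name for this example.
print_chatr_commands 64K /finance/bin/finDB /finance/lib/libfin.sl
```

Once the printed list looks right, the same commands can be run by hand for the files you actually intend to change.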

The ability to specify a variable page size is not restricted to the root user. The owner of a program can use the chatr command to specify the page size for his or her own program. As a result, we may find some users acting as bad citizens, using an inordinately large page size and using more memory than they really need. There is a third kernel parameter that controls the use of POPS: vps_chatr_ceiling. This limits the maximum page size that a user can specify using the chatr command.

Some administrators are tempted to use the "L" option when specifying the page size. While this initially seems like a good idea, it should not be encouraged for every program on the system because this can result in the kernel selecting a completely inappropriate page size for many programs. Not only can this waste system memory, but it can also affect process startup times because the kernel has to do more work to calculate the changes to the composition of an address space utilizing variable page sizes. If you are unsure whether this will have any effect on performance, it is worth either (1) spending considerable time testing different page sizes for individual programs while running them with a typical application workload, or (2) increasing vps_ceiling and allowing the kernel to select a page size transparently.

11.8.3 Conclusions on POPS

With variable page sizes, a larger portion of a virtual address space can be mapped using a single TLB entry. As a result, applications that use large working sets can be mapped using fewer TLB entries. This lessens the overutilization of a scarce resource, i.e., TLB entries. Because fewer entries are needed, we improve the chances of avoiding a TLB miss. Fewer TLB misses mean better performance.