19.5. Improving Server Availability with Memory Technologies | HP ProLiant Servers AIS: Official Study Guide and Desk Reference

< Day Day Up >

Chapter 3 explained key memory concepts along with technologies used to detect and correct memory errors. This subsection explains methods and technologies relating to memory that can be used to improve server availability, including the following:

Online spare memory
Single-board mirrored memory
Hot-plug mirrored memory
Hot Plug RAID Memory

19.5.1 Online Spare Memory

Online spare memory technology enables a ProLiant server to remain available even if a DIMM records an excessive number of single-bit errors. Online spare memory increases availability by enabling an administrator to wait until a scheduled downtime to replace a faulty DIMM.

Online spare memory provides a spare memory bank for systems that detect excessive single-bit errors. When a memory bank with a faulty DIMM is detected, it automatically fails over to (is replaced by) a spare bank of DIMMs, as shown in Figure 19-1.

Figure 19-1. Online spare memory.

In earlier-generation servers and in servers without online spare memory, when a memory module experiences an excessive number of correctable single-bit errors, the system issues a prefailure warning and the DIMM continues to function in its degraded state. If this occurs, the recommended procedure is to power-off the server as soon as possible, replace the faulty DIMM, and restart the system.

With online spare memory, a memory bank with a faulty DIMM automatically fails over to a spare bank of DIMMs. One pair of DIMMs functions as the spare bank. Up to two other memory banks can be installed in the other slots.

When the faulty DIMM reaches a predefined single-bit error threshold, the ROM starts copying the contents of the failing bank to the spare bank in 128KB increments. During this time, the failing bank provides all read accesses. Data is written to both banks during the copy process.

Note

Because the spare bank must be able to hold all the information from a failing bank, the DIMMs in the spare bank must be the same size as, or larger than, the other banks.

After memory copying is complete, the system ROM makes the switch to the spare bank. At that point, no more reads or writes are made to the failing bank. All reads and writes are made to the spare bank.

During a scheduled shutdown, the faulty DIMM is replaced with a functioning DIMM. When the server restarts, the memory banks resume their normal functions.

The advantage of online spare memory is that the shutdown to replace the failing DIMM can be scheduled for a time when there is little activity on the server. When the server restarts, the memory banks resume their normal functions.

19.5.1.1 SPECIAL REQUIREMENTS

Online spare memory has the following requirements:

Online spare memory must be configured in the ROM-Based Setup Utility (RBSU). If you do not configure the server for online spare memory, all memory slots can be used for main memory up to the system maximum.
ProLiant G2 servers allow any bank to be defined as a spare bank.
If you are using online spare memory, the spare memory bank is not counted during the Power-On System Test (POST) and is not added to the system memory count reported to the operating system.
The DIMMs in the spare bank must be the same size as or larger than the other banks.

Note

You can find the complete special memory requirements in the user guide that shipped with the server.

19.5.1.2 CONFIGURING ONLINE SPARES

Before you configure an online spare, HP recommends that you perform the following steps to test the new memory:

1. From the Advanced Options screen in the ROM-Based Setup Utility (RBSU), disable POST speedup.

2. From the Advanced Memory Protection screen, disable Online Spare with ECC support.

3. Restart the system to begin testing the memory. This may take a few minutes, depending on how much memory is installed in the system. After the memory has been tested, you can enable POST speedup again for faster system starts.

After the memory has been tested, power-down the system and verify that bank C is populated with memory no smaller than either bank A or B. To configure the online spare, follow these steps:

1. Power-on the server. Online spare memory is disabled by default; therefore, all the memory is initially counted and configured as available primary memory.

2. At the prompt, press F9 to enter RBSU.

3. From the Advanced Memory Protection screen, enable Online Spare with ECC support. Press Esc twice to return to the main RBSU menu.

4. Press F10 to exit RBSU and restart your server. When your server restarts it will enable online spare memory and display the following message: xxxxMB System Memory and xxxxMB memory reserved for Online Spare.

Note

If the memory size requirements for proper operation are not met, RBSU will not allow you to enable online spare memory and will display this message: Caution: Current memory configuration does not support Online Spare.

19.5.2 Single-Board Mirrored Memory

Single-board mirrored memory provides a higher level of availability than online spare memory by adding protection against multibit errors using a mirrored memory bank. When a failure occurs, the system rereads the correct data from the mirrored bank. The system performs all future reads from the mirrored memory bank until the server can be shut down and the memory replaced.

Single-board mirrored memory in ProLiant servers protects against multiple noncorrectable multibit errors without degrading the performance of the memory system.

A single memory board contains two memory banks. One of the banks is designated as the primary bank and the other as the mirror. Data sent to memory is written by the memory controller to both banks simultaneously, but the system reads from the primary bank only, as illustrated in Figure 19-2.

Figure 19-2. How single-board mirrored memory works.

During a read operation, if a multibit error is detected on one or more DIMMs in the primary bank, the system reads from the mirrored bank instead. This process occurs without service intervention or server interruption. Service personnel can replace the failed DIMM during a regularly scheduled shutdown.

Note

Mirroring protects against multibit errors so long as system and mirrored DIMMs do not fail in the same cache line at the same time (a highly unlikely event). All read and write operations in a mirrored memory configuration are handled by the memory controller.

Although data is always written to both banks in the mirrored pair, data is read only from the primary bank. To ensure that all DIMMs are functioning properly, every 24 hours the system switches the primary and mirror designations and begins reading from the other bank.

19.5.2.1 SPECIAL REQUIREMENTS

The requirements for mirrored memory are as follows:

Mirrored memory must be configured in the RBSU. If you do not configure the server for memory mirroring, all memory slots on a single board can be used for main memory.
If you are using mirrored memory, only half of the physical memory is counted during POST and subsequently reported to the operating system.
Each DIMM in a bank must be the same size as its mirror in the other bank.

When implementing mirrored memory, follow these steps:

1. Start the server with POST speedup disabled and memory mirroring disabled. This enables the system to check the status of all the memory.

2. After the POST process verifies the memory, restart the server and configure memory mirroring and POST speedup.

19.5.3 Hot-Plug Mirrored Memory

Hot-plug mirrored memory provides a higher level of availability than online spare memory by adding protection against multibit errors. When a failure occurs, the system rereads the correct data from the mirrored bank on the other memory board. The system performs all future reads from the other memory board until the failed bank can be hot-plug replaced without shutting down the server, as shown in Figure 19-3.

Figure 19-3. How hot-plug mirrored memory works.

Hot-plug mirrored memory is targeted to customers who cannot afford to take a server offline to replace a failing DIMM.

Hot-plug mirrored memory works like single-board mirrored memory. The difference is that the primary banks and mirrored banks are located on different memory boards.

To use hot-plug mirrored memory, a server must have two identical memory boards, each containing several banks of DIMMs. The memory controller writes the same data to identically configured banks of DIMMs on both memory boards. The memory controller reads data from only one group.

If any bank of DIMMs has a multibit error, the system performs the following actions:

1. Rereads the correct data from the mirrored bank on the other memory board

2. Performs all future reads from the other memory board

3. Provides notification of the DIMM failure through the QuickFind diagnostic display, the memory board LEDs, the front-panel internal-health LED, and Insight Manager 7

If no errors occur, the system periodically switches which set of banks it reads from to ensure that both sets are monitored for memory errors.

! Important

If a DIMM exceeds the limit defined by HP for single-bit correctable errors, the system will not fail over to the redundant banks, but will notify you of the condition through Insight Manager 7.

Hot-plug mirrored memory also provides hot-plug replacement capability. You do not have to wait for a scheduled shutdown to replace the failed DIMM. You can remove a memory board that has a failed or degraded bank on it without shutting down the server.

After replacing the bank of DIMMs, you can reinstall the memory board when the server is still running. After the board is reinstalled, the system automatically returns to mirrored status. You can also perform a non-hot-plug replacement of failed or degraded DIMMs during a scheduled shutdown.

Note

The special requirements described in the "Single-Board Mirrored Memory" section also apply to hot-plug mirrored memory.

Hot-plug mirrored memory offers two main advantages over single-board mirrored memory. First, you can mirror more than one bank of DIMMs. Second, no server downtime is necessary to replace a defective DIMM or bank.

19.5.4 Hot Plug RAID Memory

Hot Plug RAID Memory enables a memory subsystem to withstand a complete DIMM failure and continue to operate normally in a manner similar to hot-pluggable hard drives in a RAID 4 configuration. The system corrects future reads from parity information until the failed DIMM can be hot-plug replaced without shutting down or restarting the server.

Originally, the term RAID was used to describe a fault-tolerant hard disk drive technology. The same RAID theory can be applied to DIMMs. HP calls this Hot Plug RAID Memory.

Hot Plug RAID Memory allows memory to be added, replaced, and upgraded without shutting down or even restarting the server.

Servers with Hot Plug RAID Memory use five memory controllers to control five SDRAM DIMMs. The memory controllers are part of a next-generation chipset designed by HP. Each one is an application-specific integrated circuit (ASIC).

To write data to memory, a cache line of data is split into four blocks and distributed among the controllers. Four of the memory controllers write one block of data on their associated DIMMs. A RAID engine calculates parity information, and the fifth memory controller stores the parity information on the fifth DIMM.

RAID memory takes memory protection and correction beyond the capabilities of ECC. With RAID memory, multibit errors and full DRAM failure can be corrected. Errors in the ECC detection system can also be detected, but not corrected.

Although a failure of the ECC detection system can allow memory corruption and ultimately a system failure, it might not cause problems in the system if there are no memory errors to detect or correct. The RAID memory system in the ProLiant 700 series servers allows for the detection of a failure of the ECC system, which can be corrected by replacing a memory cartridge.

< Day Day Up >