4.4. Memory Failures and Fault-Tolerant Memory | HP ProLiant Servers AIS: Official Study Guide and Desk Reference


Because memory is an electronic storage device, it has the potential to return information different from what was originally stored. The probability of memory failures has increased in recent years, as shown in Figure 4-13.

Figure 4-13. Probability of server outages due to memory failure.

There is a direct correlation between increased memory capacity and an increased annualized failure rate (AFR), the probability of one memory failure in one year. Several factors are driving increased memory capacity:

DRAM density is growing.
More DRAM chips are populated on each DIMM.
DRAM chips are being stacked on top of each other to create higher-capacity DIMMs.
More DIMM slots are available in servers.

The probability of a memory failure increases by about 3% for every 1GB of installed memory, so it reaches 48% when memory is increased to 16GB. The AFR rises whether the added memory comes from higher-capacity DIMMs or from additional DIMM slots in the system.
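The arithmetic behind those two figures can be checked with a short sketch. It assumes the relationship is strictly linear (3% per installed gigabyte), which is how the text states it; the function name and the sample capacities are illustrative only.

```python
def memory_failure_probability(installed_gb, rate_per_gb=0.03):
    """Linear estimate taken from the text: 3% per installed gigabyte.

    The linear model is an assumption for illustration; it is not a
    general reliability formula.
    """
    return installed_gb * rate_per_gb


for gb in (1, 4, 8, 16):
    print(f"{gb:>2} GB -> {memory_failure_probability(gb):.0%}")
```

At 16GB the estimate is 16 x 3% = 48%, matching the figure quoted above.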

Recent moves to reduce power consumption in servers also contribute to memory failures. As memory module voltages decrease, the signal margin also decreases, making the signal more susceptible to noise. As a result, memory error rates increase.

Two kinds of errors can typically occur in a memory system: hard errors and soft errors.

A hard error occurs when there is a physical problem such as a defective DIMM or a broken connection. A hard error recurs unless the problem is repaired. HP tests every DIMM that is placed in its servers to reduce the number of hard errors a customer might encounter.

A soft error occurs randomly when an electrical disturbance near a memory cell alters the charge on the capacitor. A soft error does not indicate a problem with a memory device, because after the data is corrected, the same error does not recur.

Soft errors are more prevalent than hard errors. Research has shown that the number of soft errors increases with memory capacity. Based on 460 soft errors per year for 10,000 computers with 64MB of memory each, approximately 3 soft errors will occur per computer per year for every 4GB of memory.
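The scaling in that estimate can be reproduced directly. The sketch below assumes, as the text does, that the soft error count grows in proportion to installed memory; the constant and function names are illustrative.

```python
# Baseline from the text: 460 soft errors per year across
# 10,000 computers with 64MB of memory each.
ERRORS_PER_YEAR = 460
COMPUTERS = 10_000
MB_PER_COMPUTER = 64


def soft_errors_per_year(memory_mb):
    """Expected soft errors per computer per year, scaled linearly by capacity."""
    errors_per_mb = ERRORS_PER_YEAR / (COMPUTERS * MB_PER_COMPUTER)
    return errors_per_mb * memory_mb


print(round(soft_errors_per_year(4 * 1024), 2))  # 2.94
```

The baseline works out to 0.046 errors per computer per year at 64MB; 4GB is 64 times that capacity, giving roughly 2.94, or approximately 3.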

HP has used a number of technologies to detect, and in many cases correct, soft errors. These technologies include parity checking, error checking and correcting (ECC), and advanced ECC.

4.4.1 Parity Checking

The first error-detecting technology used in memory was parity checking. It was originally designed for desktop computers. A system that uses parity checking adds a parity bit to each byte when data is written to memory, as shown in Figure 4-14.

Figure 4-14. Parity checking adds an extra parity bit to each byte.

The system counts the number of 1s in the byte. If it is an even number, the parity bit is 0. If it is an odd number, the parity bit is 1.

Example

1110011 = 1 for parity

1100011 = 0 for parity
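The rule can be expressed in a few lines. This is a minimal sketch of the even-parity scheme described above, applied to the two bit patterns exactly as printed; the function name is illustrative.

```python
def parity_bit(bits):
    """Return 1 if the bit string holds an odd number of 1s, else 0."""
    return bits.count("1") % 2


print(parity_bit("1110011"))  # 1 (five 1s)
print(parity_bit("1100011"))  # 0 (four 1s)
```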

When the system reads the byte from memory, the memory controller recalculates the parity bit and compares it to the stored parity bit. If a single bit in the byte has changed, the new parity value will not match the stored parity value. The controller knows that an error has occurred and shuts down the system to avoid data corruption. No error is logged.

Although parity checking detects many errors, it does have some drawbacks. Parity checking can detect an error only when an odd number of bits in a byte have changed. If two soft errors occur in the same byte, the parity bit will be correct although the byte is corrupt. In addition, parity checking does not know which bit is bad and cannot correct it.
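The following sketch illustrates that limitation. It models the read-time check as a simple comparison between the recomputed and stored parity bits; the byte value and the flipped bit positions are arbitrary choices for illustration.

```python
def parity_bit(value):
    """Even-parity bit for an integer: 1 if it has an odd number of 1 bits."""
    return bin(value).count("1") % 2


def read_check(stored_byte, stored_parity):
    """Return True if the recomputed parity matches the stored parity bit."""
    return parity_bit(stored_byte) == stored_parity


original = 0b01110011
parity = parity_bit(original)          # written alongside the byte

one_flip = original ^ 0b00000100       # one bit changed
two_flips = original ^ 0b00010100      # two bits changed

print(read_check(one_flip, parity))    # False: the error is detected
print(read_check(two_flips, parity))   # True: the corruption goes unnoticed
```

In both failing cases the check returns only pass or fail, so the controller has no way to tell which bit changed.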

4.4.2 ECC

ECC memory was introduced in Compaq Systempro XL servers in 1992, and is used in all ProLiant servers. ECC memory can detect up to 4-bit errors in a single byte. More importantly, it can correct single-bit errors.

ECC systems are similar to parity systems, with one important difference. Instead of calculating a parity bit for each byte of data written to memory, ECC systems calculate a 72-bit syndrome (checksum) for every 64 bits of data. The checksum contains the 64 bits of data and 8 check bits. Figure 4-15 illustrates this concept.

Figure 4-15. ECC memory technology.

On a read operation, the memory controller recalculates the checksum and compares it to the stored checksum. If the two do not match, the system knows that an error has occurred. For a single-bit error, it uses the checksums to determine which bit is corrupt and corrects that bit before sending the data to the requesting device.
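The text does not say which code produces the 8 check bits. As an illustration only, the sketch below uses an extended Hamming code (7 position-based check bits plus 1 overall parity bit), a common textbook way to get single-bit correction and double-bit detection for a 64-bit word. The layout, function names, and test values are assumptions, and the variable `syndrome` follows the coding-theory meaning (the recomputed check result), not this chapter's use of the word for the 72-bit stored value.

```python
# Assumed layout: position 0 holds overall parity, positions 1..71 form a
# Hamming code whose check bits sit at the power-of-two positions.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]  # 64 slots


def encode(data):
    """Return a 72-bit codeword (list of 0/1) for a 64-bit data word."""
    code = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit covers every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, 72) if pos & p) % 2
    code[0] = sum(code) % 2  # overall parity, used to spot double-bit errors
    return code


def decode(code):
    """Return (data, status): status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        code[syndrome] ^= 1  # single-bit error: syndrome is the bad position
        status = "corrected"
    else:
        return None, "uncorrectable"  # e.g. two flipped bits
    data = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status


word = 0x0123456789ABCDEF
stored = encode(word)

single = list(stored)
single[37] ^= 1                      # one corrupted bit
double = list(stored)
double[5] ^= 1
double[60] ^= 1                      # two corrupted bits

print(decode(stored)[1])   # ok
print(decode(single)[1])   # corrected
print(decode(double)[1])   # uncorrectable
```

A single flipped bit is located and repaired, while two flipped bits are detected but cannot be repaired, which mirrors the behavior the text attributes to standard ECC.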

Some soft errors are multibit errors, which traditional ECC cannot correct and which will cause an ECC system to fail. The potential for multibit errors increases with memory capacity. For example, servers with 1GB of memory using ECC are protected against memory failures only about as well as servers with 64MB of memory using parity checking. With each new generation of servers, memory capacity increases and so does the potential for system failures.

4.4.3 Advanced ECC

To improve memory protection, Compaq introduced advanced ECC (AECC) technology in 1996. HP and other server manufacturers continue to use this solution in current server product lines.

Standard ECC devices can detect and correct a single-bit error during a read from a DIMM. This prevents the error from causing the server to shut down. Standard ECC memory also detects multibit errors, but it cannot correct them. AECC can correct a multibit error that occurs within one DRAM chip on a DIMM and thus can correct a complete DRAM chip failure.

The following sections describe how AECC handles memory writes, memory reads, and memory errors in a system with four DIMMs.

4.4.3.1 MEMORY WRITE

During a memory write in an AECC system, a 256-bit cache line is sent from the processor to the memory controller. The data in the cache line is divided into four 64-bit data words, which are sent to four error detection and correction (EDC) circuits.

Each EDC circuit generates 8 bits of ECC data that it adds to the 64-bit data word to form a 72-bit syndrome. The memory controller sends the syndrome to 4 DIMMs. Each DIMM has 18 DRAM chips, 9 on each side, for a total of 72 DRAMs across the 4 DIMMs. For each syndrome, the memory controller stripes the data across all 72 DRAMs, sending 1 bit to each chip.
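The write path can be sketched as follows. The check-bit function is a deliberate placeholder (a byte-wise XOR, not a real ECC code), since the text does not specify the algorithm the EDC circuits use; the function names and the sequential DRAM numbering are likewise assumptions made for illustration.

```python
def split_cache_line(cache_line):
    """Split a 256-bit cache line into four 64-bit data words."""
    mask = (1 << 64) - 1
    return [(cache_line >> (64 * i)) & mask for i in range(4)]


def placeholder_check_bits(word):
    """Stand-in for an EDC circuit's 8 ECC bits. NOT a real ECC code."""
    check = 0
    for i in range(8):
        check ^= (word >> (8 * i)) & 0xFF
    return check


def build_72bit_words(cache_line):
    """Append 8 check bits to each 64-bit word, giving four 72-bit values."""
    return [w | (placeholder_check_bits(w) << 64)
            for w in split_cache_line(cache_line)]


def stripe(words72):
    """Return 72 entries, one per DRAM; DRAM d receives bit d of each word.

    Under the assumed sequential numbering, DRAM d sits on DIMM d // 18.
    """
    return [[(w >> d) & 1 for w in words72] for d in range(72)]


def unstripe(drams):
    """Reassemble the four 72-bit values from the per-DRAM bits."""
    return [sum(drams[d][k] << d for d in range(72)) for k in range(4)]


cache_line = int.from_bytes(bytes(range(32)), "little")  # arbitrary 256-bit value
words72 = build_72bit_words(cache_line)
drams = stripe(words72)

print(len(drams), len(drams[0]))    # 72 4
print(unstripe(drams) == words72)   # True
```

Because each DRAM holds exactly one bit of each 72-bit value, losing an entire chip damages only one bit per value.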

4.4.3.2 MEMORY READ

When a device requests data from memory, the 72-bit syndrome is retrieved from the DRAMs and sent to the EDC circuits. The circuits run several ECC algorithms on the data to ensure its integrity. Then the EDC circuit strips the extra 8 bits of ECC data from the syndrome and sends the data back to the requesting device.

4.4.3.3 MEMORY ERRORS

If an EDC circuit detects a single-bit error within a syndrome, it can correct the error before sending the data to the requesting device.

Advanced ECC can detect and correct up to a 4-bit error, if all the errors originate from the same DRAM.

When a DRAM fails, it sends a single bit of corrupt data to each EDC circuit. Each circuit detects and corrects the error sent to it before sending the data to the requesting device.

AECC can detect and correct up to four single-bit errors, if all the errors originate from different DRAMs and go to separate EDC circuits. Advanced ECC cannot correct multibit errors if errors from different DRAMs are sent to the same EDC circuit.
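These rules reduce to one condition: a set of errors is correctable as long as no EDC circuit receives more than one corrupt bit. The sketch below models that condition only; it does not perform any actual error correction, and the (DRAM, EDC circuit) pair representation and the sample DRAM numbers are assumptions for illustration.

```python
from collections import Counter


def correctable(bad_bits):
    """Return True if every EDC circuit sees at most one corrupt bit.

    bad_bits is an iterable of (dram, edc) pairs, one per corrupted bit,
    where edc (0 to 3) identifies the circuit that receives the bit.
    """
    per_edc = Counter(edc for _dram, edc in set(bad_bits))
    return all(count <= 1 for count in per_edc.values())


# A whole DRAM fails: one corrupt bit reaches each of the four circuits.
dram_failure = [(17, edc) for edc in range(4)]

# Four single-bit errors from different DRAMs, each going to a separate circuit.
spread = [(3, 0), (20, 1), (41, 2), (66, 3)]

# Two different DRAMs both corrupt the data handled by circuit 2.
collision = [(5, 2), (48, 2)]

print(correctable(dram_failure))  # True
print(correctable(spread))        # True
print(correctable(collision))     # False
```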
