The CPU and motherboard architecture (chipset) dictates a particular computer's physical memory capacity and the types and forms of memory that can be installed. Over the years, two main changes have occurred in computer memory: It has gradually become faster and wider. The CPU and the memory controller circuitry dictate the speed and width requirements. In most cases, the memory controller resides in the server chipset's North Bridge; for this reason, Intel chipsets based on hub architecture call this chip the memory controller hub (MCH) chip. AMD Opteron processors, however, incorporate the memory controller into the processor itself. Even though a system might physically support a given amount of memory, the type of software you run could dictate whether all the memory can be used. The first modern server-class processors (the Pentium and Pentium-MMX) had 32 address lines, enabling them to use up to 4GB of memory; the Pentium Pro, Pentium II/III/4, and Xeon, as well as the AMD Athlon family (including the Athlon 64), have 36 address lines and can manage an impressive 64GB. The Opteron uses 40-bit addressing, allowing up to 1TB of physical RAM, and the Itanium and Itanium 2 processors feature 44-bit addressing, which allows for up to 16TB (terabytes) of physical RAM!
In reality, the actual limit on today's server-class processors isn't the processor's memory-address capability. Instead, it's the cost of memory, in addition to the limitations of real-world server and memory design.

SIMMs, DIMMs, and RIMMs

Starting in the mid-1980s, motherboard designs using socketed or soldered individual memory chips (often referred to as dual inline package [DIP] chips) began to be replaced by motherboards designed to use small circuit boards that had multiple memory chips soldered to them. The benefits of this approach included faster system assembly, increased reliability, and easier replacement of failed modules. The first generation of memory modules used a design known as the single inline memory module (SIMM). For memory storage, most modern systems have adopted SIMMs, DIMMs, or RIMMs as an alternative to individual memory chips. These small boards plug in to special connectors on a motherboard or memory card. The individual memory chips are soldered to the module, so removing and replacing them is impossible. Instead, you must replace the entire module if any part of it fails. The module is treated as though it were one large memory chip. One type of SIMM, three main types of DIMMs, and one type of RIMM have been used in servers. The various types are often described by their pin count, memory row width, or memory type. SIMMs, for example, were available in two main physical types: 30-pin (8 bits plus an option for 1 additional parity bit) and 72-pin (32 bits plus an option for 4 additional parity bits), with various capacities and other specifications. Most servers used the 72-pin version. The 30-pin SIMMs were physically smaller than the 72-pin versions, and either version could have chips on one or both sides. SIMMs were widely used from the late 1980s to the late 1990s and have become obsolete. DIMMs are available in three main types.
DIMMs usually hold standard SDRAM or DDR SDRAM chips and are distinguished by different physical characteristics. A standard DIMM has 168 pins, one notch on either side, and two notches along the contact area. The notches enabled the module to be keyed for motherboards using SDRAM DIMMs or for the much rarer EDO DRAM DIMMs. A DDR DIMM, on the other hand, has 184 pins, two notches on each side, and only one offset notch along the contact area. A DDR2 DIMM has 240 pins, two notches on each side, and one notch in the center of the contact area. All DIMMs are either 64 bits (non-ECC/parity) or 72 bits (parity or ECC) wide (data paths). The main physical difference between SIMMs and DIMMs is that DIMMs have different signal pins on each side of the module. That is why they are called dual inline memory modules, and why with only 1 inch of additional length, they have many more pins than a SIMM.
RIMMs also have different signal pins on each side. Three different physical types of RIMMs are available: a 16/18-bit version with 184 pins, a 32/36-bit version with 232 pins, and a 64/72-bit version with 326 pins. Each of these plugs in to the same-sized connector, but the notches in the connectors and RIMMs are different, to prevent mismatches; a given board will accept only one type. By far the most common type is the 16/18-bit version. The 32-bit version was introduced in late 2002, and the 64-bit version was planned for introduction in 2004 but has never been produced. A standard 16/18-bit RIMM has 184 pins, one notch on either side, and two notches centrally located in the contact area. 16-bit versions are used for non-ECC applications, whereas 18-bit versions incorporate the additional bits necessary for ECC. Servers using RIMMs normally use the 18-bit versions. Figures 5.4 through 5.9 show a typical 30-pin (8-bit) SIMM (seldom used in servers, by the way), 72-pin (32-bit) SIMM, 168-pin SDRAM DIMM, 184-pin DDR SDRAM (64-bit) DIMM, 240-pin DDR2 DIMM, and 184-pin RIMM, respectively. The pins are numbered from left to right and are connected through to both sides of the module on the SIMMs. The pins on a DIMM are different on each side, but on a SIMM, each side is the same as the other and the connections carry through. Note that all dimensions are in both inches and millimeters (in parentheses), and modules are generally available in ECC versions with 1 extra ECC (or parity) bit for every 8 data bits (multiples of 9 in data width) or versions that do not include ECC support (multiples of 8 in data width).

Figure 5.4. A typical 30-pin SIMM.
Figure 5.5. A typical 72-pin SIMM.
Figure 5.6. A typical 168-pin SDRAM DIMM.
Figure 5.7. A typical 184-pin DDR SDRAM DIMM.
Figure 5.8. A typical 240-pin DDR2 DIMM.
Figure 5.9. A typical 184-pin RIMM.

All these memory modules are fairly compact, considering the amount of memory they hold, and are available in several capacities and speeds. Table 5.10 lists the various capacities available for SIMMs, DIMMs, and RIMMs.
SIMMs, DIMMs, DDR/DDR2 DIMMs, and RIMMs of each type and capacity are available in various speed ratings. You should consult your motherboard documentation for the correct memory speed and type for your system. It is usually best for the memory speed (also called throughput or bandwidth) to match the speed of the processor data bus (also called the FSB). If a system requires a specific speed, you can almost always substitute faster speeds if the one specified is not available. Generally, no problems occur in mixing module speeds, as long as you use modules equal to or faster than what the system requires. Because there's little price difference between the various speed versions, buying faster modules than are necessary for a particular application might make them more usable in a future system that could require the faster speed. Because DIMMs and RIMMs have onboard SPD that reports their speed and timing parameters to the system, most systems run the memory controller and memory bus at the speed matching the slowest DIMM/RIMM installed. Most DIMMs contain either SDRAM or DDR SDRAM memory chips.

Note

A bank is the smallest amount of memory needed to form a single row of memory addressable by the processor. It is the minimum amount of physical memory that is read or written by the processor at one time and usually corresponds to the data bus width of the processor. If a processor has a 64-bit data bus, a bank of memory is also 64 bits wide. If the memory is interleaved or runs dual-channel, a virtual bank is formed that is twice the absolute data bus width of the processor.

You can't always replace a module with a higher-capacity unit and expect it to work. Systems might have specific design limitations for the maximum capacity of module they can take. A larger-capacity module works only if the motherboard is designed to accept it in the first place. You should consult your system documentation to determine the correct capacity and speed to use.
Registered SDRAM and DDR DIMMs

SDRAM and DDR DIMMs are available in unbuffered and registered versions. Most desktop and some entry-level server motherboards are designed to use unbuffered modules, which allow the memory controller signals to pass directly to the memory chips on the module with no interference. This is not only the least expensive design but also the fastest and most efficient. Its only drawback is that the motherboard designer must place limits on how many modules (that is, module sockets) can be installed on the board and possibly also limit how many chips can be on a module. So-called double-sided modules that really have two banks of chips (twice as many as normal) on board might be restricted on some systems in certain combinations. Systems designed to accept extremely large amounts of RAM, including most servers, often require registered modules. A registered module has register chips on the module that act as an interface between the actual RAM chips and the chipset. The registers temporarily hold data passing to and from the memory chips and enable many more RAM chips to be driven, or otherwise placed on the module, than the chipset could normally support. This allows for motherboard designs that can support many modules and enables each module to have a larger number of chips. In general, registered modules are required by server or workstation motherboards designed to support more than 1GB or 2GB of RAM. The important thing to note is that you can use only the type of module your motherboard (or chipset) is designed to support. To provide the space needed for the register chips, a registered DIMM is often taller than a standard DIMM. Figure 5.10 compares a typical registered DIMM to a typical unbuffered DIMM.

Figure 5.10. A typical registered DIMM is taller than a typical unbuffered DIMM to provide room for buffer and parity/ECC chips.
Tip

If you are installing registered DIMMs in a slim-line or blade server, clearance between the top of the DIMM and the case might be a problem in some situations. Some vendors sell low-profile registered DIMMs that are about the same height as an unbuffered DIMM. You should use this type of DIMM if your system does not have enough headroom for standard registered DIMMs; some vendors sell only this type of DIMM for particular systems.

SIMM Pinouts

Table 5.11 shows the interface connector pinouts for standard 72-pin SIMMs. A separate presence detect table (Table 5.12) shows the configuration of the presence detect pins on various 72-pin SIMMs. The motherboard uses the presence detect pins to detect exactly what size and speed of SIMM is installed. Industry-standard 30-pin SIMMs do not have a presence detect feature, but IBM did add this capability to its modified 30-pin configuration. Note that all SIMMs have the same pins on both sides of the module.
Notice that a 72-pin SIMM uses a set of four or five pins to indicate its type to the motherboard. These presence detect pins are either grounded or not connected to indicate the type of SIMM to the motherboard. Presence detect outputs must be tied to ground through a 0-ohm resistor or jumper on the SIMM to generate a low logic level when the motherboard reads the pin; pins left unconnected read as a high logic level. This produces signals the memory interface logic can decode. If the motherboard uses presence detect signals, a POST procedure can determine the size and speed of the installed SIMMs and adjust control and addressing signals automatically, enabling autodetection of the memory size and speed. In many ways, the presence detect pin function is similar to the industry-standard DX coding used on 35mm film rolls to indicate the ASA (speed) rating of the film to the camera: When you drop the film into the camera, electrical contacts read the film's speed rating via an industry-standard configuration. Table 5.12 shows the JEDEC industry-standard presence detect configuration for the 72-pin SIMM family. As discussed earlier in this chapter, JEDEC is an organization of U.S. semiconductor manufacturers and users that sets semiconductor standards.
Unfortunately, unlike in the photographic film industry, not everybody in the computer industry follows established standards, and presence detect signaling is not a standard throughout the PC industry. Different system manufacturers sometimes use different configurations for what is expected on these four pins. Many Compaq, IBM PS/2, and Hewlett-Packard (HP) systems that used 72-pin SIMMs had nonstandard definitions for these pins. If you service very old servers that use 72-pin memory, don't assume that the memory can always be interchanged between systems. Table 5.13 shows how IBM defined these pins.
Although servers that use SIMMs are most likely to be outdated, you should keep these differences in mind if you are salvaging parts to keep an older server in service or if you must order memory for a server that uses SIMMs. SIMM pins might be tin- or gold-plated. The plating on the module pins must match that on the socket pins, or corrosion will result.

Caution

To have the most reliable system, you must install modules with gold-plated contacts into gold-plated sockets and modules with tin-plated contacts into tin-plated sockets only. If you mix gold contacts with tin sockets, or vice versa, you are likely to experience memory failures from six months to one year after initial installation because a type of corrosion known as fretting takes place. This was a major problem with 72-pin SIMM-based systems because some memory and motherboard vendors opted for tin sockets and connectors, while others opted for gold. According to connector manufacturer AMP's "Golden Rules: Guidelines for the Use of Gold on Connector Contacts" (available at www.amp.com/products/technology/aurulrep.pdf) and "The Tin Commandments: Guidelines for the Use of Tin on Connector Contacts" (available at www.amp.com/products/technology/sncomrep.pdf), you should match connector metals. Commandment 7 from the Tin Commandments specifically states, "Mating of tin-coated contacts to gold-coated contacts is not recommended." If you are maintaining systems with mixed tin/gold contacts in which fretting has already taken place, use a wet contact cleaner. After cleaning, to improve electrical contacts and help prevent corrosion, you should use a liquid contact enhancer and lubricant called Stabilant 22 from D.W. Electrochemicals when installing SIMMs or DIMMs. Its website (www.stabilant.com/llsting.htm) has detailed application notes on this subject that provide more technical details.

DIMM Pinouts

Table 5.14 shows the pinout configuration of a 168-pin registered SDRAM DIMM.
Note again that the pins on each side of the DIMM are different. All pins should be gold-plated.
A DIMM uses a completely different type of presence detect than a SIMM, called serial presence detect (SPD). SPD consists of a small EEPROM, or flash memory, chip on the DIMM that contains specially formatted data indicating the DIMM's features. This serial data can be read via the serial data pins on the DIMM, and it enables the motherboard to autoconfigure to the exact type of DIMM installed. DIMMs for PC-based servers use 3.3V power and might be unbuffered or registered. DIMMs made for Macintosh computers use a 5V buffered design. Keying in the socket and on the DIMM prevents the insertion of 5V DIMMs into a 3.3V slot, or vice versa (see Figure 5.11).

Figure 5.11. 168-pin DRAM DIMM notch key definitions.

DDR DIMM Pinouts

Table 5.15 shows the pinout configuration of a 184-pin DDR SDRAM DIMM. Note again that the pins on each side of the DIMM are different. All pins are typically gold-plated.
DDR DIMMs use a single key notch to indicate voltage, as shown in Figure 5.12.

Figure 5.12. 184-pin DDR SDRAM DIMM keying.
A 184-pin DDR DIMM uses two notches on each side to enable compatibility with both low- and high-profile latched sockets. Note that the key position is offset with respect to the center of the DIMM to prevent it from being inserted in the socket backward. The key notch is positioned to the left, centered, or to the right of the area between pins 52 and 53. The position indicates the I/O voltage for the DDR DIMM and prevents the installation of the wrong type into a socket that might damage the DIMM.

DDR2 DIMM Pinouts

Table 5.16 shows the pinout configuration of a 240-pin DDR2 SDRAM DIMM. Pins 1-120 are on the front side, and pins 121-240 are on the back. All pins should be gold-plated.
A 240-pin DDR2 DIMM uses two notches on each side to enable compatibility with both low- and high-profile latched sockets. The connector key is offset with respect to the center of the DIMM to prevent it from being inserted into the socket backward. The key notch is positioned in the center of the area between pins 64 and 65 on the front (184/185 on the back), and there is no voltage keying because all DDR2 DIMMs run on 1.8V.

RIMM Pinouts

RIMM modules and sockets are gold-plated and designed for 25 insertion/removal cycles. A 16/18-bit RIMM has 184 pins, split into two groups of 92 pins on opposite ends and sides of the module. Table 5.17 shows the pinout configuration of a RIMM.
A 16/18-bit RIMM is keyed with two notches in the center. This prevents backward insertion and prevents the wrong type (voltage) of RIMM from being used in a system. To allow for changes in RIMM designs, three keying options are possible in the design (see Figure 5.13). The left key (indicated as "DATUM A" in Figure 5.13) is fixed in position, but the center key can be in three different positions, spaced 1mm or 2mm to the right, indicating different types of RIMMs. The current default is Option A, as shown in Figure 5.13 and Table 5.18, which corresponds to 2.5V operation.

Figure 5.13. RIMM keying options.
A RIMM incorporates an SPD device, which is essentially a flash ROM onboard. This ROM contains information about the RIMM's size and type, including detailed timing information for the memory controller. The memory controller automatically reads the data from the SPD ROM to configure the system to match the RIMMs installed. Figure 5.14 shows a typical RIMM installation. Note that RIMM sockets not occupied by a module cannot be left empty; they must be filled with a continuity module (essentially a RIMM without memory chips). This enables the memory bus to remain continuous from the controller through each module (and, therefore, each RDRAM device on the module) until the bus finally terminates on the motherboard.

Figure 5.14. Typical RDRAM bus layout showing a RIMM and one continuity module.
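The SPD data described above is simply a block of bytes in a small serial EEPROM, and software can interpret it directly. The following sketch decodes the memory-type byte from a raw SPD dump; the sample bytes and helper name are invented for illustration, and the type codes shown are the commonly documented JEDEC SPD values for SDRAM-family modules.

```python
# Commonly documented JEDEC SPD codes for the "fundamental memory type"
# byte (offset 2) of an SDRAM-family module's SPD EEPROM.
SPD_MEMORY_TYPES = {
    0x04: "SDRAM",
    0x07: "DDR SDRAM",
    0x08: "DDR2 SDRAM",
}

def spd_memory_type(spd_bytes):
    """Return a human-readable memory type from a raw SPD dump."""
    code = spd_bytes[2]
    return SPD_MEMORY_TYPES.get(code, "unknown (0x%02x)" % code)

# Hypothetical first three SPD bytes as they might be read over SMBus
# from a DDR2 module (only byte 2 matters to this helper).
sample = bytes([0x80, 0x08, 0x08])
assert spd_memory_type(sample) == "DDR2 SDRAM"
```

A real SPD dump also encodes row/column address counts, module width, and timing parameters in later bytes; tools such as decode-dimms interpret the full layout.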
Determining a Memory Module's Size and Features

Most memory modules are labeled with a sticker indicating the module's type, speed rating, and manufacturer. If you are attempting to determine whether existing memory can be used in a new server, or if you need to replace memory in an existing server, this information can be very useful. If you have memory modules that are not labeled, however, you can still determine the module type, speed, and capacity if the memory chips on the module are clearly labeled. For example, assume that you have a memory module with chips labeled thus:
By using an Internet search engine such as Google and entering the number from one of the memory chips, you can usually find the chip's datasheet. Note that for a registered memory module, you want to look up the part number of the memory chips (usually eight or more chips) rather than that of the buffer chips on the module (one to three, depending on the module design). In this example, the part number turns out to be a Micron memory chip that decodes like this:
The full datasheet for this example is located at http://download.micron.com/pdf/datasheets/dram/ddr/512MBDDRx4x8x16.pdf. From this information, we can determine that the module has the following characteristics:
If the module has 9 instead of 8 memory chips (or 18 instead of 16), the additional chips are used for parity checking and support ECC error correction on servers with this feature. To determine the size of the module in megabytes or gigabytes and to determine whether the module supports ECC, count the memory chips on the module and compare this number to Table 5.19. Note that with 8-bit-wide chips, the capacity of each chip in megabits equals the capacity of the module's data bank in megabytes, because eight chips (64 bits) make up the bank.
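The chip-counting arithmetic above is easy to automate. This hypothetical helper (the function name and interface are my own, not from any vendor tool) estimates a module's data capacity and ECC support from the chip count, per-chip capacity in megabits, and chip width, assuming x4 or x8 chips:

```python
def module_info(chip_count, chip_megabits, chip_width):
    """Estimate a module's data capacity in MB and whether it carries
    ECC/parity bits, assuming x4 or x8 memory chips."""
    data_width = chip_count * chip_width       # total module width in bits
    if data_width % 72 == 0:                   # 9 bits stored per data byte
        ecc = True
        data_chips = chip_count * 64 // 72     # 8 of every 9 bits hold data
    elif data_width % 64 == 0:                 # plain 64-bit-wide bank(s)
        ecc = False
        data_chips = chip_count
    else:
        raise ValueError("unrecognized module organization")
    capacity_mb = data_chips * chip_megabits // 8   # megabits -> megabytes
    return capacity_mb, ecc

# 9 chips of 512Mb x8: a 72-bit-wide ECC module storing 512MB of data
assert module_info(9, 512, 8) == (512, True)
# 16 chips of 512Mb x4: a 64-bit-wide non-ECC module storing 1024MB (1GB)
assert module_info(16, 512, 4) == (1024, False)
```

A x16 organization (4 or 5 chips per bank) would need extra handling, because the fifth chip contributes only 8 of its 16 bits to ECC; the sketch deliberately leaves that case out.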
The additional chip used by each group of 8 chips provides parity checking, which is used by the ECC function on most server motherboards to correct single-bit errors. A registered module contains full-sized memory chips plus additional chips for ECC/parity and buffering. The buffer chips are usually smaller and located near the center of the module, as shown in Figure 5.10.

Note

Some modules use 16-bit-wide memory chips. In such cases, only 4 chips are needed for single-bank memory (5 with parity/ECC support) and 8 for double-bank memory (10 with parity/ECC support). These memory chips use a design listed as capacity x 16, such as 256Mbx16. You can also see this information if you look up the manufacturer, the memory type, and the organization in a search engine. For example, this web search:
locates a parts list for Micron's 512MB and 1GB modules at www.micron.com/products/modules/ddrsdram/partlist.aspx?pincount=184-pin&version=Registered&package=VLP%20DIMM. The Comp. Config column lists the chip design for each chip on the module. As you can see, with a little detective work, you can determine the size, speed, and type of a memory module, even if the module isn't marked, as long as the markings on the memory chips themselves are legible.

Tip

If you are unable to decipher a chip part number, you can use the HWiNFO or SiSoftware Sandra program to identify your memory module, as well as many other facts about your computer, including chipset, processor, empty memory sockets, and much more. You can download shareware versions of HWiNFO from www.hwinfo.com and SiSoftware Sandra from www.sisoftware.net.

Memory Banks

Memory chips (DIPs, SIMMs, SIPPs, and DIMMs) are organized in banks on motherboards and memory cards. You should know the memory bank layout and position on the motherboard and memory cards when adding memory to a system. In addition, memory diagnostics report error locations by byte and bit addresses, and you must use these numbers to locate which bank in your system contains the problem. The banks usually correspond to the data bus capacity of the system's microprocessor. Table 5.20 shows the widths of individual banks, based on the type of server processor used and whether the chipset operates in single-channel or dual-channel mode.
DIMMs are ideal for Pentium and higher systems because the 64-bit width of the DIMM exactly matches the 64-bit width of the Pentium processor data bus. Therefore, each DIMM represents an individual bank, and DIMMs can be added or removed one at a time. Note that for dual-channel operation, matched pairs of DIMMs must be inserted into the appropriate slots on the motherboard, and that the Itanium 2 runs only in dual-channel mode. The physical orientation and numbering of the memory module sockets used on a motherboard are arbitrary and determined by the board's designers, so documentation covering your system or card is handy, particularly if you want to take advantage of the additional performance available from running recent server designs in dual-channel mode.

Memory Module Speed

When you replace a failed memory module or install a new module as an upgrade, you typically must install a module of the same type and speed as the others in the system. You can substitute a module with a different speed, but only if the replacement module's speed is equal to or faster than that of the other modules in the system. Some people have had problems when mixing modules of different speeds, and with the wide variety of motherboards, chipsets, and memory types, few ironclad rules exist. When in doubt as to which speed module to install in your system, you should consult the motherboard documentation for more information. Substituting faster memory of the same type doesn't result in improved performance if the system still operates the memory at the same speed. Systems that use DIMMs or RIMMs can read the speed and timing features of the module from a special SPD ROM installed on the module and then set chipset (memory controller) timing accordingly. In these systems, you might see an increase in performance by installing faster modules, to the limit of what the chipset will support.
To place more emphasis on timing and reliability, Intel and JEDEC publish standards governing memory types that must meet certain levels of performance. A number of common symptoms result when the system memory has failed or is simply not fast enough for the system's timing: The usual symptoms are frequent parity check errors or a system that does not operate at all, and the POST might report errors, too. If you're unsure of which chips to buy for your system, you should contact the system manufacturer or a reputable chip supplier.
Parity and ECC

Part of the nature of memory is that it inevitably fails. Memory failures are usually classified as two basic types: hard fails and soft errors. The best understood are hard fails, in which a working chip, because of some flaw, physical damage, or other event, becomes damaged and experiences a permanent failure. Fixing this type of failure normally requires replacing some part of the memory hardware, such as the chip, SIMM, or DIMM. Hard error rates are known as HERs. The other, more insidious, type of failure is the soft error, which is a nonpermanent failure that might never recur or could occur only at infrequent intervals. (Soft fails are effectively "fixed" by powering the system off and back on.) Soft error rates are known as SERs. Problems with early memory chips were caused by alpha particles, a very weak form of radiation coming from trace radioactive elements in chip packaging. Although this cause of soft errors was eliminated years ago, cosmic rays have proven to be a major cause of soft errors, particularly as memory chip densities have increased. Although cosmic rays and other radiation events are the biggest cause of soft errors, soft errors can also be caused by the following:
Most of these problems don't cause chips to permanently fail (although bad power or static can damage chips permanently), but they can cause momentary problems with data. Although soft errors are regarded as an unavoidable consequence of desktop and portable computer operation, system lockups are absolutely unacceptable for servers and other mission-critical systems. The best way to deal with this problem is to increase the system's fault tolerance: implementing ways of detecting and possibly correcting errors in PC systems. Historically, early PCs and servers used a type of fault tolerance known as parity checking, while more recent servers use fault-tolerance methods that actually correct memory errors.

Parity Checking

One standard that IBM set for the industry is that the memory chips in a bank of nine each handle 1 bit of data: 8 bits per character plus 1 extra bit, called the parity bit. The parity bit enables the memory-control circuitry to keep tabs on the other 8 bits, providing a built-in cross-check for the integrity of each byte in the system. If the circuitry detects an error, the computer stops and displays a message informing the user of the malfunction. In a GUI operating system, a parity error generally manifests itself as a locked system; upon reboot, the BIOS should detect the error and display the appropriate error message. SIMMs and DIMMs are available both with and without parity bits. Originally, all PC systems used parity-checked memory to ensure accuracy. Although desktop PCs began to abandon parity checking in 1994 (saving 10% to 15% on memory costs), servers continue to use it. Parity can't correct system errors, but because it can detect errors, it can make the user aware of memory errors when they happen. This has two basic benefits:
Let's look at how parity checking works and then examine in more detail its successor, ECC, which can not only detect but also correct memory errors on-the-fly.

How Parity Checking Works

IBM originally established the odd parity standard for error checking. As the 8 individual bits in a byte are stored in memory, a parity generator/checker, which is either part of the CPU or located in a special chip on the motherboard, evaluates the data bits by adding up the number of 1s in the byte. If an even number of 1s is found, the parity generator/checker creates a 1 and stores it as the ninth bit (the parity bit) in the parity memory chip. That makes the sum of all 9 bits (including the parity bit) an odd number. If the original sum of the 8 data bits is an odd number, the parity bit created is a 0, keeping the sum of all 9 bits an odd number. The basic rule is that the value of the parity bit is always chosen so that the sum of all 9 bits (8 data bits plus 1 parity bit) is stored as an odd number. If the system used even parity, the example would be the same, except the parity bit would ensure an even sum. It doesn't matter whether even or odd parity is used; the system uses one or the other, and it is completely transparent to the memory chips involved. Remember that the 8 data bits in a byte are numbered 0 through 7. The following examples might make it easier to understand:

Data bit number: 0 1 2 3 4 5 6 7  Parity bit
Data bit value:  1 0 1 1 0 0 1 1  0

In this example, because the total number of data bits with a value of 1 is an odd number (5), the parity bit must have a value of 0 to ensure an odd sum for all 9 bits. Here is another example:

Data bit number: 0 1 2 3 4 5 6 7  Parity bit
Data bit value:  1 1 1 1 0 0 1 1  1

In this example, because the total number of data bits with a value of 1 is an even number (6), the parity bit must have a value of 1 to create an odd sum for all 9 bits.
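The odd-parity rule above can be expressed in a few lines of code. This is an illustrative sketch (the helper names are invented); it reproduces the two examples from the text, treating data bit 0 as the least significant bit:

```python
def odd_parity_bit(byte):
    """Return the parity bit that makes the total count of 1s odd."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1

def check_odd_parity(byte, parity_bit):
    """True if the 9 stored bits (8 data + 1 parity) sum to an odd count."""
    return (bin(byte & 0xFF).count("1") + parity_bit) % 2 == 1

# First example from the text: bits 0-7 = 1,0,1,1,0,0,1,1 (five 1s),
# so the generated parity bit must be 0.
assert odd_parity_bit(0b11001101) == 0

# Second example: bits 0-7 = 1,1,1,1,0,0,1,1 (six 1s), parity bit = 1.
assert odd_parity_bit(0b11001111) == 1
```

A single flipped bit makes the 9-bit count even, so `check_odd_parity` returns False and the error is detected.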
When a system reads memory back from storage, it checks the parity information. If a (9-bit) byte has an even number of 1 bits, that byte must have an error. The system can't tell which bit has changed or whether only a single bit has changed. If 3 bits changed, for example, the byte still flags a parity-check error; if 2 bits changed, however, the bad byte could pass unnoticed. Because multiple bit errors (in a single byte) are rare, this scheme gives a reasonable and inexpensive ongoing indication that memory is good or bad. Parity error messages vary by system but usually include a reference to a parity check or NMI (non-maskable interrupt). Most systems that use parity checking do not halt the CPU when a parity error is detected; instead, they display an error message and offer the choice of rebooting the system or continuing as though nothing happened. Although you don't need to reboot a system after a parity error, it makes sense to do so because the contents of memory might be corrupted. Obviously, parity checking is not sufficient fault tolerance for servers.

ECC

ECC goes a big step beyond simple parity error detection. Instead of just detecting an error, ECC allows a single-bit error to be corrected, which means the system can continue without interruption and without corrupting data. Older implementations of ECC can only detect, not correct, double-bit errors. Because studies have indicated that approximately 98% of memory errors are the single-bit variety, the most commonly used type of ECC is one in which the attendant memory controller detects and corrects single-bit errors in an accessed data word (double-bit errors can be detected but not corrected). This type of ECC is known as single-bit error correction, double-bit error detection (SEC-DED) and requires an additional 7 check bits over 32 bits in a 4-byte system and an additional 8 check bits over 64 bits in an 8-byte system.
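To make the SEC-DED idea concrete, here is a toy software sketch of a Hamming code with an extra overall parity bit, the classic scheme behind single-bit correction with double-bit detection. It operates on a small list of bits rather than a real 64-bit memory word, and the function names are invented for illustration:

```python
def hamming_secded_encode(data_bits):
    """Encode a list of data bits with Hamming check bits placed at
    power-of-two positions, plus one overall parity bit appended for
    double-error detection."""
    m = len(data_bits)
    r = 0                                   # number of Hamming check bits
    while 2 ** r < m + r + 1:
        r += 1
    code = [0] * (m + r + 1)                # 1-indexed; index 0 unused
    it = iter(data_bits)
    for pos in range(1, m + r + 1):         # data in non-power-of-two slots
        if (pos & (pos - 1)) != 0:
            code[pos] = next(it)
    for i in range(r):                      # each check bit covers the
        p = 2 ** i                          # positions whose index has
        parity = 0                          # that bit set
        for pos in range(1, m + r + 1):
            if (pos & p) and pos != p:
                parity ^= code[pos]
        code[p] = parity
    overall = 0
    for b in code[1:]:
        overall ^= b
    return code[1:] + [overall]

def hamming_secded_decode(codeword):
    """Return (status, codeword); status is 'ok', 'corrected', or
    'double-bit error'."""
    word = [0] + codeword[:-1]              # back to 1-indexed Hamming part
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos                 # XOR of positions holding a 1
    overall = 0
    for b in codeword:
        overall ^= b
    if syndrome == 0 and overall == 0:
        return "ok", codeword
    if overall == 1:                        # odd parity: single-bit error
        fixed = codeword[:]
        fixed[-1 if syndrome == 0 else syndrome - 1] ^= 1
        return "corrected", fixed
    return "double-bit error", codeword     # syndrome set, parity even

data = [1, 0, 1, 1, 0, 0, 1, 1]
cw = hamming_secded_encode(data)
bad = cw[:]; bad[4] ^= 1                    # inject a single-bit error
status, fixed = hamming_secded_decode(bad)
assert status == "corrected" and fixed == cw
```

Flipping two bits instead of one leaves the overall parity even while the syndrome is nonzero, which the decoder reports as a detected-but-uncorrectable double-bit error, exactly the SEC-DED behavior described above.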
Consequently, you can use parity-checked (36-bit SIMM or 72-bit DIMM) memory in any system that supports ECC memory (as most recent servers do), and the system will use the parity bits for ECC mode. RIMMs are installed in singles or pairs, depending on the chipset and motherboard; they must be 18-bit or 36-bit versions if parity/ECC is desired. ECC entails the memory controller calculating the check bits on a memory-write operation, performing a comparison between the read and calculated check bits on a read operation, and, if necessary, correcting bad bits. ECC has a slight effect on memory performance: Writes must wait for the check bits to be calculated, and reads must wait whenever the system corrects data. On a partial-word write, the entire word must first be read, the affected byte(s) rewritten, and then new check bits calculated, which turns partial-word writes into slower read-modify-write operations. Fortunately, this performance hit is very small, on the order of a few percent at maximum, so the tradeoff for increased reliability is good. An ECC-based system is a good choice for servers, workstations, or mission-critical applications in which the cost of a potential memory error outweighs the additional memory and system cost to correct it. If you value your data and use your system for important (to you) tasks, you want ECC memory, assuming, of course, that your system supports it. You should check the specifications for a new server or server motherboard to ensure that it supports ECC.

Advanced Error Correction Technologies

Although single-bit ECC is useful for entry-level servers with memory below 4GB, today's high-capacity servers (some of which have memory sizes up to 64GB) and higher memory module capacities (1GB and larger) need more effective error correction technologies.
Many recent servers support Advanced ECC (also known as ChipKill), which differs from standard ECC in its ability to correct up to 4-bit errors that take place within the same memory module. Early versions of Advanced ECC/ChipKill required special Advanced ECC memory modules, but current implementations support standard parity/ECC memory modules. Another method used on high-end servers is hot-plug RAID memory. This technology uses five memory controllers to create a memory array, similar in concept to a RAID 5 disk array. Four of the controllers store memory data in a striped fashion, while the fifth controller stores parity information. If memory connected to one of the memory controllers used for data fails, it can be removed and replaced without taking down the server; the contents of the original memory are rebuilt from the striped data and parity information in the other modules. Memory scrubbing is another technique many recent servers use: It tests memory during idle periods for errors and, if possible, corrects them. If correction is not possible, the system informs the operator which memory module has failed. If you are building or purchasing a midrange or high-end server, you should find out which types of advanced memory error correction technologies are used and choose hardware that provides the best combination of performance and reliability for your needs.