Troubleshooting Memory Problems


Memory problems can be difficult to troubleshoot. For one thing, computer memory remains mysterious to many people because it is an intangible, "virtual" thing that can be hard to grasp. The other difficulty is that memory problems can be intermittent and often look like problems in other areas of the system, including software. The following sections show some simple troubleshooting steps you can take if you suspect that you are having a memory problem.

To troubleshoot memory, you first need some memory-diagnostics testing programs. You already have several of them. Every motherboard BIOS has a memory-diagnostics program in the POST that runs when you first turn on the system. You probably also have a memory-diagnostics program on a utility disk that came with your system. Many commercial diagnostics programs are available on the market, and almost all of them include memory tests.

When the POST runs, it not only tests memory but also counts it. The count is compared to the count taken the last time BIOS Setup was run; if these counts are different, an error message is issued. As the POST runs, it writes a pattern of data to all the memory locations in the system and reads that pattern back to verify that the memory works. If any failure is detected, you see or hear a message. Audio messages (beeping) are used for critical, or "fatal," errors that occur in areas that are important for the system's operation. If the system can access enough memory to at least allow video to function, you see error messages instead of hearing beep codes.
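The two POST checks described above (comparing the memory count and writing and reading back a pattern) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular BIOS is implemented: `SimulatedRAM`, its stuck-bit fault model, the pattern values, and the function names are all hypothetical.

```python
# Illustrative sketch only: SimulatedRAM stands in for physical memory,
# and the patterns below are assumed, not taken from any real BIOS.
PATTERNS = (0x55, 0xAA, 0x00, 0xFF)


class SimulatedRAM:
    """Byte-addressable memory; stuck_low maps an address to bits forced to 0."""

    def __init__(self, size, stuck_low=None):
        self._cells = bytearray(size)
        self._stuck_low = stuck_low or {}

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, addr, value):
        self._cells[addr] = value

    def __getitem__(self, addr):
        # A stuck-at-0 bit always reads back as 0, whatever was written.
        return self._cells[addr] & ~self._stuck_low.get(addr, 0) & 0xFF


def pattern_test(memory):
    """Write each pattern to every location, read it back, return bad addresses."""
    bad = set()
    for pattern in PATTERNS:
        for addr in range(len(memory)):
            memory[addr] = pattern
        for addr in range(len(memory)):
            if memory[addr] != pattern:
                bad.add(addr)
    return sorted(bad)


def post_memory_check(memory, recorded_size):
    """Mimic the two checks: size comparison first, then the pattern test."""
    errors = []
    if len(memory) != recorded_size:
        errors.append("memory size mismatch")
    bad = pattern_test(memory)
    if bad:
        errors.append(f"memory failure at addresses {bad}")
    return errors
```

Calling `post_memory_check(SimulatedRAM(1024, {300: 0x04}), 1024)` reports a failure at address 300, because the stuck bit corrupts any pattern that sets it.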

Note

Visit the Upgrading and Repairing Servers web page (accessible from www.upgradingandrepairingpcs.com) to download detailed listings of the BIOS beep and other error codes, which are specific to the type of BIOS you have.


If your system makes it through the POST with no memory error indications, there might not be a hardware memory problem, or the POST might not be able to detect the problem. Intermittent memory errors are often not detected during the POST, and other subtle hardware defects can be difficult for the POST to catch. The POST is designed to run quickly, so its testing is not nearly as thorough as it could be. On many recent systems, memory testing is disabled by default to permit faster startup. That is why you often have to boot from a DOS or diagnostic disk and run a true hardware diagnostic to do more extensive memory testing. These types of tests can be run continuously and be left running for days, if necessary, to hunt down an elusive intermittent defect.
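To see why a looping test can catch what a single quick pass misses, consider the following sketch. It assumes a deliberately simple and hypothetical fault model: `IntermittentRAM` corrupts one read out of every `period` reads at a single address. Real intermittent defects are far less regular.

```python
class IntermittentRAM:
    """Simulated memory that corrupts every `period`-th read at one address.

    The fault model is hypothetical and chosen only to be deterministic.
    """

    def __init__(self, size, addr, period):
        self._cells = bytearray(size)
        self._addr = addr
        self._period = period
        self._reads = 0

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, addr, value):
        self._cells[addr] = value

    def __getitem__(self, addr):
        value = self._cells[addr]
        if addr == self._addr:
            self._reads += 1
            if self._reads % self._period == 0:
                return value ^ 0x01  # flip the low bit on the faulty read
        return value


def single_pass(memory, pattern=0xAA):
    """One write/read-back pass; returns the addresses that failed."""
    for addr in range(len(memory)):
        memory[addr] = pattern
    return [addr for addr in range(len(memory)) if memory[addr] != pattern]


def looped_test(memory, max_passes):
    """Repeat the pass until a failure appears or max_passes is reached."""
    for pass_number in range(1, max_passes + 1):
        failures = single_pass(memory)
        if failures:
            return pass_number, failures
    return None
```

With `IntermittentRAM(256, addr=42, period=5)`, a single pass reports nothing, while `looped_test(..., 10)` reports the failure on the fifth pass.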

Still, even these programs do only pass/fail type testing; that is, all they can do is write patterns to memory and read them back. They can't determine how close the memory is to failing, only whether it worked. For the highest level of testing, the best thing to have is a dedicated memory test machine, usually called a memory module tester. This type of device enables you to insert a module and test it thoroughly at a variety of speeds, voltages, and timings, so you can determine for certain whether the memory is good or bad. Versions of these testers are available to handle all types of memory, from older SIMMs to the latest DDR and DDR2 DIMMs or RIMMs. Defective modules can work in some systems (slower ones) but not others, so the same memory test program may fail the module in one machine but pass it in another. In a module tester, such a module is always identified as bad, right down to the individual bit, and the tester even reports the actual speed of the device, not just its rating. Companies that offer memory module testers include Tanisys (www.tanisys.com), CST (www.simmtester.com), and Innoventions (www.memorytest.com). They can be expensive, but for a professional in the PC repair business, using one of these memory module testers is the only way to go.

After your operating system is running, memory errors can still occur, and they are typically accompanied by error messages. These are the most common error messages:

  • Parity errors. This type of error message indicates that the parity-checking circuitry on the motherboard has detected a change in memory since the data was originally stored.

    See "How Parity Checking Works," p. 390.

  • General or global protection faults. This is a general-purpose error message, indicating that a program has been corrupted in memory, usually resulting in immediate termination of the application. This can also be caused by buggy or faulty programs.

  • Fatal exception errors. Error codes are returned by a program when an illegal instruction has been encountered, invalid data or code has been accessed, or the privilege level of an operation is invalid.

  • Divide-by-zero error. This is a general-purpose error message, indicating that division by 0 was attempted or the result of an operation does not fit in the destination register.

If you are encountering any of these errors, they could be caused by defective or improperly configured memory, but they could also be caused by software bugs (especially drivers), bad power supplies, static discharges, close proximity to radio transmitters, timing problems, and more.

If you suspect that these types of problems are caused by memory, there are ways to test the memory to determine whether that is the problem (as discussed earlier in this section). Most of this testing involves running one or more memory test programs.
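As a rough illustration of what the parity-checking circuitry mentioned in the list above does, the sketch below stores one extra parity bit per byte and recomputes it on read. It assumes odd parity (the total number of 1 bits, including the parity bit, is odd); the function names are hypothetical, and real hardware does this in logic gates, not software.

```python
def parity_bit(byte):
    """Return the odd-parity bit for an 8-bit value (assumed scheme)."""
    ones = bin(byte & 0xFF).count("1")
    # If the data already has an odd number of 1s, the parity bit is 0.
    return 0 if ones % 2 == 1 else 1


def store(byte):
    """Return the (data, parity) pair that would be written to memory."""
    return byte, parity_bit(byte)


def check(stored_byte, stored_parity):
    """Recompute parity on read; False corresponds to a parity error."""
    return parity_bit(stored_byte) == stored_parity
```

A single flipped bit changes the recomputed parity and is detected, but two flipped bits in the same byte cancel out and pass the check, which is one reason parity can only detect (not correct) errors.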

Testing Memory

Many people make a critical mistake when they run memory testing software: they run the tests with the system caches enabled. This effectively invalidates memory testing because most systems have what is called a write-back cache. This means that data written to the main memory is first written to the cache. Because a memory test program first writes data and then immediately reads it back, the data is read back from the cache, not from the main memory. This makes the memory test program run very quickly, but all it tests is the cache. The bottom line is that if you test memory with the cache enabled, you aren't really writing to the SIMMs/DIMMs, but only to the cache. Before you run any memory test programs, you need to be sure your cache is disabled. The system will run very slowly when you do this, and the memory test will take much longer to complete, but you will be testing your actual RAM, not the cache.
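The masking effect of a write-back cache can be shown with a small simulation. This is a simplified sketch under stated assumptions: `FaultyRAM` and `WriteBackCache` are hypothetical classes, the cache never evicts or flushes anything, and the fault is a single stuck bit. A real cache is much smaller than main memory and behaves in more complicated ways.

```python
class FaultyRAM:
    """Simulated main memory with one cell whose low bit is stuck at 0."""

    def __init__(self, size, bad_addr):
        self.cells = [0] * size
        self.bad_addr = bad_addr

    def write(self, addr, value):
        # The defective cell silently loses its low bit.
        self.cells[addr] = value & 0xFE if addr == self.bad_addr else value

    def read(self, addr):
        return self.cells[addr]


class WriteBackCache:
    """Simplified write-back cache: writes stay in the cache (no eviction)."""

    def __init__(self, ram, enabled=True):
        self.ram = ram
        self.enabled = enabled
        self.lines = {}

    def write(self, addr, value):
        if self.enabled:
            self.lines[addr] = value  # main memory is not touched
        else:
            self.ram.write(addr, value)

    def read(self, addr):
        if self.enabled and addr in self.lines:
            return self.lines[addr]  # served from cache, not from RAM
        return self.ram.read(addr)


def write_read_test(bus, size, pattern=0xFF):
    """Write a pattern and immediately read it back at every address."""
    failures = []
    for addr in range(size):
        bus.write(addr, pattern)
        if bus.read(addr) != pattern:
            failures.append(addr)
    return failures
```

With the cache enabled, `write_read_test` returns no failures even though address 10 is defective; with `enabled=False`, the same test reports address 10.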

The following steps enable you to effectively test and troubleshoot your system RAM. Figure 5.16 provides a boiled-down procedure to help you step through the process quickly.

1. Power up the system and observe the POST. If the POST completes with no errors, basic memory functionality has been tested. If errors are encountered, go to the defect isolation procedures.

2. Restart the system and enter your BIOS (or CMOS) Setup. In most systems, you do this by pressing the F2, Delete, or Esc key during the POST but before the boot process begins (see the onscreen prompts or system/motherboard documentation for details). When you're in BIOS Setup, verify that the memory count is equal to the amount that has been installed. If the count does not match what has been installed, go to the defect isolation procedures.

3. Find the BIOS Setup options for cache and set all cache options to Disabled. Save the settings and reboot to a DOS-formatted system disk (floppy) that contains the diagnostics program of your choice. Note that some diagnostic disks are self-booting (that is, they contain their own operating systems). If your system came with a diagnostics disk, you can use that, or you can use one of the many commercial PC diagnostics programs on the market, such as PC-Technician by Windsor Technologies (which comes in self-booting form) or others.

4. Follow the instructions that came with your diagnostic program to have it test the system base and extended memory. Most programs have a mode that enables them to loop the test (that is, to run it continuously), which is great for finding intermittent problems. If the program encounters a memory error, proceed to the defect isolation procedures.

5. If no errors are encountered in the POST or in the more comprehensive memory diagnostic, your memory has tested okay in hardware. Be sure at this point to reboot the system, enter the BIOS Setup, and reenable the cache. The system will run very slowly until the cache is turned back on.

6. If you are having memory problems, yet the memory still tests okay, you might have a problem that is undetectable by simple pass/fail testing, or the problem could be caused by software or one of many other defects or problems in your system. You might want to bring the memory to a SIMM/DIMM tester for a more accurate analysis. Most PC repair shops have such testers. You should also check the software (especially drivers, which might need to be updated), power supply, and system environment for problems such as static and radio transmitters.

Figure 5.16. Testing and troubleshooting memory.
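The decision flow in the steps above can also be condensed into a small function. This is only a sketch of the logic; the function name, parameter names, and returned labels are hypothetical and simply restate the outcomes of the steps.

```python
def next_action(post_ok, count_matches, diagnostic_ok, symptoms_persist):
    """Return the next action suggested by the testing steps.

    Each argument is a boolean observation made by the person testing:
    post_ok           - the POST completed with no memory errors
    count_matches     - BIOS Setup shows the installed amount of memory
    diagnostic_ok     - the disk-based diagnostic (cache disabled) passed
    symptoms_persist  - memory-like problems still occur afterward
    """
    if not post_ok or not count_matches or not diagnostic_ok:
        return "go to defect isolation procedures"
    if symptoms_persist:
        return "use a module tester; check software, power supply, environment"
    return "memory tested okay; reenable the cache"
```

For example, `next_action(True, False, True, False)` points to the defect isolation procedures, because the memory count in BIOS Setup did not match.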


Those are the memory testing and troubleshooting procedures; the defect isolation procedures follow.

Memory Defect Isolation Procedures

If you have identified an actual memory problem that is being reported by the POST or disk-based memory diagnostics, you can use the following steps and Figure 5.17 to identify or isolate which SIMM or DIMM in the system is causing the problem:

1. Restart the system and enter the BIOS Setup. Look for a menu called something like Advanced or Chipset Setup for memory timing parameters. Select BIOS or Setup defaults, which are usually the slowest settings. Save the settings, reboot, and retest, using the testing and troubleshooting procedures listed in "Testing Memory," earlier in this chapter. If the problem has been solved, improper BIOS settings were the problem. If the problem remains, you likely have defective memory, so continue to the next step.

2. Open the system so you have physical access to the SIMMs/DIMMs/RIMMs on the motherboard. Identify the bank arrangement in the system. For example, Pentium systems use 64-bit banks, which means two SIMMs or one DIMM per bank. Systems that support dual-channel memory use matched pairs of modules. RIMM-based Pentium 4 systems require two 184-pin RIMMs at a time (in separate channels) or a single 232-pin RIMM if the system uses that type. Using the manual or the legend silk-screened on the motherboard, identify which modules correspond to which banks.

3. Remove all the memory except the first bank and retest, using the troubleshooting and testing procedures listed in "Testing Memory," earlier in this chapter. If the problem remains with all but the first bank removed, the problem has been isolated to the first bank, which must be replaced. With a dual-channel system, you can swap down to a single module for troubleshooting. However, afterward you should restore the matched pair to the system.

4. Replace the memory in the first bank, preferably with known good spare modules (or with others that you have removed), and retest. If the problem still remains after testing all the memory banks (and finding them all to be working properly), it is likely that the motherboard itself is bad (with the problem probably in one of the memory sockets). Replace the motherboard and retest.

5. At this point, the first (or previous) bank has tested good, so the problem must be in the remaining modules that have been temporarily removed. Install the next bank of memory and retest. If the problem resurfaces now, the memory in that bank is defective. Continue testing each bank until you find the defective module.

6. Repeat step 5 until all remaining banks of memory are installed and have been tested. If the problem does not resurface after you remove and reinstall all the memory, the problem was likely intermittent or caused by poor conduction on the memory contacts. Often simply removing and replacing memory can resolve problems because of the self-cleaning action between the module and the socket during removal and reinstallation.

Figure 5.17. Follow these steps if you are still encountering memory errors after completing the steps in Figure 5.16.
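The bank-by-bank isolation in steps 3 through 6 amounts to a simple search, sketched below. It is a simplification under stated assumptions: `system_passes` is a hypothetical callable standing in for a full retest with the given banks installed, the spare label is made up, and the motherboard check is reduced to "a known good spare in the first bank still fails."

```python
def isolate(banks, system_passes, spare_bank="known-good spare"):
    """Find the first defective bank by installing banks one at a time.

    banks         - bank labels in socket order, for example ["bank0", "bank1"]
    system_passes - callable taking the list of installed banks and
                    returning True if the memory test passes (hypothetical)
    """
    installed = [banks[0]]
    if not system_passes(installed):
        # Steps 3-4: first bank alone fails, so swap in a known good spare.
        if not system_passes([spare_bank]):
            return "motherboard suspected"
        return f"defective: {banks[0]}"
    # Steps 5-6: add the remaining banks one at a time and retest each time.
    for bank in banks[1:]:
        installed.append(bank)
        if not system_passes(installed):
            return f"defective: {bank}"
    return "no fault reproduced; likely intermittent or poor contact"
```

For instance, `isolate(["bank0", "bank1", "bank2"], lambda inst: "bank1" not in inst)` returns `"defective: bank1"`, because the test starts failing only after that bank is installed.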





Upgrading and Repairing Servers
ISBN: 078972815X
Year: 2006
Pages: 240
