Quintero - Deploying Linux on IBM E-Server Pseries Clusters Authors: N Published year: 2003 Pages: 37-38/108
Chapter 4. Linux for pSeries RAS and problem determination

Problem determination is a key area of system administration, and administrators can spend hours finding out what is wrong with a system. In this chapter, we discuss how to do centralized system logging using syslog and evlog. We also discuss Linux rescue methods and provide tips for fixing boot partition errors, corrupt file systems, and configuration problems. We then cover the ppc64 Run Time Abstraction Service (RTAS), which allows you to query and extract information from the pSeries firmware. We introduce the tools developed by the IBM Linux Technology Center that bring Linux closer to AIX: commands modeled on those used in AIX to extract hardware information and patch system firmware. Finally, we discuss common tuning methods in Linux and the tools you can use to apply them.
4.1 Linux on pSeries RAS

The IBM pSeries hardware is known for its reliability, availability, and serviceability (RAS) capabilities, which draw on IBM's experience in developing mainframes and mission-critical servers. Much of the RAS design analyzes failures within the Central Electronic Complex (CEC), either to eliminate errors or to contain them so that they do not bring down the entire server. Some of the RAS features that you see available for Linux on pSeries are:
Much of the pSeries server RAS design is operating system-independent. This means that you do not need the AIX operating system to exploit most of the RAS capability inside the hardware.

During the boot process, the Built-In Self Test (BIST) and Power-on Self Test (POST) check the processors, caches, and memory before the operating system is loaded. If a critical error is detected, the system tries to deallocate the component and continue the boot process. In this way, your system is not at risk of running with a faulty component. Detected errors are logged into the system non-volatile RAM (NVRAM). Refer to 4.1.2, "IBM diagnostics tools" on page 168 for more information on the nvram.

Surveillance of the system operation is provided by the service processor. The service processor checks for heartbeats from the operating system and records them. It can be configured to automatically reboot the system if it does not detect any heartbeat within a default time interval. If the system is unable to come up successfully, the service processor logs the error and leaves the system powered on.

The service processor is also designed to report errors to the Service Focal Point. In environments where the system is attached to a Hardware Management Console (HMC), the errors are logged and reported to the Service Focal Point application running in the HMC.

In addition, the IBM diagnostic tools for Linux on pSeries record and analyze pSeries-specific messages, and log them into the Linux system log facility. Generic software and hardware errors are also recorded and analyzed by the Linux error log analysis (LELA). Refer to 4.1.2, "IBM diagnostics tools" on page 168 for a description of the packages inside the IBM diagnostics tools.
4.1.1 RunTime abstraction service in PPC64

Specific to the PowerPC kernel, /proc/rtas/* provides an interface for interacting with the service processor directly. The RunTime Abstraction Service (RTAS) in Linux is enabled by default in SuSE Linux Enterprise Server (SLES) 8; alternatively, you can recompile the kernel by hand with the CONFIG_PPC_RTAS option. In Figure 4-1, you can see how RTAS in Linux interacts with the pSeries firmware.

Figure 4-1. ppc64 Linux RunTime abstraction service
The open source community is continuously improving RTAS in the PowerPC kernel; here are some of the RTAS services in the /proc file system that we can use today:
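Because the set of entries depends on the kernel you are running, it can help to look at what your own system exposes. The following is a minimal sketch, not part of the IBM tools: the function name list_rtas_entries is our own, and the only path it assumes is the /proc/rtas directory named above.

```shell
# List the entries the running kernel exposes under an RTAS proc directory.
# The default path comes from the text; entry names vary by kernel build,
# so nothing here assumes that a particular file exists.
list_rtas_entries() {
    dir="${1:-/proc/rtas}"
    if [ ! -d "$dir" ]; then
        echo "no RTAS interface at $dir" >&2
        return 1
    fi
    for entry in "$dir"/*; do
        # Skip the unexpanded glob when the directory is empty.
        [ -e "$entry" ] || continue
        basename "$entry"
    done
}
```

Calling list_rtas_entries with no argument inspects /proc/rtas; a non-zero return suggests that the kernel was built without CONFIG_PPC_RTAS or that the system is not a pSeries server.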
4.1.2 IBM diagnostics tools

IBM has recently released the Linux for pSeries Service aids for hardware diagnostics. The service aids allow system administrators to extract valuable information from the robust pSeries service processor for problem determination and servicing. Many of the commands packaged inside the service aids are very similar to the commands that you may find in AIX.

The service aids can be downloaded from the Web site:

    http://techsupport.services.ibm.com/server/Linux_on_pSeries

The IBM diagnostics tools require POWER4-based systems and a supported release of Linux on pSeries (SuSE SLES8 SP3 or Red Hat Advanced Server 3). To install the packages, run:

    # rpm -ivh ppc64-utils-0.4-77.rpm
    # rpm -ivh lsvpd-0.9.2-1.ppc.rpm
    # rpm -ivh diagela-1.1.0.1-2.ppc.rpm
    # rpm -ivh IBMinvscount-2.1-1.ppc.rpm
    # rpm -ivh devices.chrp.base.ServiceRM-2.1.0.0-2.ppc.rpm
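After the installation, you may want to confirm that every package is present. The following is a minimal sketch under two assumptions: the package names are inferred from the RPM file names above (name-version-release), and the helper missing_packages with its QUERY override is our own invention, not an IBM command.

```shell
# Report which of the named RPM packages are not installed.
# QUERY is the command used to test one package; it defaults to "rpm -q"
# and can be overridden so the function can be exercised without rpm.
missing_packages() {
    query="${QUERY:-rpm -q}"
    for pkg in "$@"; do
        # $query is left unquoted on purpose so that it splits into
        # the command and its options.
        if ! $query "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}
```

For example, running missing_packages ppc64-utils lsvpd diagela IBMinvscount devices.chrp.base.ServiceRM prints one line for each package that rpm -q does not find, and prints nothing when all five are installed.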
Important: Be aware that the last RPM in this list (devices.chrp.base.ServiceRM) depends on the installation of the five RMC RPMs (src, rsct.core.utils, rsct.core, csm.core, and csm.client). These RPMs are also downloadable from:

    http://techsupport.services.ibm.com/server/Linux_on_pSeries

You need to initialize lsvpd if you are running it for the first time:

    # /etc/init.d/lsvpd start

Make sure that the lsvpd service also starts at boot time. The following command creates symbolic links inside /etc/rc.d/rc3.d and /etc/rc.d/rc5.d so that the service starts at runlevels 3 and 5:

    # chkconfig lsvpd 35

After installing the packages, you find the following commands available to extract information from your pSeries server using Linux. These commands are installed into the /usr/sbin/ibmras or /usr/sbin/ directory. Some of these commands require root access.
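To verify that the runlevel links for lsvpd were created, you can look for them directly. The sketch below assumes the runlevel directories named above and the common S<NN><service> naming for start links; the function name start_links is our own.

```shell
# Print the start links for a service in the given runlevel directories.
# Assumes start links are symbolic links named S<NN><service>,
# for example S12lsvpd; adjust the pattern if your distribution differs.
start_links() {
    service="$1"; shift
    found=1
    for rcdir in "$@"; do
        for link in "$rcdir"/S[0-9][0-9]"$service"; do
            if [ -L "$link" ]; then
                echo "$link"
                found=0
            fi
        done
    done
    # Return 0 only if at least one link was found.
    return "$found"
}
```

For example, start_links lsvpd /etc/rc.d/rc3.d /etc/rc.d/rc5.d should print one link per runlevel directory; if it prints nothing and returns non-zero, run the chkconfig command again.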
Refer to the IBM Redbook Effective System Management Using the IBM Hardware Management Console for pSeries, SG24-7038, for information on how to set service authority.

Corresponding to the above screen, you will notice that the Operator Panel of the LPAR will show as "Flashing". Figure 4-4 on page 172 shows the Operator Panel.

Figure 4-4. Operator in HMC showing Firmware Flashing in progress
Whenever the rtas_errd background daemon detects an error reported by the system firmware, it activates the analysis program to determine what kind of problem has occurred. After the analysis, it reports the result to the system logs and to any other notification mechanism that you have configured. Example 4-5 on page 174 shows the analysis of a power failure as reported by the diagela daemon.

Example 4-5. /var/log/messages showing analysis of power failure of the system

Diagela for Linux for pSeries

    Oct 28 14:29:41 lpar1 diagela: 10/28/2003 14:29:40
    Oct 28 14:29:41 lpar1 diagela: Automatic Error Log Analysis reports the following:
    Oct 28 14:29:41 lpar1 diagela:
    Oct 28 14:29:41 lpar1 diagela: 651204 ANALYZING SYSTEM ERROR LOG
    Oct 28 14:29:41 lpar1 diagela: A loss of redundancy on input power was detected.
    Oct 28 14:29:41 lpar1 diagela:
    Oct 28 14:29:41 lpar1 diagela: Check for the following:
    Oct 28 14:29:41 lpar1 diagela: 1. Loose or disconnected power source connections.
    Oct 28 14:29:41 lpar1 diagela: 2. Loss of the power source.
    Oct 28 14:29:41 lpar1 diagela: 3. For multiple enclosure systems, loose or
    Oct 28 14:29:41 lpar1 diagela: disconnected power and/or signal connections
    Oct 28 14:29:41 lpar1 diagela: between enclosures.
    Oct 28 14:29:41 lpar1 diagela:
    Oct 28 14:29:41 lpar1 diagela: Supporting data:
    Oct 28 14:29:41 lpar1 diagela: Ref. Code: 10111520
    Oct 28 14:29:41 lpar1 diagela:
    Oct 28 14:29:41 lpar1 diagela: Analysis of /var/log/platform sequence number: 3
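On a busy system the diagela lines are interleaved with messages from other services. The following is a minimal sketch for pulling out only the analysis text; it assumes the syslog line layout shown in Example 4-5 (month, day, time, host name, then the diagela tag), and the function name diagela_messages is our own.

```shell
# Print only the diagela message text from a syslog-style file,
# dropping the "Mon DD HH:MM:SS host diagela: " prefix on each line.
# Lines written by other services are not printed.
diagela_messages() {
    logfile="${1:-/var/log/messages}"
    sed -n 's/^[A-Z][a-z][a-z] \{1,\}[0-9]\{1,2\} [0-9:]\{8\} [^ ]\{1,\} diagela: \{0,1\}//p' "$logfile"
}
```

Running diagela_messages with no argument reads /var/log/messages, which usually requires root access; you can also pass the path of a saved copy of the log.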