21.1 The mainframe architecture

The mainframe architecture is a long-lived architecture that is designed to excel at business computing. This section is for readers looking for a quick understanding of what an explicit architecture can do for them, as well as an understanding of the key architectural components in z/Architecture that contribute to the mainframe's robustness. This section discusses what architecture is, the mainframe instruction set, mainframe recoverability, storage, and the interrupt structure. The full mainframe architecture is described in z/Architecture Principles of Operation, SA22-7832.

The mainframe is based on the von Neumann computer model illustrated in Figure 21-1. While most computers are based on this model, the mainframe is different from most other implementations of it. The mainframe has always allowed parallelism in all three components, which allows the mainframe to perform large amounts of I/O without adversely affecting the other components.

Figure 21-1. The von Neumann model of computing


We discussed the fundamental resources of a computer (CPU, memory, and I/O) in Chapter 2, "Introducing the Mainframe." In this section, we discuss the following architectural topics:

Instruction set. The mainframe is a complex instruction set computer (CISC).
Interrupts. The mainframe architecture is interrupt driven.
The program status word (PSW). The PSW contains information required for the execution of the currently active program.

The mainframe architecture is still evolving, from its modest beginnings with the S/360 systems to the parallel processing and clusters of today. Table 21-1 shows some important milestones and their impact on storage and virtual storage development. The current zSeries architecture has evolved from S/360 in an upwardly compatible way so that your applications continue to run, without modifications, when you upgrade your mainframe.

Table 21-1. Summary of mainframe architecture development
Time	System announced, features
1964	System/360
1970	System/370 EC mode
1972	Virtual memory, multiprocessing, and multiple address spaces
1983	System/370 Extended Architecture, extends addressing from 24 to 31 bits, real and virtual
1988	Enterprise Systems Architecture/370, access registers, data spaces
1990	Enterprise Systems Architecture/390
1994	S/390 Parallel Enterprise Server, CMOS technology, and sysplex clusters with Coupling Facility as cluster-aware shared read-write storage
2000	zSeries, z/Architecture, 64-bit addressing, real and virtual
2003	Architecture extension to allow attachment of SCSI devices to a mainframe

21.1.1 Architecture versus design

Architecture, in the context of this book, is an explicit, formal definition that states the behavior and results from a set of valid and, in some cases, invalid, inputs. An architecture will usually refrain from specifying the technology and design for any implementation.

The zSeries architecture specifies the structure of a computing machine (CPU, memory, and I/O) of the class called symmetric multiprocessors. This particular architecture specifies attributes of the components.

The CPU definition specifies: The number of CPUs allowed, the interaction of the CPUs, the interrupt structure, and the interaction of CPUs with memory and I/O.

The memory definition specifies: The atomicity of stores and loads of storage, real and virtual storage, and some of the behavior for dynamic address translation.

The I/O definition specifies: The behavior of the I/O subsystem, the channel programming instructions for communicating with external devices, the paths to the devices, and the response to normal and abnormal conditions that might surface from either a CPU or a device with respect to I/O operations.

In contrast, the design is a set of engineering specifications from which the actual computing machine will be built. The design of a zSeries machine that implements the zSeries architecture specifies how the symmetric multiprocessing capability is implemented so as to preserve the behavior defined in the architecture.

For example, the architecture specifies that only one instruction is executed at a time. The design of the zSeries machine allows for partial execution of multiple instructions during a machine cycle. The design preserves the "one and only one" restriction by ensuring that multiple partial executions are not detectable. If the machine were put into a "stop state," there would be no evidence of multiple partial executions. Similarly, if a page fault occurred during a partial execution, that condition would not be surfaced until the architecturally correct time.

Similarly, the architecture specifies only the data input and output behavior for an instruction. The architecture assumes that the physical implementation itself does not fail and fully implements what is specified. According to the architecture, if an instruction fails to produce the correct result, a specific type of machine check interrupt should occur. The zSeries design has special circuitry that assures that either the results are correct or a machine failure is raised for whatever might go wrong.

The architecture defines a single memory structure where all load and store instructions happen atomically. In the zSeries design, there are multiple layers of cache and significant logic circuitry that assure that, even though each processor has its own unique cache, the atomicity of the load and store still behaves architecturally correctly while gaining a significant performance improvement over a design that was built on a single level of store.

21.1.2 The mainframe instruction set

The mainframe is a complex instruction set computer (CISC). CISC instructions aim to minimize the number of instructions that need to be processed. These instructions tend to be relatively complex. Intel-based architectures also use a CISC instruction set.

In contrast, reduced instruction set computers (RISC) aim to reduce the instruction set by getting rid of all but the most necessary instructions and replacing more complex instructions with groups of smaller ones. CISC and RISC represent two different design strategies aimed at reducing processing time.

Most RISC computers excel at CPU-intense computing, for example, simulating microseconds of nuclear explosions or weather forecasting. The mainframe CISC computer excels at commercial computing involving heavy I/O traffic.

21.1.3 Recovery from hardware error

Some knowledge of hardware errors is useful for understanding how the mainframe accomplishes recovery. Hardware failures can be transient, permanent, or intermittent:

A transient error (also called a soft error) occurs randomly when environmental conditions, noise, or cosmic particles cause an incorrect result but the circuit itself is functioning correctly. Errors in CMOS technology are predominantly environmental and, therefore, transient. A transient error can be recovered by retrying the operation. Of interest is the mainframe's ability to (1) detect the error and (2) recover dynamically and transparently from the error.
A permanent error (also called a hard error) is an error in a hardware unit, such as a circuit: The circuit no longer gives the correct output, given the same input. A permanent error requires repair or replacement of the failing unit. Again, the mainframe has the ability to (1) detect the error and (2) replace the failed unit without application downtime.
Intermittent errors sometimes produce an incorrect result, sometimes not. They can be handled as transient errors if recoverable. They become permanent errors if the error recurs beyond a threshold.

In a zSeries machine, each central processor contains dual instruction processing units. The units operate simultaneously and independently of each other. The results of processing an instruction are compared dynamically. If the results do not match, the instruction is retried. This retry capability allows the zSeries to detect and recover from transient errors. zSeries computers achieve the retry with no measurable loss of performance. The recovery is controlled by hardware and is independent of any operating system.

An error that cannot be successfully retried or that exceeds a certain threshold is considered a permanent error that requires repair rather than recovery. The mainframe achieves that repair dynamically with dynamic CPU chip sparing.
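
The compare-and-retry idea described above can be sketched in a few lines of Python. This is only an illustration of the logic, not a description of the zSeries hardware: the names execute_checked, make_flaky_adder, and PermanentError, as well as the retry limit of 3, are assumptions made for the example.

```python
from typing import Callable, Tuple


class PermanentError(Exception):
    """Raised when retries are exhausted; the unit needs repair, not recovery."""


def make_flaky_adder(bad_calls: int) -> Callable[[int, int], int]:
    """Return an adder whose first `bad_calls` results have a flipped low bit.

    This stands in for a processing unit hit by a transient error.
    """
    state = {"calls": 0}

    def add(a: int, b: int) -> int:
        state["calls"] += 1
        if state["calls"] <= bad_calls:
            return (a + b) ^ 1  # simulated incorrect result
        return a + b

    return add


def execute_checked(unit_a: Callable[[int, int], int],
                    unit_b: Callable[[int, int], int],
                    a: int, b: int,
                    max_retries: int = 3) -> Tuple[int, int]:
    """Run the same operation on both units and compare the results.

    Returns (result, retries_used). A mismatch triggers a retry; once
    max_retries (an illustrative threshold) is exceeded, the error is
    treated as permanent.
    """
    for attempt in range(max_retries + 1):
        if (result := unit_a(a, b)) == unit_b(a, b):
            return result, attempt
    raise PermanentError("results still differ after retrying")
```

With one healthy unit and one unit that fails once, execute_checked(make_flaky_adder(0), make_flaky_adder(1), 2, 3) returns (5, 1): the correct sum after a single retry. If one unit keeps producing wrong results, the function raises PermanentError, which corresponds to the point at which the machine would turn to sparing.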

Some microprocessors are designated as "spares." If a running CPU chip fails and an instruction retry is unsuccessful, the spare CPU chip begins executing at precisely the instruction where the other CPU chip failed. Activation by the spare is done completely by hardware, with no operating system awareness. The system is restored to full capacity at machine speed as opposed to hours of downtime for swapping in a new card or board. Therefore, Linux, as well as operating systems such as z/OS, benefits.

In addition, the mainframe provides memory chip sparing: An error threshold is maintained for each chip and, when exceeded, a new chip is nondisruptively substituted by the memory subsystem hardware.
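
Threshold-based sparing, as described for CPU chips and memory chips above, can be pictured with the following sketch. The class SparingPool, its method names, and the threshold value are hypothetical; in the real machine this bookkeeping is done by hardware and is invisible to the operating system.

```python
from typing import Dict, List, Optional


class SparingPool:
    """Tracks error counts per active unit and swaps in a spare past a threshold."""

    def __init__(self, active: List[str], spares: List[str], threshold: int) -> None:
        self.active = list(active)
        self.spares = list(spares)
        self.threshold = threshold
        self.errors: Dict[str, int] = {unit: 0 for unit in self.active}

    def record_error(self, unit: str) -> Optional[str]:
        """Count one error against `unit`.

        Returns the name of the spare that replaced the unit, or None if
        the unit is still within its error threshold.
        """
        self.errors[unit] += 1
        if self.errors[unit] <= self.threshold:
            return None  # still treated as recoverable
        if not self.spares:
            raise RuntimeError("no spare unit available")
        spare = self.spares.pop(0)
        # The spare takes over the failing unit's position.
        self.active[self.active.index(unit)] = spare
        del self.errors[unit]
        self.errors[spare] = 0
        return spare
```

For example, with pool = SparingPool(["chip0", "chip1"], ["spare0"], threshold=2), the first two calls to pool.record_error("chip1") return None, and the third returns "spare0", after which pool.active is ["chip0", "spare0"].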

Similarly, the mainframe provides cache-line sparing. When an error threshold is exceeded, the defective cache line can be nondisruptively removed and later substituted by hardware. On the mainframe, all data in the cache hierarchy are protected by data redundancy, provided by means of write-through cache design and ECC. Errors are both detected and corrected.

Microcode patches can also be applied nondisruptively.

21.1.4 Storage

Imagine storage as being a long horizontal string of bits. For most operations, accesses to storage proceed in a left-to-right sequence. The string of bits is subdivided into units of eight bits, or bytes. Each byte location is identified by its byte address. Only bytes can be addressed; in other words, the storage is byte-addressable.

There are three basic types of addressing:

Absolute addressing: the use of actual byte-positions in the main storage.
Real addressing: like absolute addressing, except that a real address must be prefixed by a bit-string to form the absolute address. In a multiprocessor environment, all CPUs share storage, yet every CPU must have its own unique prefix area from 0 to 8 KB (more precisely, from 0 to 8 KB - 1). The zSeries architecture solves the problem by giving each CPU a unique prefix. This prevents clashes between CPUs when they reference their own private low-core pages, yet they can still address the same storage locations if necessary.
Virtual addressing: a virtual address is translated into a real address. Virtual addresses may indicate bytes not currently in main storage, but which will be brought in from auxiliary storage by paging. A virtual storage address can exceed the maximum address of installed absolute storage.
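
The relationship between real and absolute addresses can be illustrated with a short sketch. It assumes the swap behavior that, to our understanding, z/Architecture Principles of Operation defines for prefixing: real addresses in the first 8 KB are redirected to the CPU's prefix block, real addresses inside the prefix block are redirected to the first 8 KB, and all other addresses are unchanged. The function name real_to_absolute and the sample prefix values are made up for the example.

```python
PREFIX_AREA = 8192  # size of the per-CPU prefix area (8 KB), as stated in the text


def real_to_absolute(real: int, prefix: int) -> int:
    """Translate a real address to an absolute address for one CPU.

    `prefix` is the absolute address of that CPU's prefix block and is
    assumed here to be a multiple of 8 KB.
    """
    if prefix % PREFIX_AREA != 0:
        raise ValueError("prefix must be aligned on an 8 KB boundary")
    if 0 <= real < PREFIX_AREA:
        # The CPU's private low storage lives in its own prefix block.
        return prefix + real
    if prefix <= real < prefix + PREFIX_AREA:
        # The reverse mapping keeps absolute block 0 reachable.
        return real - prefix
    # Everything else is shared and maps one-to-one.
    return real
```

Two CPUs with prefixes 0x4000 and 0x6000 both use real address 0x10, but the references land at absolute 0x4010 and 0x6010, so their low-storage areas never clash. A real address outside both ranges, such as 0x9000, resolves to the same absolute location on every CPU.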

21.1.5 Interrupts

In order to process the workload and use the processor resources efficiently, a technique is needed to facilitate switching control from one task to another. While one task waits, another can execute. This switching is driven by interrupts.

What are interrupts?

The mainframe architecture is interrupt-driven. An interrupt is an event that alters the sequence in which the processor executes instructions. An interrupt can be solicited (specifically requested by the program) or unsolicited (caused by an event that is not related to the executing task).

The interrupt process consists of the hardware recognizing that a special condition has occurred (see list below), storing interrupt information in a well-defined location, and causing the instruction flow to transfer control to the specific interrupt handler, which then proceeds to execute. When the interrupt handler finishes, it typically exits back to the system dispatcher. The interrupt behavior of the zSeries architecture is quite similar to that of many other computer architectures.

Types of interrupt

Mainframe interrupts are grouped into six classes, listed here in order of priority:

Supervisor call. Caused by the SUPERVISOR CALL instruction.
Program. Enables the CPU to respond to and report exceptions and events that occur during the execution of programs.
Machine check. Enables the CPU to respond to malfunctioning equipment.
External. Enables the CPU to respond to various signals from inside or outside the configuration.
Input/Output. Enables the CPU to respond to conditions, including errors, in I/O devices and the channel subsystem.
Restart. Provides a means for the operator or another CPU to invoke the execution of a specified program.

Each interrupt type has an old program status word (PSW) and a new PSW associated with it. The six classes are distinguished by the storage locations at which the old PSW is stored and from which the new PSW is fetched. During an interrupt, the CPU stores the current PSW as an old PSW and fetches a new one. Along with the old PSW, information that identifies the cause of the interrupt is stored. The old PSW contains the address of the instruction that would have been executed next, had the interrupt not occurred, thus permitting resumption of the interrupted program.
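
The PSW swap just described can be modeled in a few lines. The sketch below is deliberately simplified: the PSW is reduced to an instruction address and a single enabled flag, the per-class storage locations are represented by dictionary keys rather than real storage addresses, and all names and addresses (PSW, CPU, interrupt, resume, 0x8000, and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# The six interrupt classes named in the text.
CLASSES = ("supervisor_call", "program", "machine_check",
           "external", "io", "restart")


@dataclass(frozen=True)
class PSW:
    instruction_address: int
    enabled: bool = True  # stand-in for the real interrupt mask bits


@dataclass
class CPU:
    psw: PSW                        # current PSW
    new_psw: Dict[str, PSW]         # one new PSW per class, set up by the OS
    old_psw: Dict[str, PSW] = field(default_factory=dict)
    interrupt_code: Dict[str, int] = field(default_factory=dict)

    def interrupt(self, cls: str, code: int) -> None:
        """Store the current PSW as the old PSW for `cls`, then load the new one."""
        self.old_psw[cls] = self.psw       # remembers where to resume
        self.interrupt_code[cls] = code    # identifies the cause
        self.psw = self.new_psw[cls]       # control passes to the handler

    def resume(self, cls: str) -> None:
        """Reload the old PSW so the interrupted program continues."""
        self.psw = self.old_psw[cls]
```

If a program is running at address 0x1000 and an I/O interrupt arrives, cpu.interrupt("io", 0x42) saves the PSW with address 0x1000 in the I/O old-PSW slot and loads the I/O new PSW. A later cpu.resume("io") returns control to address 0x1000.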

For each processor, the old and new PSWs are stored in the real storage area called the Prefixed Save Area (PSA), as shown in Figure 21-2 for a single processor.

Figure 21-2. Single processor low-storage PSA


Each type of interrupt has a first level interrupt handler (FLIH) in the operating system kernel or nucleus. The new PSWs that are loaded have instruction addresses that point to the corresponding FLIH. A FLIH itself cannot take an interrupt; it runs disabled. The FLIH saves the register state from the interrupted process. It must not be interrupted until at least the raw state information is safely stored.

The processor can be enabled or disabled, by means of mask bits, for external, I/O, machine check, and certain program interrupts. If an interrupt of one of the first three types occurs while the processor is disabled for that type, the hardware leaves the interrupt pending until the processor is enabled for it again.
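
The effect of disabling a processor for an interrupt class can be sketched as follows. The class InterruptMask and its methods are invented for the illustration; the sketch covers only the three classes that the text says are left pending while the processor is disabled, and it ignores priorities among classes.

```python
from typing import Dict, List

MASKABLE = ("external", "io", "machine_check")


class InterruptMask:
    """Delivers interrupts for enabled classes and holds the rest pending."""

    def __init__(self) -> None:
        self.enabled: Dict[str, bool] = {cls: True for cls in MASKABLE}
        self.pending: List[str] = []
        self.delivered: List[str] = []

    def disable(self, cls: str) -> None:
        self.enabled[cls] = False

    def enable(self, cls: str) -> None:
        """Enable a class and deliver anything of that class left pending."""
        self.enabled[cls] = True
        still_pending = []
        for item in self.pending:
            if item == cls:
                self.delivered.append(item)
            else:
                still_pending.append(item)
        self.pending = still_pending

    def raise_interrupt(self, cls: str) -> None:
        """Deliver immediately if enabled; otherwise leave the interrupt pending."""
        if self.enabled[cls]:
            self.delivered.append(cls)
        else:
            self.pending.append(cls)
```

If the processor is disabled for I/O interrupts and both an I/O and an external interrupt are raised, only the external interrupt is delivered. The I/O interrupt stays in the pending list and is delivered as soon as enable("io") is called.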

In a tightly coupled multiprocessor, each processor has a unique PSA assigned to it, as shown in Figure 21-3 for a 64-bit system.

Figure 21-3. Multiprocessor low-storage PSA
