Chapter 10: Availability and Scalability | Oracle Real Application Clusters


10.1 Introduction

In the previous chapter we browsed through the definitions of the various data dictionary views and parameters pertaining to RAC. All dynamic views that are available in a single stand-alone configuration of Oracle are also available at a global level. These global views provide visibility into all instances participating in the clustered configuration, so DBAs can view the statistics of any instance from any other instance in the cluster. For example, to look at the users connected to instance RAC2 from instance RAC1, the global view GV$SESSION can be queried where INST_ID has a value of 2, or the entire view can be queried to return information from all instances.
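The GV$SESSION lookup described above might be sketched as follows. This is only an illustration: it assumes a Python DB-API style cursor (for example, one obtained from the python-oracledb driver) that is already connected to some instance in the cluster with the privileges needed to read GV$SESSION. The helper name sessions_on_instance is hypothetical.

```python
# Sketch only: assumes a DB-API style cursor connected to any instance
# of the cluster. The helper name below is hypothetical.

SESSIONS_BY_INSTANCE_SQL = """
    SELECT inst_id, sid, serial#, username, status
      FROM gv$session
     WHERE inst_id = :inst_id
       AND username IS NOT NULL
"""


def sessions_on_instance(cursor, inst_id):
    """Return the user sessions connected to the instance with this INST_ID."""
    # The bind variable keeps the statement reusable for any instance number.
    cursor.execute(SESSIONS_BY_INSTANCE_SQL, {"inst_id": inst_id})
    return cursor.fetchall()
```

Calling sessions_on_instance(cursor, 2) from a session on RAC1 would list the users on RAC2; dropping the INST_ID predicate would return sessions from every instance.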

Oracle has introduced and maintains several initialization parameters that are specific to a RAC configuration. The previous chapter also discussed these parameter definitions. The usage of many of the parameters is covered later in the appropriate chapters of this book.

Availability of a football player of 94% indicates that the player has missed one game of a 16-game season. However, 99.97% availability of a computer system indicates a downtime of roughly 2.6 hours in a year. Today's business requirements are to meet 99.99% or 99.999% availability, which allow only about 53 minutes or about 5 minutes of downtime in a year, respectively. To meet these high-availability requirements with virtually no downtime in a year, the critical success factor is for systems to provide automatic failover when one participating system fails, with minimal interruption to the user. If this does not happen when a system or a participating node fails, in either a clustered or a non-clustered configuration, a considerable amount of time is spent migrating the user from the failed node to another (re-establishing the database connection, re-executing the query, the user having to browse through the screens to his or her last view, etc.). When such failures happen frequently, the result is user frustration and, subsequently, a potential loss of business.
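The relationship between an availability percentage and the downtime it permits is simple arithmetic. As a minimal sketch, assuming a 365-day year of 8,760 hours and continuous (24 x 7) operation, the function name below is illustrative only:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_pct):
    """Hours of downtime per year permitted by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for pct in (99.97, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the step from 99.99% to 99.999% leaves only minutes per year for all failures combined.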

Availability of enterprise systems is not confined to the database tier, because it is not only the database tier that can fail. When an availability requirement of 99.999% is specified, it applies to the entire system. This includes the database tier, the application tier, the firewalls, interconnects, networks, LAN, storage subsystems, and the controllers to these storage subsystems, because every tier in the enterprise system is prone to failure. To meet the availability target across the enterprise, every tier must meet the same SLA requirement of 99.999%. As discussed in Chapter 2 (Hardware Concepts), the availability of the enterprise system can be achieved by providing a redundant architecture. This means that every subsystem or component should have redundant hardware, so that if one piece of hardware fails, the redundant piece is available to provide the required functionality and keep the business running.
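Why every tier matters, and why redundancy helps, can be illustrated with the standard reliability formulas for components in series and in parallel. The sketch below assumes that failures are independent, which real systems only approximate; the tier names and availability figures are hypothetical values chosen for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies=2):
    """Availability of identical redundant components where one suffices."""
    return 1 - (1 - availability) ** copies


# Hypothetical tiers, each meeting 99.999% on its own.
tiers = {
    "database": 0.99999,
    "application": 0.99999,
    "firewall": 0.99999,
    "network": 0.99999,
    "storage": 0.99999,
}

end_to_end = serial_availability(tiers.values())
print(f"End to end across {len(tiers)} tiers: {end_to_end:.5%}")

# A single component at 99.9%, duplicated so that either copy can serve.
print(f"Redundant pair at 99.9% each: {redundant_availability(0.999):.4%}")
```

Under these assumptions the end-to-end figure is always lower than that of the weakest tier, while duplicating a component raises its availability well above that of a single copy.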

Figure 10.1 represents a two-node total redundancy system with failover options at every level of the equipment. This type of totally redundant configuration supports and provides a higher availability and includes the following functions:

When a node leaves the cluster, the remaining systems go through a cluster state transition and during this phase, automatically adjust to the new cluster membership.
When one path to the storage device fails, the device can be accessed through an alternate path.
If one communication path fails, systems that are part of the clustered configuration can still exchange information on the remaining communication paths.
Using volume shadowing and a mirrored disk option provides availability at the disk subsystem/storage level. This means that when one disk fails, the information is still available on the other mirrored disk. For example, in a RAID 1 configuration, which is mirror only, the data is written simultaneously to two disks: the original disk and the mirrored disk. When one of the disks becomes unusable, the system continues operation by accessing the other disk as if no loss had occurred. Because of this, users are not affected.

Figure 10.1: Various levels of redundancy for high-availability systems.

The high-availability scenario also applies to application systems accessing the clustered configuration. In the event of a server failure (due to hardware, application, or operator fault) another node, configured as the secondary node (in an active/passive configuration), automatically assumes the responsibilities of the primary node. At this point, users are switched to that node without operator intervention. This switch minimizes downtime.

A cluster is designed to avoid a single point of failure. Applications are distributed over more than one computer (in an active/active configuration), achieving a degree of parallelism and failure recovery and providing high availability. In this type of configuration, when one node in the cluster fails, the other nodes automatically assume the responsibilities of the failed node and its users are distributed across the remaining available nodes.
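The redistribution of users after a node failure in an active/active configuration can be pictured with a toy model. The sketch below is not how any particular cluster manager or Oracle component works; it assumes a simple least-loaded placement policy, and the function name and node names are hypothetical.

```python
def redistribute(sessions_by_node, failed_node):
    """Move the failed node's sessions onto the least-loaded survivors."""
    # Copy the surviving nodes so the caller's data is left untouched.
    survivors = {
        node: list(sessions)
        for node, sessions in sessions_by_node.items()
        if node != failed_node
    }
    if not survivors:
        raise RuntimeError("no surviving node to fail over to")

    for session in sessions_by_node.get(failed_node, []):
        # Assumed policy: place each session on the node with fewest sessions.
        target = min(survivors, key=lambda node: len(survivors[node]))
        survivors[target].append(session)
    return survivors


cluster = {"RAC1": ["a", "b"], "RAC2": ["c"], "RAC3": ["d", "e", "f"]}
print(redistribute(cluster, "RAC3"))
```

In an active/passive configuration the same model degenerates to a single survivor that receives every session from the failed primary.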

While availability of the systems is critical to the enterprise, it is equally critical that the application systems, the layered hardware architecture, and the technology selected are able to accommodate increased usage. This implies that the hardware should be able to scale without much difficulty. Scalability requirements are very similar to availability requirements, yet they are seldom treated as critical. Businesses that start with a low capital investment often start with small hardware models and at some point reach a saturation level, where the hardware has reached its limits and is unable to handle the increased number of users on the system. At this stage, either additional hardware needs to be purchased or the capacity of the current systems must be increased.

Increasing the capacity will take the enterprise a certain distance, allowing the business to continue and to accept the growth in users. However, growth achieved by increasing the capacity of the existing hardware (that is, by adding resources such as CPU, storage, and memory) eventually reaches a stage where no further increase is possible, because every model or hardware platform is designed for a certain maximum resource capacity. After that point, the hardware must be replaced with a higher-specification model (vertical scaling). One of the biggest drawbacks of the vertical scalability model is that it does not support availability. When this bigger hardware fails, the system is down and the application is not usable.

Figure 10.2 represents the vertical and horizontal scalability of the systems. When the computers are vertically scaled, the system configuration increases due to the computer now having additional resources compared to the standard configuration. Under the horizontal scalability, the size of the computer remains the same but another computer is added, almost doubling the resources and aiding in distribution of load between both computers. Horizontal scalability also provides higher availability because users can be migrated to the other available computer if one node fails.

Figure 10.2: Vertical and horizontal scalability representation.

Another option would be that additional hardware (nodes) could be added to the existing hardware. This provides for distribution of work (scalability) and when any node or hardware fails, the other systems in the configuration are available to carry on with the business (availability).

Availability and scalability of enterprise systems depend on several factors, including the hardware, the application, the database, and other media. On the database tier, availability and scalability in an Oracle environment can be achieved through features such as Oracle Data Guard (ODG), Oracle Advanced Replication (OAR), or RAC.

ODG and OAR are high-availability options; however, they do not provide immediate availability in the true sense, but rather availability in the event of a disaster. The availability provided by RAC is closer to real time, because of its clustered architecture and the fact that multiple nodes or instances share a single common copy of the physical database, providing data consistency and distribution to users accessing the database from multiple nodes or instances.

In this chapter we will discuss the various high-availability and scalability features provided by RAC. RAC provides a clustered database solution where two or more nodes share a common storage subsystem. The nodes are connected to each other via a high-speed cluster interconnect option, normally a Gigabit Ethernet. Each node contains an instance and is configured to be in an active state. The database that is configured on this common shared storage subsystem is accessible to the users from any instance.

Under RAC, when one of the participating systems fails, the users are migrated to another system, providing a failover mechanism. While this is a normal failover using the features of the clustered operating system, RAC provides additional failover opportunities through features such as Transparent Application Failover (TAF): database connections and processes that have lost their connection are reconnected, and this failover is transparent to the user. In the next section we will look at availability.
