High availability (HA) is one of the components contributing to continuous service provision for applications, masking or eliminating both planned and unplanned downtime of systems and applications. This is achieved by eliminating hardware and software single points of failure (SPOF). High availability solutions should eliminate single points of failure through appropriate design, planning, hardware selection, software configuration, application control, careful environment control, and change management discipline.

High availability versus fault tolerance : Based on the response time and response action to detected system failures, clusters and systems can generally be classified as:
• Fault-tolerant
• High availability

Fault-tolerant systems : Systems provided with fault tolerance are designed to operate virtually without interruption, regardless of the failure that may occur (except perhaps for a complete site going down due to a natural disaster).

High availability systems : Systems configured for high availability are a combination of hardware and software components configured to work together to ensure automated recovery in case of failure, with a minimal acceptable downtime.

For the purpose of designing and implementing a high-availability solution for networked robotic stations integrated in a manufacturing environment, the following terminology and concepts are introduced:

RMC : The Resource Monitoring and Control (RMC) function gives one the ability to monitor the state of system resources and respond when predefined thresholds are crossed, so that many routine tasks can be performed automatically.

Cluster : A cluster defines relationships among cooperating systems, where peer cluster nodes provide the services offered by a cluster node when that node is unable to do so. There are two types of high availability clusters:
• Peer domain
• Managed domain

Node : A robot controller that is defined as part of a cluster. Each node has a collection of resources (disks, file systems, IP addresses, and applications) that can be transferred to another node in the cluster in case the node or a component fails.

Clients : A client is a system that can access the application running on the cluster nodes over a local area network.

Topology : Contains the basic cluster components: nodes, networks, communication interfaces, communication devices, and communication adapters.

Resources : Logical components or entities that are made highly available (for example, file systems, raw devices, service IP labels, and applications) by being moved from one node to another.

Service IP label : A label matching a service IP address, used for communications between clients and the node.

IP address takeover : The process whereby an IP address is moved from one adapter to another adapter on the same logical network.

Resource takeover : The operation of transferring resources between nodes inside the cluster. If one component or node fails due to a hardware or operating system problem, its resource groups are moved to another node.

Failover : The movement of a resource group from one active node to another node (backup node) in response to a failure on that active node (a sketch follows this glossary).

Fallback : The movement of a resource group back from the backup node to the previous node when it becomes available, typically in response to the reintegration of the previously failed node.
Heartbeat packet : A packet sent between communication interfaces in the cluster, used to monitor the state of the cluster components: nodes, networks, and adapters (see the heartbeat sketch after this glossary).

RSCT processes : Reliable Scalable Cluster Technology (RSCT) processes. They consist of two processes (Topology Services and Group Services) that monitor the state of the cluster and of each node. The cluster manager receives event information generated by these processes and takes corresponding response actions in case of failure(s).

Group Leader (GL) : The node with the highest IP address as defined in one of the cluster networks (the first network available), which acts as the central repository for all topology and group data coming from the RSCT daemons concerning the state of the cluster.

Group Leader backup : The node with the next highest IP address on the same arbitrarily chosen network, acting as a backup for the Group Leader; it takes over in the event that the Group Leader leaves the cluster.

Mayor : A node chosen by the RSCT Group Leader (the node with the next highest IP address after the GL backup), if such a node exists; otherwise the Mayor is the GL backup itself. It is the Mayor's responsibility to inform other nodes of any changes in the cluster as determined by the Group Leader (GL).

Quorum : The notion of quorum is used to ensure that, in case of loss of connectivity between two subsets of the peer domain, only one subset considers itself the peer domain. The quorum is defined as n/2 + 1, where n is the number of nodes defined in the peer domain (a worked example follows this glossary).

SPOF : A single point of failure (SPOF) is any individual component integrated in a cluster whose failure renders the application unavailable to end users. Good design removes single points of failure in the cluster: nodes, storage, and networks.
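To make the failover and fallback definitions concrete, the following is a minimal Python sketch of a resource group moving between nodes. It is illustrative only, not the interface of any particular HA product; the node names (rc1, rc2) and the resource-group name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool = True

@dataclass
class ResourceGroup:
    """A set of resources (disks, file systems, service IP labels,
    applications) that moves between nodes as a unit."""
    name: str
    home_node: Node        # node the group prefers to run on
    backup_node: Node      # node that takes over on failure
    active_node: Node = None

    def __post_init__(self):
        self.active_node = self.home_node

    def failover(self):
        """Move the group to the backup node when the active node fails."""
        if not self.active_node.healthy:
            print(f"{self.name}: failover {self.active_node.name} -> {self.backup_node.name}")
            self.active_node = self.backup_node

    def fallback(self):
        """Move the group back once the failed home node reintegrates."""
        if self.home_node.healthy and self.active_node is not self.home_node:
            print(f"{self.name}: fallback {self.active_node.name} -> {self.home_node.name}")
            self.active_node = self.home_node

# Usage: robot controller rc1 fails, its resource group moves to rc2,
# then returns when rc1 reintegrates.
rc1, rc2 = Node("rc1"), Node("rc2")
group = ResourceGroup("robot-station-1", home_node=rc1, backup_node=rc2)
rc1.healthy = False
group.failover()     # rc1 -> rc2
rc1.healthy = True
group.fallback()     # rc2 -> rc1
```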
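The heartbeat mechanism can be sketched as a pair of loops: one sends a small UDP packet at a fixed interval, and one declares a peer failed after a number of missed intervals. The port number, interval, and missed-beat limit below are illustrative assumptions, not values mandated by RSCT.

```python
import socket
import time

HEARTBEAT_PORT = 5005   # assumption: any free UDP port works
INTERVAL = 1.0          # seconds between heartbeat packets
MISSED_LIMIT = 3        # missed beats before a peer is declared failed

def send_heartbeats(node_name: str, peer_addr: str):
    """Send a small heartbeat packet to a peer interface at a fixed interval."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(node_name.encode(), (peer_addr, HEARTBEAT_PORT))
        time.sleep(INTERVAL)

def monitor_heartbeats():
    """Track the last heartbeat from each peer; report peers that go silent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(INTERVAL)
    last_seen: dict[str, float] = {}
    while True:
        try:
            data, _ = sock.recvfrom(64)
            last_seen[data.decode()] = time.time()
        except socket.timeout:
            pass
        for node, seen in list(last_seen.items()):
            if time.time() - seen > INTERVAL * MISSED_LIMIT:
                print(f"{node}: {MISSED_LIMIT} heartbeats missed, start resource takeover")
                del last_seen[node]   # stop reporting; failover handles the node
```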
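The quorum formula and the IP-based Group Leader election can also be shown with a short worked sketch; the node addresses are hypothetical. For a five-node peer domain, quorum is 5/2 + 1 = 3 (integer division), so after a split into subsets of three and two nodes, only the three-node subset continues as the peer domain.

```python
import ipaddress

def quorum(n: int) -> int:
    """Minimum subset size that may continue as the peer domain: n/2 + 1."""
    return n // 2 + 1

def elect(nodes: list[str]):
    """Rank nodes by IP address: highest is the Group Leader, next is the
    GL backup, next (if any) is the Mayor; otherwise the GL backup is
    also the Mayor."""
    ranked = sorted(nodes, key=ipaddress.ip_address, reverse=True)
    gl, backup = ranked[0], ranked[1]
    mayor = ranked[2] if len(ranked) > 2 else backup
    return gl, backup, mayor

nodes = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14", "10.0.0.15"]
print(quorum(len(nodes)))   # 3: a 2-node partition of this domain stops
print(elect(nodes))         # ('10.0.0.15', '10.0.0.14', '10.0.0.13')
```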