
With the growing complexity of applications in today’s embedded designs, fault tolerance and high availability have become standard system requirements. Typical specifications for the most dependable equipment call for a 99.999% uptime, yet some aerospace, medical, automotive, and telecommunications customers find even this level of performance unacceptable. Embedded-system designers recognize that, although downtime is costly and possibly life-threatening, all systems are subject to failure at any time. The art of high-availability system design is to economically manage inevitable failures to eliminate or minimize interruptions to the underlying embedded function or service.
Although the purpose of any embedded system is to perform a task or to provide a service to a user, high-availability systems are designated as mission-, revenue-, or safety-critical to demonstrate the importance of their dependability. The availability of a system is usually expressed in terms of the service it provides, and any number of causes can interrupt that service: hardware failures; operating-system failures; application-software failures; operator error; outside influences, such as power outages; and planned maintenance. High-availability systems include built-in techniques to continue providing the service in spite of these hardware or software failures.
Availability is a function of system reliability and the time it takes to restore operation. Expressed another way, availability is the probability that a service or system is ready to use at any time. Reliability is a measure of a system’s continuous uptime and is usually designated as MTBF (mean time between failures). Hardware manufacturers derive theoretical calculations of MTBF from statistical failure rates of the components used in a product or system. Some of the accepted standards for performing hardware-reliability predictions are MIL-HDBK-217 and Bellcore TR-332. The second element of availability is the time it takes to return a system to operation after a failure. This repair time is also known as the MTTR (mean time to repair). Even if a system fails frequently, it can provide high availability as long as you can quickly return it to service. Availability is expressed as:
Availability=MTBF/(MTBF+MTTR).
Systems are generally classified as high-availability systems if they can provide uninterrupted service 99.999% of the time. Often referred to as “five nines,” 99.999% translates to approximately five minutes of downtime per year. Some new communications switching equipment specifications call for “six nines,” or 99.9999%, which means about 30 seconds of downtime in a year.
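As a quick check of the arithmetic above, the following Python sketch evaluates the availability formula and converts an availability figure into yearly downtime. The function names and the sample MTBF and MTTR values are illustrative assumptions, not figures from any product.

```python
# Minimal sketch: availability from MTBF and MTTR, and the yearly downtime it implies.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR); both arguments use the same time unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per year for a given availability (0 to 1)."""
    return (1.0 - avail) * MINUTES_PER_YEAR


for label, a in (("five nines", 0.99999), ("six nines", 0.999999)):
    print(f"{label}: {downtime_minutes_per_year(a):.2f} minutes of downtime per year")

# Hypothetical system: fails every 10,000 hours but recovers in 6 minutes (0.1 hour).
print(f"availability: {availability(10_000, 0.1):.6f}")
```

The hypothetical system in the last line illustrates the point that fast repair matters as much as long MTBF: a 6-minute recovery keeps a 10,000-hour-MTBF system just above five nines.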
EXTENDING MTBF
One of the most basic ways to increase reliability is to review the design of individual circuits within your system. Simply ensuring that all components in your design are functioning well within their specified operating range can eliminate many failures. Replacing higher-failure-rate components, such as rotating memory, with ruggedized, solid-state devices is another improvement. You can use circuit-simulation tools to uncover unsuspected high-stress elements before releasing a design to production. You can also extend MTBF by improving the circuit cooling and airflow to eliminate heat buildup. Environmental stress screening in the form of extreme temperatures, humidity, and vibration can also catch early failures before you place a system in operation.
Beyond careful circuit design, you can introduce fault tolerance so that your system continues to operate in the presence of failures. For hardware, you can achieve fault tolerance by introducing redundant elements to the system. The most basic method of providing redundancy is to simply add a component to the system. The added component does not begin operating until the primary component fails. This method is known as “standby,” or “powered spares.” Such a system requires circuitry or software to determine which components are active and which are standby. An extension of the standby system, load sharing, uses the redundant hardware to provide part of the service. In this manner, the system provides one level of service when all components are functioning properly and a lower but satisfactory level of service when one of the redundant components fails. Another classic approach to high availability is to employ two or three instances of a circuit with a final check to compare the results of each circuit. With two circuits, a disagreement reveals that a failure has occurred, and, with three circuits, operation can continue by using the results from the two circuits that agree.
Figure 1 SBS Technologies’ Cascade II high-availability system includes two CompactPCI backplane buses for hot standby and full-redundancy options.
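The two- and three-circuit comparison schemes can be sketched in software as a comparator and a majority voter. This is a minimal illustration in Python; the function names are assumptions, and a real design would typically implement the final check in hardware.

```python
# Minimal sketch of duplex comparison and majority voting over redundant outputs.
from collections import Counter
from typing import Hashable, Optional, Sequence


def duplex_agree(a: Hashable, b: Hashable) -> bool:
    """Two circuits: a mismatch signals a failure but cannot say which circuit is wrong."""
    return a == b


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Three or more circuits: return the value a strict majority agrees on, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


print(duplex_agree(42, 7))          # False: a failure is detected
print(majority_vote([42, 42, 7]))   # 42: the faulty third circuit is outvoted
```

Returning `None` when no majority exists leaves the caller to decide how to handle an unresolvable disagreement, for example by switching to a safe state.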
You must consider the scope of a fault when designing redundant components to ensure that a single failure will not also cripple the backup. Failures in bus-based systems are prime examples of the types of faults that can affect multiple boards because any failure directly connected to a backplane signal could block all data transfers. System providers battle bus-type failures with dual backplanes and automatic switchover. For example, the Cascade II high-availability telecomm system from SBS Technologies includes two single-board computers; a watchdog alarm module; hot-swap redundant power supplies; and two independent, 64-bit, 33-MHz CompactPCI buses (Figure 1).
CLUSTER COMPUTING
If your high-availability system also happens to be a complete computer system, such as a server providing an information-delivery service, you can increase reliability by combining multiple systems into a cluster. Clusters also increase the performance of a system by distributing the computational load over multiple nodes. Management software can automatically remove failed nodes from the cluster and transfer the computational task to another node. The overall cluster then continues to provide the required service at a degraded performance level. An advantage of the cluster arrangement is that the individual computer systems can be low-cost, off-the-shelf devices that become nodes in the overall cluster. Although numerous commercial cluster-software products are available, you can get step-by-step instructions for setting up a low-cost, dependable cluster server from the Linux High Availability project (www.linux-ha.org). This open-software project offers Heartbeat, a Linux add-on for automatic resource allocation, system monitoring, and IP-address management in a cluster-based system.
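To illustrate how management software might remove a failed node and move its work elsewhere, here is a minimal Python sketch. It is not the Linux-HA Heartbeat interface; the `Cluster` class, its methods, and the timeout value are hypothetical, and time is passed in explicitly so that the behavior is easy to follow.

```python
# Minimal sketch of heartbeat-based node removal and task reassignment.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    timeout_s: float                                        # silence longer than this marks a node failed
    last_seen: Dict[str, float] = field(default_factory=dict)
    tasks: Dict[str, List[str]] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a node reported in at time `now` (joins the cluster if new)."""
        self.last_seen[node] = now
        self.tasks.setdefault(node, [])

    def assign(self, task: str) -> str:
        """Give the task to the least-loaded live node and return that node's name."""
        if not self.tasks:
            raise RuntimeError("no live nodes left in the cluster")
        node = min(self.tasks, key=lambda n: len(self.tasks[n]))
        self.tasks[node].append(task)
        return node

    def reap(self, now: float) -> List[str]:
        """Remove nodes whose heartbeat is overdue and reassign their tasks."""
        failed = [n for n, seen in self.last_seen.items() if now - seen > self.timeout_s]
        orphaned: List[str] = []
        for node in failed:
            del self.last_seen[node]
            orphaned.extend(self.tasks.pop(node))
        for task in orphaned:
            self.assign(task)
        return failed


cluster = Cluster(timeout_s=3.0)
cluster.heartbeat("node-a", now=0.0)
cluster.heartbeat("node-b", now=0.0)
for name in ("t1", "t2", "t3"):
    cluster.assign(name)
cluster.heartbeat("node-b", now=5.0)   # node-a has gone silent
print(cluster.reap(now=5.0))           # node-a is removed; its tasks move to node-b
```

After the failed node is reaped, the surviving node carries the whole load, which corresponds to the degraded performance level described above.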
Coding schemes are another important method of increasing the dependability of data-handling systems. Parity bits and CRC bytes add redundant information to data streams to detect word or block errors. Communications systems may then add time-based redundancy in which the transmitter resends a message if the receiver detects an error. High-reliability systems also employ self-correcting codes where enough redundant information is added to a data stream so that the receiver can recover the original data in the case of some transmission dropouts.
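The combination of an error-detecting code and time-based redundancy can be sketched as follows. This Python example appends a CRC-32 (from the standard `zlib` module) to each message and resends when the check fails. The framing format, the function names, and the deliberately faulty test channel are assumptions made for illustration; error-correcting codes are not shown.

```python
# Minimal sketch: CRC-32 error detection with retransmission on failure.
import zlib
from typing import Callable, Optional


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes) -> Optional[bytes]:
    """Return the payload if its CRC matches, otherwise None."""
    if len(framed) < 4:
        return None
    payload, received_crc = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received_crc:
        return payload
    return None


def send_with_retry(payload: bytes,
                    channel: Callable[[bytes], bytes],
                    max_attempts: int = 3) -> Optional[bytes]:
    """Resend the framed message until the receiver's check passes or attempts run out."""
    for _ in range(max_attempts):
        received = check(channel(frame(payload)))
        if received is not None:
            return received
    return None


def make_flaky_channel(corrupt_first_n: int) -> Callable[[bytes], bytes]:
    """Hypothetical test channel that flips one bit in each of the first N frames."""
    sent = {"count": 0}

    def channel(data: bytes) -> bytes:
        sent["count"] += 1
        if sent["count"] <= corrupt_first_n:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return channel


print(send_with_retry(b"status ok", make_flaky_channel(corrupt_first_n=2)))
```

With two corrupted frames and three allowed attempts, the third transmission arrives intact; if the channel corrupted all three, the function would return `None` and the caller would have to escalate.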
When a system goes down, your first impulse is to look for hardware failures. Yet with today’s complex systems, the problem could just as likely be a software malfunction. In fact, you should spend as much time analyzing possible software failures as hardware failures in the design of a high-availability system. Software reliability differs from hardware reliability in that it does not degrade over time. In fact, software usually gets more reliable over time as bugs are identified and corrected. Unlike hardware faults, software faults are basically design faults, which are more difficult to detect and correct.
Figure 2 SelfReliant 2.0 from Go-Ahead Software runs as a middleware layer and manages hardware, operating systems, applications, and clusters for high-availability applications.
You can also use redundancy to achieve fault tolerance in software if you can provide multiple implementations of software components. Because most software bugs show up only during certain timing or loading conditions that were not demonstrated during verification testing, high-availability developers sometimes assign two or more independent design teams to implement the same code. With names such as recovery blocks, N-version programming, and N-self-checking programming, multiple-version fault-tolerance techniques attempt to provide complete tolerance to software faults through design diversity.
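Of the techniques named above, the recovery block is the simplest to sketch: run the primary version, check its result with an acceptance test, and fall back to an independently written alternate if the check fails. The Python example below is a minimal illustration under that assumption; the integer-square-root variants, including the deliberately buggy primary, are hypothetical.

```python
# Minimal sketch of a recovery block with an acceptance test.
import math
from typing import Any, Callable, Sequence


def recovery_block(variants: Sequence[Callable[..., Any]],
                   acceptance_test: Callable[..., bool],
                   *args: Any) -> Any:
    """Try each independently written variant in order until one passes the test."""
    for variant in variants:
        try:
            result = variant(*args)
        except Exception:
            continue  # a crash counts as a failed variant
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("every variant failed the acceptance test")


def primary_isqrt(n: int) -> int:
    """Hypothetical primary version with a design fault: correct for only a few inputs."""
    return n // 2


def alternate_isqrt(n: int) -> int:
    """Hypothetical alternate version from a second team."""
    return math.isqrt(n)


def isqrt_acceptable(result: int, n: int) -> bool:
    """Acceptance test: result must be the floor of the square root of n."""
    return result >= 0 and result * result <= n < (result + 1) ** 2


print(recovery_block([primary_isqrt, alternate_isqrt], isqrt_acceptable, 10))
```

The scheme is only as good as the acceptance test: a check that is too weak lets a wrong primary result through, and the alternate version is never consulted.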
The most widely used approach to gain software fault tolerance is to distribute the computational load over multiple machines in a networked or cluster arrangement. Distributed application-software programs rely on middleware, such as CORBA (Common Object-Request Broker Architecture) from the Object Management Group, to enable location independence. CORBA is an open, vendor-independent architecture and infrastructure that applications use to work together over networks. A CORBA-based program from any vendor on almost any computer, operating system, programming language, and network can interoperate with a similar CORBA-based program from any vendor on another computer using a standard protocol.
Another consideration of fault tolerance is the latency required to switch to a backup system and continue operation. Many online and telephone applications specify recovery times of less than 100 ms. Developing systems with stringent fault recovery and latency requirements is easier if each of the subsystems is also built to be fault-tolerant. For example, newer systems are using InfiniBand switched-fabric architecture to connect individual subsystems while incorporating fault tolerance (see sidebar “InfiniBand: a self-healing, system-interconnection fabric”).
CAUGHT IN THE MIDDLE
Figure 3 The Foundation HA extensions to Wind River Systems’ VxWorks real-time operating system support data- and service-critical embedded products.
Most high-availability systems also include middleware software for fault management. Designers use hardware- or software-performance-monitoring techniques, such as data-range checks, checksums, redundant-circuit comparisons, and watchdog time-outs, to detect failures in high-availability systems. After detection, a diagnosis phase isolates the failure to a subsystem, which you can deactivate or replace. There are as many recovery techniques as there are designers, but some of the more popular methods include failover, load balancing, software replacement, and rebooting. In the case of hardware failures, the final fault-management step is to repair the system by physically replacing the defective component. SelfReliant 2.0 from Go-Ahead Software is a recent example of an off-the-shelf commercial software product for increasing system reliability in embedded- and enterprise-level systems (Figure 2). This product runs as a middleware layer on top of an operating system and manages critical hardware, an operating system, other middleware, applications, and clusters to enable fault tolerance and high availability. SelfReliant automatically detects failures and invokes predetermined actions to enable 99.999% or greater system uptime and maintain continuity of the service.
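The detect, isolate, and recover sequence can be sketched with two of the detection techniques named above, a watchdog time-out and a data-range check, driving a failover to a standby unit. This Python sketch is generic and does not represent the SelfReliant interface; all class names, function names, and limits are hypothetical, and time is passed in explicitly.

```python
# Minimal sketch of fault detection (watchdog + range check) followed by failover.
from typing import List, Optional


class Watchdog:
    """Expires if the monitored unit does not kick it within timeout_s."""

    def __init__(self, timeout_s: float, now: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick = now

    def kick(self, now: float) -> None:
        self.last_kick = now

    def expired(self, now: float) -> bool:
        return now - self.last_kick > self.timeout_s


def in_range(value: float, low: float, high: float) -> bool:
    """Data-range check on a monitored reading."""
    return low <= value <= high


class RedundantPair:
    """One active unit and one standby; failover promotes the standby."""

    def __init__(self, primary: str, standby: str) -> None:
        self.active: str = primary
        self.standby: Optional[str] = standby
        self.failed: List[str] = []   # units isolated and awaiting physical repair

    def failover(self) -> str:
        if self.standby is None:
            raise RuntimeError("no standby unit available")
        self.failed.append(self.active)
        self.active, self.standby = self.standby, None
        return self.active


def supervise(pair: RedundantPair, watchdog: Watchdog, reading: float,
              low: float, high: float, now: float) -> bool:
    """One monitoring pass: on a detected fault, isolate the active unit and fail over."""
    if watchdog.expired(now) or not in_range(reading, low, high):
        pair.failover()
        watchdog.kick(now)   # restart the timer for the newly active unit
        return True
    return False


wd = Watchdog(timeout_s=0.1, now=0.0)
pair = RedundantPair("cpu-a", "cpu-b")
print(supervise(pair, wd, reading=25.0, low=0.0, high=85.0, now=0.05))  # healthy pass
print(supervise(pair, wd, reading=25.0, low=0.0, high=85.0, now=0.20))  # watchdog expired
print(pair.active, pair.failed)
```

The `failed` list stands in for the final repair step: the defective unit stays isolated until someone physically replaces it and restores the standby.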
Embedded real-time operating systems have also begun to include high-availability features for scheduling and resource control. Wind River Systems now offers Foundation HA software extensions for its VxWorks RTOS, which provide fault notification and a device-management system that allows you to add, remove, and reconfigure devices while the system is running (Figure 3). Hardware-management features include “hot swap,” which lets you remove and replace peripherals in a backplane while other redundant I/O cards remain active, and “CPU Failover,” which supports live switchover of a failed CPU to a standby CPU without rebooting or restarting the system (Figure 4).
Figure 4 The US$995 System Monitor hot-swap CompactPCI board from One Stop Systems offers real-time monitoring of system temperature, power supplies, and fans.
Many high-availability systems are too large and complex for a single vendor to completely manufacture and are generally made up of components from multiple suppliers. Several industry organizations promote and standardize high-availability systems and services among member firms. The High Availability Forum, whose members include Intel, Hewlett-Packard, Motorola, and RadiSys, aims to standardize the interfaces and capabilities of high-availability building blocks (Reference 1). The Service Availability Forum (www.saforum.org) seeks to drive industry adoption of open-interface specifications that will benefit software, equipment, and service providers when disparate, proprietary pieces are connected in packet-based, multiservice networks.
So, what’s next in high-availability systems? PICMG, the organization behind CompactPCI, is working on a new family of specifications for building the next generation of high-end, carrier-grade equipment, which may push the six nines’ dependability zone. The new architecture, AdvancedTCA, is oriented toward switch-fabric technology instead of a conventional parallel bus. The specifications will allow board, backplane, and enclosure manufacturers to independently develop interoperable products for tomorrow’s high-availability applications.
REFERENCES
1. “Providing Open Architecture High Availability Solutions,” High Availability Forum, February 2001, developer.intel.com/platforms/applied/eiacomm/haforum.htm.
INFINIBAND: A SELF-HEALING, SYSTEM-INTERCONNECTION FABRIC
Kevin Deierling, Vice President, Product Marketing, Mellanox Technologies
The InfiniBand architecture supports several mechanisms allowing the I/O fabric to detect and correct errors, commonly referred to as failover. The InfiniBand architecture incorporates failover capabilities directly in the hardware, meaning that the fabric is in effect self-healing. These failover capabilities offer tremendous benefits to system architects, simplifying the overall design of high-availability systems. Furthermore, because the hardware enables the fabric repairs, the response time can be much faster than typical software-recovery processes. In the absence of a hardware self-healing fabric, a sophisticated real-time operating system running on a central agent is responsible for failover. The operating system must continually poll, or be interrupted by, subsystems to ensure that they are operating correctly. As systems scale and become more complex, the time required for polling all the subsystems, or responding to many interrupts, increases. This situation in turn increases the delay before an error is detected and failover can correct it.
The InfiniBand architecture defines APM (automatic path migration) that is implemented in the InfiniBand silicon within the switches and channel adapters. APM allows you to set up a redundant standby path between two end processes. Under error conditions, the hardware may effect a migration from the primary path to this alternative path. This alternative path may take a different route between two endpoints, thus bypassing a failed switch element, for example. For instance, consider a simple system consisting of a server connected through redundant InfiniBand switches to a storage array (Figure A). In this case, the primary path is established from the server to its storage through Switch 1. As part of the InfiniBand fabric initialization, InfiniBand also discovers the redundant path between the server and the storage array and sets it up as the alternative path. If a failure occurs on any of the links on the primary path or even on Switch 1, then the InfiniBand fabric can detect the error and use APM to migrate to the alternative path.
Figure A InfiniBand architecture features self-healing interconnects that automatically reroute signals to bypass a failed element.
The InfiniBand architecture provides the mechanism for the endpoints to initialize APM but has not defined when failover is invoked. This situation gives system developers the flexibility to define powerful failover policies. For example, a device might initiate path migration after a preprogrammed number of CRC errors have occurred within a given time frame. Such CRC errors may provide a “soft” indication of an impending failure even before the link actually fails completely. Being able to respond to such information before the failure becomes serious greatly improves the system-level reliability.
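The CRC-error policy described above amounts to counting errors in a sliding time window. The Python sketch below shows one way a developer might express such a policy; it is not an InfiniBand interface, and the class name, method name, threshold, and window length are all hypothetical.

```python
# Minimal sketch of a failover policy: migrate after N CRC errors within a time window.
from collections import deque
from typing import Deque


class PathMigrationPolicy:
    """Signals migration once max_errors CRC errors fall within window_s seconds."""

    def __init__(self, max_errors: int, window_s: float) -> None:
        self.max_errors = max_errors
        self.window_s = window_s
        self._errors: Deque[float] = deque()   # timestamps of recent CRC errors

    def record_crc_error(self, now: float) -> bool:
        """Log one CRC error; return True if the path should migrate now."""
        self._errors.append(now)
        # Drop errors that have aged out of the window.
        while self._errors and now - self._errors[0] > self.window_s:
            self._errors.popleft()
        return len(self._errors) >= self.max_errors


policy = PathMigrationPolicy(max_errors=3, window_s=1.0)
for t in (0.0, 0.5, 2.0, 2.2, 2.4):
    print(t, policy.record_crc_error(now=t))
```

In this run the two early errors age out before a third arrives, so migration is signaled only when three errors cluster within one second, the kind of “soft” indication the text describes.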
Furthermore, the InfiniBand fabric heals itself and only then notifies higher level software that an APM event has occurred, improving failover latency. InfiniBand’s APM capability bypasses the latency that complex operating-system scheduling and ring transitions introduce. For embedded systems in which failover requirements are measured in milliseconds, this capability can be essential. InfiniBand’s support of APM means that the I/O fabric is self-healing, giving system architects a powerful tool for designing high-availability systems.

You can contact Technical Editor Warren Webb at (1) 858-513-3713, fax (1) 858-486-3646, e-mail wwwebb@cts.com.