The airbag is now for virtually all of us a familiar technology in our vehicles, from the smallest run-around to the largest and most expensive executive supercomputers on wheels.
All airbag systems have two things in common. When they are functioning correctly and are used correctly, in conjunction with a correctly worn seat belt, they prevent serious injury and save lives. Conversely, when an airbag malfunctions or is used incorrectly (without a seat belt, or with a child in a rear-facing child seat on the front passenger seat), it can cause unnecessary injury or even, in extreme cases, death.
While airbag system developers, vehicle manufacturers and governments do all they can to prevent incorrect use, this cannot be ruled out. It is clearly the responsibility of the developers and manufacturers to help ensure that these safety critical systems function correctly.
‘Diagnostic coverage’ and ‘safe failure fraction’
The meaning of “coverage” seems obvious: a device has high coverage when it reaches a safe state for almost 100% of faults. This superficial interpretation is wrong, and it leads to a development dilemma. Because “high” needs to be quantified, the automotive industry increasingly refers to IEC 61508-2 Tables 2 and 3, where diagnostic coverage is linked to Safety Integrity Levels (SIL): loosely speaking, the higher the diagnostic coverage provided, the higher the SIL (system safety) that can be claimed. Typically, automotive suppliers target SIL 3 and, perhaps incorrectly, interpret Table 3 for fail-silent systems, including new products, as requiring a diagnostic coverage of at least 99%. In fact, Table 3 refers to the safe failure fraction.
At this point at the latest, hardware engineers become nervous: achieving 99% test coverage in production tests already requires sophisticated design-for-test (DFT) approaches, and in application mode it is unachievable. So how could one ever reach this 99% goal?
A closer look at IEC 61508-2 reveals some misunderstandings. Annex C defines “diagnostic coverage” (DC) and “safe failure fraction” (SFF). The key figures are λD, λDD and λS, which represent the probabilities of dangerous failures, dangerous detected failures and safe failures. So we are dealing with probabilities, not with numbers of faults. While modern devices have more than 100 million possible faults, the probability of a failure is very low. FIT rates below 100 are state of the art (1 FIT = 10^-9 failures/h).
There is, though, a small set of faults with much higher probability. These faults are usually transient, depend on the application profile and are induced externally. Examples include short circuits on the PCB, EMC injections and supply voltage variations outside the specified range, not to mention faults induced by human error.
The diagnostic coverage is defined as DC = ∑λDD / ∑λD, while the safe failure fraction is SFF = (∑λS + ∑λDD) / (∑λS + ∑λD). Obviously SFF gets close to 1 (or 100%) when we take care of this small set of likely faults.
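As a minimal sketch of the two definitions above, the following Python computes DC and SFF from summed failure rates. The function names and the example rates are illustrative assumptions and are not taken from IEC 61508 or from any real device.

```python
FIT = 1e-9  # 1 FIT = 1e-9 failures per hour


def diagnostic_coverage(lambda_dd: float, lambda_d: float) -> float:
    """DC = sum(lambda_DD) / sum(lambda_D)."""
    if lambda_d <= 0:
        raise ValueError("lambda_d must be positive")
    return lambda_dd / lambda_d


def safe_failure_fraction(lambda_s: float, lambda_dd: float, lambda_d: float) -> float:
    """SFF = (sum(lambda_S) + sum(lambda_DD)) / (sum(lambda_S) + sum(lambda_D))."""
    total = lambda_s + lambda_d
    if total <= 0:
        raise ValueError("lambda_s + lambda_d must be positive")
    return (lambda_s + lambda_dd) / total


# Hypothetical summed rates, chosen only to illustrate the formulas.
lambda_s = 50 * FIT    # safe failures
lambda_d = 150 * FIT   # all dangerous failures
lambda_dd = 120 * FIT  # dangerous failures that are detected

dc = diagnostic_coverage(lambda_dd, lambda_d)
sff = safe_failure_fraction(lambda_s, lambda_dd, lambda_d)
print(f"DC  = {dc:.1%}")
print(f"SFF = {sff:.1%}")
```

With these assumed rates SFF comes out higher than DC, because safe failures are counted in both the numerator and the denominator of SFF.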
With this background it becomes clear where the focus for future “safe devices” needs to be: the probabilities of faults must be quantified, rather than the faults merely counted. As externally induced faults dominate the probabilities by far, these must be detected and managed.
The benefit of core self tests
An interesting observation is that so-called “core self tests” cover only a tiny part of λD. Assuming a device with 100 FIT containing a core that uses 5% of the die, an ideal core self test would improve the FIT rate by 5. However, core self tests do not detect transient faults that do not physically damage the device. Assuming the probability of externally influenced faults is >10^-7/h, reducing the probability of undetected faults by 5×10^-9/h does not improve DC or SFF significantly.
Despite this, OEMs and Tier 1 suppliers have been demanding core self tests with 99% fault coverage. Not only is the definition of fault coverage for these tests misleading with respect to safety (see above); such coverage also does not significantly improve the DC or SFF. To achieve the required SFF or DC, more focus needs to be on the probability, detection and mitigation of externally induced faults. Today the discussion of fault probabilities is a sensitive area, as a standardized catalogue of faults and probabilities does not exist in the automotive industry.
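The arithmetic behind this observation can be sketched in a few lines of Python. The 100 FIT device rate, the 5% core share and the 10^-7/h lower bound for external faults come from the text. The simplification that every failure counts as dangerous (λS = 0) and the 90% detection figure for external-fault monitors are assumptions made only for this comparison.

```python
FIT = 1e-9  # 1 FIT = 1e-9 failures per hour

device_rate = 100 * FIT  # intrinsic device failure rate (from the text)
core_share = 0.05        # core uses 5% of the die (from the text)
external_rate = 1e-7     # lower bound for externally induced faults (from the text)

# Assumption: all failures are treated as dangerous, so lambda_S = 0.
lambda_d = device_rate + external_rate

# Case 1: an ideal core self test detects every intrinsic core failure,
# but none of the externally induced transient faults.
lambda_dd_core = core_share * device_rate
dc_core_only = lambda_dd_core / lambda_d

# Case 2 (hypothetical): monitors detect 90% of externally induced faults,
# and there is no core self test at all.
external_detection = 0.90
lambda_dd_external = external_detection * external_rate
dc_external_only = lambda_dd_external / lambda_d

print(f"DC with ideal core self test only:   {dc_core_only:.1%}")
print(f"DC with external-fault monitors only: {dc_external_only:.1%}")
```

Under these assumptions the ideal core self test contributes a DC of about 2.5%, while the hypothetical external-fault monitors alone reach about 45%. A higher external fault rate than the 10^-7/h lower bound would shrink the self test's share further.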
Can higher integration improve safety?
With the ever-increasing demand for higher integration, the industry is facing a new paradigm shift: redundant functionality will be integrated in fewer devices. Dual-core devices are already state of the art for some applications, but there are also other types of redundancy on one chip. From a safety point of view, the spatial proximity of redundant functions on one die is considered challenging because of common cause faults. Clearly this is a valid point. It does not rule out, though, that a more highly integrated solution can be safer and more reliable.
Today’s technology holds huge potential for new solutions. The integration step alone already improves the reliability of interconnections significantly, as device-internal routing is much less susceptible to chemical or physical influence. Because on-chip connections act only as very short antennas, the susceptibility to disturbances and EMC is also lower. Short circuits inside a chip are extremely rare compared with those on PCBs. Therefore the probability of data transfer inconsistency inside a chip is extremely low.
A basic requirement is that common cause faults, as mentioned above, are detected with high probability. Today’s solutions, such as clock monitors, voltage monitors and on-chip thermal sensors, address this in part. Research and quality assurance help to understand failure mechanisms better. This has to go hand in hand with the development of further enhanced detection and mitigation strategies against common cause faults.
To implement these strategies, integration opens the space for a new type of monitoring function. On chip there is direct access to internal signals for online diagnosis; these signals are not accessible from outside. Multiple signals can be monitored at the same time. Tests can be performed at extremely short intervals and detect faults very quickly, even before they can start to propagate through the system. Moreover, monitors can be implemented redundantly in different places on the die, so even the failure of a monitor would be detected.
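The idea of redundant monitors that also expose a failed monitor can be illustrated with a small software model. This is a sketch only: a real implementation would be on-chip hardware, and the names, voltage limits and tolerance below are hypothetical values chosen for illustration.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    SUPPLY_FAULT = "supply fault"
    MONITOR_FAULT = "monitor fault"


def check_supply(volts_a: float, volts_b: float,
                 low: float = 4.5, high: float = 5.5,
                 tolerance: float = 0.2) -> Verdict:
    """Cross-check two redundant voltage monitors watching the same rail.

    volts_a and volts_b are the readings of two monitors placed at
    different locations on the die (hypothetical model).
    """
    # Two healthy monitors on the same rail should agree closely;
    # a large disagreement points to a failed monitor.
    if abs(volts_a - volts_b) > tolerance:
        return Verdict.MONITOR_FAULT
    # Both monitors agree, so an out-of-range reading is a real supply fault.
    if not (low <= volts_a <= high) or not (low <= volts_b <= high):
        return Verdict.SUPPLY_FAULT
    return Verdict.OK


print(check_supply(5.0, 5.05))  # monitors agree, rail in range
print(check_supply(6.0, 6.1))   # monitors agree, rail out of range
print(check_supply(5.0, 0.0))   # monitors disagree
```

The disagreement check runs first so that a stuck or dead monitor is reported as a monitor fault and is not mistaken for a supply problem.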
Technology trends
In the near future, a single-chip solution is unlikely because of the safety-critical nature of the airbag system and the problems associated with thermal runaway in faulty power stages, caused by PCB or silicon faults, which leads to loss of control and ultimately chip destruction. But Freescale’s technology can be expected to support the move to ever higher levels of integration, leading to a two-chip solution engineered to bring airbag systems into ever smaller and lower-cost vehicles, including those in developing markets such as China.
REFERENCES
[1] IEC 61508 “Functional safety of electrical/electronic/programmable electronic safety-related systems”
[2] IEC TR 62380 “Reliability data handbook – Universal model for reliability prediction of electronics components, PCBs and equipment”
Illustrations:
Figure 1