In 2018, I had to replace some failed components on a 25-year-old HP 34401A DMM. Two capacitors failed, taking out diodes as well. That was the third instance of having to replace failed capacitors within a year, the others being a computer monitor and a DVD recorder. The engineers who designed these products did a creditable job, for each product lasted for years before a failure (25 for the DMM). Fortunately, all were easy to fix, and the failures didn't cause any significant disruptions.

Had the 34401A been installed in an automated test station used for production, the situation might have been different, but only if the meter were being used for AC measurements. If the meter's AC functions were not used, then the failure could have gone unnoticed unless someone could smell the burned components (Figure 1).

Figure 1. Capacitors, especially the aluminum electrolytic and tantalum types used in power supplies, are the most likely components to fail. In this case, the time-to-failure was 25 years.

Many engineering teams begin looking for possible failures as soon as they have a basic design. Medium-to-large organizations typically have reliability engineers onsite who review design documents, bills of materials, test procedures, and manufacturing processes looking for weaknesses that can lead to failure. Small companies or startups may use consultants. Reliability engineers look for and analyze possible failure conditions (modes) to determine each failure's severity and probability of occurrence. "Bring in reliability people at the start," said Ken Rispoli, who recently retired as Sr. Principal Engineer at Raytheon Integrated Defense Systems. "They can point to problems that designers might overlook."

Define the failure

"The first thing you must do when developing a reliability plan is to define a failure, which can take one or two meetings," said reliability consultant Kevin Granlund. "Reach consensus and document it." The HP 34401A capacitor failure highlights that point: What's a failure in some circumstances might not be a failure in others. It depends on the use case. For some applications, a momentary interruption is tolerable; for others, it is not. That depends on the user's tolerance.

Failure analysis comes down to risk. How much risk of failure is acceptable depends on the product's intended use cases and the consequences of the failure. If lives or large sums of money are at stake, then you need to minimize risk to the greatest extent possible. Risk also depends on the expected lifetime of a product. As Rispoli noted, "Do you need to design a consumer product to work for 20 years when its expected lifetime is two years?" Yes, in the case of the 34401A and similar products, many of which have been in use for 20 years and longer.

Engineers have several tools and practices available for reliability planning and analysis.

  • Stress analysis
  • Mean time between failures (MTBF)/mean time to failure (MTTF)
  • Failure mode and effects analysis (FMEA)/Failure mode, effects and criticality analysis (FMECA)
  • Worst-case analysis

According to Charles Hymowitz, managing director of AEi Systems, stress analysis is the most important to perform, but it doesn’t cover everything. When designing circuits for reliability, many engineers start here because you can perform stress analysis through simulation.

We know how temperature affects component and system lifetimes. Thus, stress analysis is, in many ways, thermal analysis. Thermal problems come from too much heat generated by circuits combined with inadequate cooling. Start by calculating the voltage and current in each component, then calculate the dissipated power. Figure 2 shows a typical thermal simulation for an IC.

Figure 2. Thermal analysis can reveal hot spots in ICs, boards, and systems.

According to Hymowitz, temperature increases caused by heat dissipation will have a more adverse effect on passive components and discrete semiconductors than on ICs. Temperature rise not only affects a component's time to failure, but it can also change a component's value, which can cause a board or system to function outside of its intended parameters.
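The stress calculation described above (current through a component, then dissipated power, then temperature rise and value drift) can be sketched in a few lines of Python. This is a minimal illustration for a single resistor, not a substitute for a thermal simulation. The `Resistor` record, the `stress_report` function, and every numeric value are hypothetical; the temperature estimate assumes a simple steady-state model in which the rise equals power times a thermal resistance taken from a datasheet.

```python
from dataclasses import dataclass


@dataclass
class Resistor:
    """Hypothetical part record; real values would come from a datasheet."""
    name: str
    ohms: float           # nominal resistance at 25 °C
    rated_watts: float    # rated power dissipation
    theta_c_per_w: float  # thermal resistance to ambient, °C/W (assumed model)
    tempco_ppm: float     # temperature coefficient, ppm/°C


def stress_report(part: Resistor, current_a: float, ambient_c: float = 25.0) -> dict:
    """Estimate power, stress ratio, temperature, and drifted resistance."""
    power_w = current_a ** 2 * part.ohms               # P = I^2 * R
    stress_ratio = power_w / part.rated_watts          # fraction of the rating in use
    temp_c = ambient_c + power_w * part.theta_c_per_w  # steady-state estimate
    drift = part.tempco_ppm * 1e-6 * (temp_c - 25.0)   # fractional change in value
    return {
        "power_w": power_w,
        "stress_ratio": stress_ratio,
        "temp_c": temp_c,
        "ohms_at_temp": part.ohms * (1.0 + drift),
    }


if __name__ == "__main__":
    # Illustrative numbers only: 100 Ω, 0.5 W part carrying 50 mA.
    r1 = Resistor("R1", ohms=100.0, rated_watts=0.5,
                  theta_c_per_w=100.0, tempco_ppm=100.0)
    for key, value in stress_report(r1, current_a=0.05).items():
        print(f"{key}: {value:.3f}")
```

A stress ratio close to 1.0 flags a part running near its rating, and the drifted resistance shows how heat alone can move a value away from nominal.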

Following stress analysis in circuit design comes MTBF/MTTF analysis. Many passive and active component manufacturers publish these data, separately for wafer fab and packaging. MTBF is largely based on historical data. While MTBF is a common analysis performed on many designs (you need it for FMEA/FMECA), it can be error prone, according to Hymowitz.

A typical MTBF/MTTF analysis ranks parts in order of shortest time to failure. At that point, it's a tradeoff among risk of failure, consequences of failure, and cost. Extending MTBF/MTTF often means paying for higher-grade parts. If you're designing a satellite or other system where avoiding failure is paramount, you'll probably choose parts with the longest MTBF. For, say, an MP3 player, cost might take priority, to a point. If your product is a system such as a network or manufacturing system that can't be replaced in whole and downtime is an issue, then take mean time to repair (MTTR) into account as well.
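As a rough illustration of the ranking and of how MTTR enters the picture, the sketch below sorts parts by MTBF, combines them into a system figure, and computes steady-state availability. It assumes constant failure rates and a series system in which any single part failure takes the system down; those are common simplifications, not something the analysis above requires. The part names, the MTBF hours, and the function names are all hypothetical.

```python
def rank_by_mtbf(parts: dict) -> list:
    """Return (name, mtbf_hours) pairs, shortest time to failure first."""
    return sorted(parts.items(), key=lambda item: item[1])


def series_mtbf(mtbf_hours) -> float:
    """System MTBF when any single part failure fails the system.

    With constant failure rates, the rates (1/MTBF) simply add.
    """
    total_rate = sum(1.0 / hours for hours in mtbf_hours)
    return 1.0 / total_rate


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative figures only, not taken from any datasheet.
    parts = {"C12 (electrolytic)": 200_000.0,
             "D3 (diode)": 1_000_000.0,
             "U1 (regulator)": 500_000.0}
    for name, hours in rank_by_mtbf(parts):
        print(f"{name}: {hours:,.0f} h")
    system = series_mtbf(parts.values())
    print(f"system MTBF: {system:,.0f} h")
    print(f"availability with 8 h MTTR: {availability(system, 8.0):.6f}")
```

The part at the top of the ranked list dominates the system figure, which is why upgrading that one part usually buys the most reliability per dollar.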

In my recent repairs, the point of failure was always a capacitor. Capacitor manufacturers help you analyze MTBF by providing online lifetime calculators. A typical calculator plots a capacitor's expected lifetime in a power supply based on temperature, DC voltage, and ripple (AC) voltage.
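To give a feel for what such a calculator does, here is a minimal sketch based on the widely used rule of thumb that an aluminum electrolytic capacitor's life roughly doubles for every 10 °C drop in core temperature. It is not any manufacturer's actual model. Ripple is treated as self-heating (ripple current squared times ESR times a thermal resistance), and the effect of DC voltage is left as a plain multiplier, `voltage_factor`, because that term varies by manufacturer and series. All names and numbers are hypothetical.

```python
def core_temp_c(ambient_c: float, ripple_a_rms: float = 0.0,
                esr_ohms: float = 0.0, theta_c_per_w: float = 0.0) -> float:
    """Estimate core temperature: ambient plus ripple self-heating."""
    ripple_power_w = ripple_a_rms ** 2 * esr_ohms   # heat from ripple current
    return ambient_c + ripple_power_w * theta_c_per_w


def estimated_life_hours(rated_life_h: float, rated_temp_c: float,
                         core_c: float, voltage_factor: float = 1.0) -> float:
    """Apply the 10-degree rule: life doubles per 10 °C below rated temperature.

    voltage_factor is a placeholder for a manufacturer-specific derating term.
    """
    doublings = (rated_temp_c - core_c) / 10.0
    return rated_life_h * voltage_factor * 2.0 ** doublings


if __name__ == "__main__":
    # Illustrative part: rated 2,000 h at 105 °C, used at 55 °C ambient
    # with 1 A rms ripple, 0.1 Ω ESR, and 50 °C/W thermal resistance.
    core = core_temp_c(55.0, ripple_a_rms=1.0, esr_ohms=0.1, theta_c_per_w=50.0)
    life = estimated_life_hours(2000.0, 105.0, core)
    print(f"core temperature: {core:.1f} °C")
    print(f"estimated life: {life:,.0f} h ({life / 8760:.1f} years)")
```

Even this crude model shows why a capacitor mounted next to a hot regulator or heatsink can fail many years before an identical part elsewhere on the board.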


Martin Rowe is a senior technical editor covering test and measurement for EDN and EE Times.
