Using digital design to implement physical reliability

By Wendy Luiten and John Parry

Digital design is increasingly used earlier in the design cycle to predict the zero-hour nominal performance and to assess reliability.

Computer simulation of product behavior was first used mainly to replace hardware prototyping when assessing design performance at the end of the development cycle. However, with increasingly powerful computational resources, digital design is now used earlier in the design cycle to predict the zero-hour nominal performance and to assess reliability.

Reliability is “the probability that a system will perform its intended function without failure, under stated conditions, for a stated period of time.” The first part of this definition focuses on performance—the product must perform its intended function without failure. The second part addresses usage aspects—under what conditions will the product be used. The third part addresses time—how long will the product be operating.

Figure 1 In this system development diagram, the requirements flow down and the capabilities flow up.

The flow of digitally designing for performance is illustrated through the well-known V-model (Figure 1)—the requirements flow down, and the capabilities flow up. Business and marketing requirements flow down for the system, then the subsystem, and the components in the left-hand side of the V. After design, the capability of the component to fulfill its sub-function without failure is verified, as is the capability of the subsystem and the system. Finally, the full system is validated against business and marketing expectations.

Three parts of designing for reliability

Digital design improves and speeds the verification step by calculating whether the specified system, subsystem, or component inputs will result in the required output. In addition, digital design can be used to guide architecture and design choices. To design and analyze electronics cooling, 3D computational fluid dynamics (CFD) software is used to construct a thermal model of the system at the concept stage, before design data is committed into the electronic design automation (EDA) and/or mechanical CAD (MCAD) systems. The model is then elaborated with data imported from the mechanical and electrical design flows as the development progresses to create a digital twin of the thermal performance of the product that can then be used for verification and analyses.

The second part of designing for reliability—the conditions—incorporates use cases representing different stages in the lifecycle of the system. These include transport, preparing for use, first use, normal use, and end-of-use scenarios. The product should be able to withstand normal transport conditions such as drops, vibrations, and temperature extremes. It should not break down when a mistake is made in handling. Different loading conditions will occur in varying temperature and humidity environments during normal use.

After end-of-use, a product should be easily recycled and not create environmental damage. These use cases represent a wider group of scenarios than typical, normal use conditions. A product that only works in the lab is not perceived as a reliable product. Digital design is used to simulate specific steps in the lifecycle, for instance, drop and vibration tests to mimic transport conditions. In addition, digital design is used to run through what-if scenarios, simulating worst-case environmental conditions.

The third part of the reliability definition is about the time span that a product is expected to perform its intended function without failure. This is measured by the failure rate, which is defined simply as the proportion of the running population that fails within a certain time. If we start with a population of 100 running units, and we have a constant failure rate of 10 percent, then at t = 1, 90 units (90% of 100) are still running and at t = 2, 81 (90%×90) are running.
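The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration that assumes the failure rate stays constant from one time interval to the next; the function name `surviving_units` is ours and does not come from any library.

```python
def surviving_units(initial_units: float, failure_rate: float, periods: int) -> float:
    """Units still running after `periods` intervals at a constant failure rate."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    # Each interval, a fraction (1 - failure_rate) of the running units survives.
    return initial_units * (1.0 - failure_rate) ** periods


# The example from the text: 100 units, 10 percent failure rate.
for t in range(4):
    print(t, round(surviving_units(100, 0.10, t), 1))
```

For t = 1 and t = 2 this reproduces the 90 and 81 running units quoted in the text.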

Figure 2 The bathtub curve shows the rates of failure over time. Source: Wikipedia

The failure rate changes over time. The performance of a hardware product can be illustrated by a bathtub curve (Figure 2). The first phase, infancy, has a decreasing failure rate while the kinks are worked out of an immature design and its production. Example root causes of infancy failure include manufacturing issues from part tolerances and issues caused by transport or storage conditions, installation, or startup. This stage is where it is confirmed that the manufactured product performs as designed.

Note that this is from the business perspective, so the failures do not refer to a single instance of a product, but rather to the population that the business produces. An important factor that affects all parts of the bathtub curve is temperature, so the thermal performance of the system should be checked and compared to the simulation model at this stage.

The next phase is normal life, where the failure rate bottoms out to the flat part of the bathtub curve. Random failures from various sources of overstress combine as a constant aggregate failure rate. Overstress is defined as excursions outside known safe-operating limitations. In the third part of the curve, the failure rate increases because of the product wearing out over time as it is used.
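One common way to reproduce a bathtub shape numerically is to add up three Weibull failure rates, one per phase. The sketch below does that; the Weibull model and every parameter value in it are our assumptions, chosen only to give the qualitative shape, and are not taken from the article or from Figure 2.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# Hypothetical (shape, scale in hours) pairs, one per phase of the curve.
PHASES = {
    "infancy": (0.5, 1_000.0),       # shape < 1: decreasing failure rate
    "normal life": (1.0, 10_000.0),  # shape = 1: constant failure rate
    "wear-out": (5.0, 20_000.0),     # shape > 1: increasing failure rate
}


def bathtub_hazard(t: float) -> float:
    """Aggregate failure rate as the sum of the three phase contributions."""
    return sum(weibull_hazard(t, shape, scale) for shape, scale in PHASES.values())


for hours in (10, 5_000, 30_000):
    print(f"t = {hours:>6} h: failure rate = {bathtub_hazard(hours):.2e} per hour")
```

With these illustrative numbers the rate is high at 10 hours, lowest around 5,000 hours, and rising again by 30,000 hours.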

Failure and the stages of maturity

The V-diagram in Figure 1 shows that reliability is ensured by adherence of the manufactured product to the requirements. Parts that do not meet these requirements are considered defective because we can assume that they will fail early. Typically, higher levels are an aggregation of many lower levels. For example, an electronics assembly will contain multiple boards, and each board will contain multiple components and an even larger number of solder joints. This also means that lower levels need progressively lower failure rates to ensure reliability at higher levels. In high-reliability environments, failure rates are expressed in terms of parts per million (ppm) and process capability index (Cpk).
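To see why lower levels need progressively lower failure rates, consider a board that is only good if every one of its joints is good. The sketch below assumes that defects are independent and that every part has the same defect rate, which is a simplification; the function names and the 1,000-joint board are hypothetical.

```python
def assembly_defect_probability(part_defect_ppm: float, part_count: int) -> float:
    """Probability that at least one of `part_count` independent parts is defective."""
    p_part = part_defect_ppm / 1e6
    return 1.0 - (1.0 - p_part) ** part_count


def required_part_ppm(target_assembly_ppm: float, part_count: int) -> float:
    """Per-part defect rate (ppm) needed to reach a target assembly-level rate."""
    p_assembly = target_assembly_ppm / 1e6
    return (1.0 - (1.0 - p_assembly) ** (1.0 / part_count)) * 1e6


# Hypothetical board with 1,000 solder joints at 100 ppm each.
print(f"{assembly_defect_probability(100, 1000):.1%} of boards have a defective joint")
# Joint-level rate needed for the same board to reach 2,700 ppm.
print(f"{required_part_ppm(2700, 1000):.2f} ppm per joint")
```

Under these assumptions, the required defect rate per joint is roughly the board-level target divided by the number of joints.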

In the electronics-industry supply chain, the maximum acceptable failure rates of electronic assemblies start from a Cpk of 1.0, which corresponds to 2,700 ppm falling outside either the upper or lower specification limit. Large suppliers typically work from a Cpk of 1.33 (about 60 ppm) to a Cpk of 1.67 for critical parts (<1 ppm). In automotive applications, the growth of electronics subsystems, particularly those related to safety, is driving the supply chain to achieve ever-lower defect rates, now approaching 1 ppm at the level of individual components.
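The link between Cpk and ppm can be checked with the standard normal distribution. This sketch assumes a normally distributed, centered process (so that Cp equals Cpk) and counts parts outside both specification limits; `ppm_outside_limits` is our own helper name.

```python
import math


def ppm_outside_limits(cpk: float) -> float:
    """Two-sided out-of-spec rate in ppm for a centered, normally distributed process."""
    # The nearest specification limit sits 3 * Cpk standard deviations from the mean.
    z = 3.0 * cpk
    # erfc(z / sqrt(2)) is the probability of falling beyond +/- z sigma.
    return math.erfc(z / math.sqrt(2.0)) * 1e6


for cpk in (1.0, 4.0 / 3.0, 5.0 / 3.0):
    print(f"Cpk = {cpk:.2f}: {ppm_outside_limits(cpk):.2f} ppm")
```

The three values come out at roughly 2,700 ppm, a little over 60 ppm, and below 1 ppm, in line with the figures quoted above.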

A reliability-capable organization is one that is set up to learn from experience and operates proactively. The IEEE 1624-2008 Guide for Organizational Reliability Capability defines five stages in a reliability capability maturity model (CMM) that range from stage 1, which is purely reactive, to stage 5, which is proactive. Table 1 shows an extract from the matrix that covers reliability analysis and testing beginning with stage 2.

Table 1 IEEE 1624 capability maturity matrix excerpt on reliability analysis and testing


For a complex design, the multitude of use cases results in many potential failure conditions, which are costly and time consuming to test for in hardware. Hardware-based testing also requires a mature product late in the design. Hence, for a complex product, a stage 1 approach quickly shows the need for predictive modeling.

Digital design—computer simulations and modeling—is deployed from CMM stage 2. On the lower levels, this is purely performance and environment driven. Can the product perform its intended function, in all use cases, without failure, based on nominal inputs and outputs? Pilot runs, manufacturing investments, and lifetime tests are typically started after design freeze.

These entail investments of time and money that do not allow for an iterative approach. Stage 2 companies often designate computer simulation as the design verification step before design freeze. Experience shows that design rework is often needed to meet the requirements of the parts’ safe-operating limitations, for example, a maximum ambient temperature.

By stage 3, virtual analysis should be highly correlated with failure conditions, for instance, through the use of field data and dedicated reliability tests, so that there is a high likelihood of detecting failures through virtual analysis before they happen. In design failure mode and effect analysis (DFMEA), a risk priority number (RPN) is assigned to each product failure by multiplying its scores for severity, occurrence, and detection. Increasing the likelihood of detection can lower the RPN by as much as 80 percent.
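As a small worked example, the RPN is conventionally the product of the three scores. The sketch below assumes the widely used 1-to-10 scoring scale, where a lower detection score means the failure is more likely to be detected; the scores themselves are hypothetical and only show one way in which an 80 percent reduction can arise.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """RPN as the product of three scores, each assumed to be on a 1-10 scale."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("scores are assumed to lie between 1 and 10")
    return severity * occurrence * detection


# Hypothetical failure mode before and after improved virtual detection.
before = risk_priority_number(severity=8, occurrence=4, detection=10)
after = risk_priority_number(severity=8, occurrence=4, detection=2)
reduction = 1 - after / before
print(before, after, f"{reduction:.0%}")
```

Because the scores multiply, improving detection alone scales the RPN in direct proportion, without any change to severity or occurrence.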

In CMM stage 4, simulation is typically used early in the design process. It is used to calculate not only the nominal performance, but also its statistical distribution. In other words, failure is calculated with more granularity: not as a yes/no binary outcome but as a probability of failure, which is the statistical capability of the design as expressed in Cpk. In the DFMEA, this further lowers the RPN by backing up the claim of a low or remote occurrence score. In thermal design, higher CMM companies evolve to use measurements to underpin the fidelity of the simulation model by confirming material properties and thicknesses of bond lines along the heat-flow path.

Early design models, such as that shown in Figure 3 for an automotive ADAS control unit, simulated before component placement has closed in the EDA design flow, can be used to support choice of a cooling solution, apply deterministic design improvements, and explore the likely impact of variations in input variables.

Figure 3 The initial design for this automotive ADAS unit is modeled in Simcenter Flotherm.

The combination of computer simulations and statistical techniques is powerful in addressing both nominal design and statistical design capabilities. In design-of-experiments (DOE), a scenario consisting of a number of specific cases can be calculated as an array of virtual experiments. The cases are selected to enable separating out the effects of inputs and combinations of inputs, which results in the nominal performance output as a quantified function of the design inputs. At the lower CMM levels, this function can be used to choose the design inputs so that the design meets its intended function in all stated conditions.
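The idea of separating out the effects of inputs can be sketched with a two-level full factorial design. In this minimal example the function `run_virtual_experiment` is a hypothetical stand-in for one simulation run, and the two coded factors (heatsink height and airflow, each set to -1 or +1) and all coefficients are invented for illustration.

```python
from itertools import product


def run_virtual_experiment(heatsink_height: int, airflow: int) -> float:
    """Hypothetical stand-in for one simulation; returns a temperature in deg C."""
    return 80.0 - 5.0 * heatsink_height - 3.0 * airflow + 1.0 * heatsink_height * airflow


def full_factorial(n_factors: int) -> list:
    """All combinations of low (-1) and high (+1) settings for n_factors."""
    return list(product((-1, 1), repeat=n_factors))


def effect(design: list, responses: list, columns: tuple) -> float:
    """Mean response change when the product of the chosen columns goes from -1 to +1."""
    signed_sum = 0.0
    for row, y in zip(design, responses):
        sign = 1
        for c in columns:
            sign *= row[c]
        signed_sum += sign * y
    return signed_sum / (len(design) / 2)


design = full_factorial(2)
responses = [run_virtual_experiment(*row) for row in design]
print("heatsink height effect:", effect(design, responses, (0,)))
print("airflow effect:", effect(design, responses, (1,)))
print("interaction effect:", effect(design, responses, (0, 1)))
```

Halving each effect gives the coefficient of the corresponding term in the fitted function, which is the quantified relationship between design inputs and output that the text describes.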

Becoming a highly capable reliability company

At higher CMM levels, the V-model also includes knowing the statistical distribution of the inputs and having a requirement on the allowed probability of failure, usually expressed as a Cp/Cpk statistical capability or a sigma level. Again, a DOE can be used to determine the output performance as a function of design inputs and noise factors. Subsequently, the effect of the noise and the statistical distribution of the input factors can be determined, for instance, through Monte Carlo simulation. For each design input and each noise factor, a random value is picked from the relevant distribution and substituted in the equation to calculate the performance output.

This is repeated a large number of times, for example 5,000. Each time, a set of design inputs and noise values is selected and substituted into the function to calculate the performance output. The result is a predicted data set of 5,000 values for the performance output that can be used to show the expected statistical distribution and the expected statistical capability and failure rate.
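The Monte Carlo step can be sketched as follows. The transfer function, the normal distributions, and the 125 °C upper limit are all assumptions made for illustration; in practice the function would be the one fitted from the DOE, and the distributions would come from measured or specified input variation.

```python
import random
import statistics


def junction_temperature(ambient_c: float, power_w: float, r_theta_c_per_w: float) -> float:
    """Hypothetical transfer function standing in for the DOE-fitted equation."""
    return ambient_c + power_w * r_theta_c_per_w


def monte_carlo(runs: int = 5000, seed: int = 1) -> list:
    """Draw each input from its assumed distribution and evaluate the function."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    samples = []
    for _ in range(runs):
        ambient = rng.gauss(85.0, 2.0)   # noise factor, deg C (assumed)
        power = rng.gauss(10.0, 0.3)     # design input, W (assumed)
        r_theta = rng.gauss(3.0, 0.15)   # design input, deg C/W (assumed)
        samples.append(junction_temperature(ambient, power, r_theta))
    return samples


def upper_cpk(samples: list, upper_limit: float) -> float:
    """One-sided capability index against an upper specification limit."""
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    return (upper_limit - mean) / (3.0 * sigma)


temperatures = monte_carlo()
print(f"mean = {statistics.fmean(temperatures):.1f} C, "
      f"Cpk = {upper_cpk(temperatures, upper_limit=125.0):.2f}")
```

Only an upper limit is used here because a temperature requirement is typically one-sided; a two-sided requirement would take the smaller of the upper and lower indices.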

Figure 4 This higher-level CMM workflow combines digital and statistical design.

The workflow for a higher-level CMM is shown in Figure 4, together with the capability analysis of the 5,000 simulations for an improved version of the design in Figure 3. The demonstrated Cpk of 1.05 is far below 1.33, so the expected failure rate far exceeds the acceptable ppm level. Because a low failure rate is sought, the number of Monte Carlo experiments needed is high, as shown in Figure 5.
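One way to get a feel for why low failure rates demand many runs is to ask how many independent runs are needed before a failure is likely to appear at all. This is a generic sampling argument rather than the authors' method, and the function name and the 95 percent confidence level are our assumptions.

```python
import math


def runs_to_observe_failure(failure_probability: float, confidence: float = 0.95) -> int:
    """Smallest number of independent runs that yields at least one failure
    with the given confidence, for a given per-run failure probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_probability))


for ppm in (2700, 60, 1):
    runs = runs_to_observe_failure(ppm / 1e6)
    print(f"{ppm:>5} ppm: about {runs:,} runs")
```

The required count grows roughly as three divided by the failure probability, so at ppm-level targets a run of 5,000 simulations would usually contain few or no failures, and capability is better estimated from the fitted distribution than from counting failed runs.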

Figure 5 This prediction of junction temperature for a critical IC7 component for 5,000 simulations accounts for statistical variation in input parameters using HEEDS software.

A proactive rather than reactive approach

The lower-level CMM organizations have a reactive approach to high levels of failure in normal use; that is, nominal calculations that affect the failure rate in the flat part of the bathtub curve. Mature organizations simultaneously work in more fields and deploy both nominal and statistical modes of digital design specific to the different parts of the bathtub curve: product infancy, normal use, and wear.

Stage 5 CMM organizations also invest in understanding the root causes of failure mechanisms underpinning the random failures in normal life and wear. Overstress and wear need more extensive investigations to be able to link lifetime expectations with operating conditions. Often this entails linking specific measurements to failure conditions and simulations to identify stressors, failure physics, and acceleration mechanisms.

Siemens EDA has created a power electronics testing solution that combines active power cycling with automated, JEDEC-compliant, thermal impedance measurements. Assessment of the package’s thermal structure can be used to calibrate a detailed 3D thermal simulation model to deliver the highest predictive accuracy during design. The graph in Figure 6 compares the results of running thermal structure functions for a Simcenter Flotherm model of an IGBT to testing the actual part in the POWERTESTER.

Figure 6 This image shows the Simcenter POWERTESTER, IGBT module thermal model, and a graph of the measured and calibrated structure functions.

Both of these systems provide comprehensive cycling strategies for different use-case conditions and capture a range of electrical and thermal test data that can be applied to the model, in addition to running regular thermal transient tests. For example, the results can be used to identify damage to the package interconnect or to locate the cause of degradation within the part’s thermal structure, thereby meeting the testing requirements of CMM stage 4 and providing the data necessary to achieve stage 5.

This article was originally published on EDN.

Wendy Luiten is a thermal specialist at Siemens EDA and a Master Black Belt of Design for Six Sigma (DfSS).

Dr. John Parry, CEng, is the electronics industry manager at Siemens EDA.
