Why engineers should not do their own worst-case analysis

Article by: Charles Hymowitz

The elements necessary to perform WCCA should be brought together along with the right software, people, test data, and experience.

The term “escapes” is a euphemism for all the excuses that programs, program managers, engineers, and reviewers use to curtail or eliminate worst-case circuit analysis (WCCA) activities. It is essential that all the elements necessary to perform the WCCA be brought together along with the right software, people, test data, and experience. With each hurdle, the analysis stalls, which erodes the rigor of the work and the veracity of its conclusions. In addition, a variety of issues can prevent the successful completion of WCCA.

Figure 1 Poor power integrity can lead to rogue waves. These are transient load conditions where the stepped current requirements of the load (FPGA, processor, and memory) align with resonances in the power rail’s PDN impedance. When this happens, the power supply voltage can jump out of specification, resulting in a fault condition that is tremendously difficult to recreate in traditional production testing. But this worst-case event happens all the time in real life. Source: AEi Systems

Some common escapes include a lack of foresight to properly scope and budget the effort; poor, non-existent, or ever-changing design specifications or flowdown requirements from the customer; saying “we test and we have redundancy” (note: redundancy won’t save a bad design); and time compression or poor scheduling.

WCCA needs time. It is often force-fit between the end of the design process and production. Unfortunately, too many programs are still designing right up until final reviews, or even after WCCA findings are revealed, leaving little or no time to perform the WCCA properly, let alone fix the issues found. Any non-compliances need to be addressed appropriately, and a reanalysis pass to define and confirm fixes is always necessary.

WCCA needs test data to support models and assumptions; if the hardware does not match the analysis, problems will occur.

Hardware is essential for efficient WCCA. Three gaps are critical: a lack of part data to fill in datasheet holes, a lack of model correlation data to define model performance, and a lack of circuit correlation data to anchor simulations, assumptions, and conclusions. Without data, you will be making judgments and design decisions without a firm foundation.

Another common escape is the designer who thinks the analysis isn’t needed. The selection of the parameters to be analyzed should not be generated by the circuit designer alone, because mistakes in the design will often be repeated in the analysis. Circuits that the designer believes are too simple, obvious, or repeatable may be ignored, and that is often where problems lie.

Don’t underestimate the tolerance stack-up. Until the tolerance database or parts variability database (PVDB) is compiled and the analysis performed, it is tremendously difficult to know what part tolerances will do to performance. We know very little about the parts we use, and we often don’t know the sensitivity of the circuit performance to various unbounded or undocumented parameters. To dismiss the variances as inconsequential before performing the analysis is one of the biggest escapes. Whether combined by root-sum-square (RSS) or by extreme value analysis (EVA), the tolerance stack-up is bigger than you believe it to be.
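To make the RSS versus EVA comparison concrete, here is a minimal Python sketch. It is not a method taken from this article: the unloaded resistive divider, the part values, the tolerances, and the helper names (`divider_vout`, `sensitivities`) are all hypothetical and chosen only for illustration. EVA is read here as extreme value analysis, which pushes every parameter to its worst extreme at once, and RSS as root-sum-square, which combines the individual contributions in quadrature.

```python
import math

def divider_vout(vin, r_top, r_bottom):
    """Output voltage of an unloaded resistive divider."""
    return vin * r_bottom / (r_top + r_bottom)

# Hypothetical nominal values and total fractional tolerances (illustration only).
NOMINAL = {"vin": 5.0, "r_top": 10_000.0, "r_bottom": 10_000.0}
TOLERANCE = {"vin": 0.02, "r_top": 0.05, "r_bottom": 0.05}

def sensitivities(func, nominal, rel_step=1e-6):
    """Central-difference d(output)/d(parameter) at the nominal point."""
    result = {}
    for name, value in nominal.items():
        step = value * rel_step
        high = dict(nominal, **{name: value + step})
        low = dict(nominal, **{name: value - step})
        result[name] = (func(**high) - func(**low)) / (2 * step)
    return result

sens = sensitivities(divider_vout, NOMINAL)
v_nominal = divider_vout(**NOMINAL)

# EVA: move every parameter to the extreme that raises (or lowers) the output.
# Picking corners from the sign of the sensitivity assumes the output is
# monotonic in each parameter over its tolerance range.
high_corner = {n: NOMINAL[n] * (1 + math.copysign(TOLERANCE[n], sens[n]))
               for n in NOMINAL}
low_corner = {n: NOMINAL[n] * (1 - math.copysign(TOLERANCE[n], sens[n]))
              for n in NOMINAL}
eva_max = divider_vout(**high_corner)
eva_min = divider_vout(**low_corner)

# RSS: combine the linearized per-parameter contributions in quadrature.
rss = math.sqrt(sum((sens[n] * NOMINAL[n] * TOLERANCE[n]) ** 2 for n in NOMINAL))

print(f"nominal: {v_nominal:.4f} V")
print(f"EVA:     {eva_min:.4f} V to {eva_max:.4f} V")
print(f"RSS:     {v_nominal - rss:.4f} V to {v_nominal + rss:.4f} V")
```

With these assumed numbers, EVA spans roughly 2.33 V to 2.68 V, while RSS gives roughly ±0.10 V around the 2.5 V nominal. Either way, the spread is far from negligible for a circuit that looks trivial on paper.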

Company, program, and engineering biases

These entities are often infected by the nominal. They believe they know all they need to know given typical data. Typical datasheet information, curves, and test data are often used to justify conclusions about worst-case behavior, tolerance distributions, and so much more. Often the nature of the data is not even explored. Statistically speaking, the nominal does not tell you about extremes and should not be used to bound WCCA. The difference between a nominal stress analysis, a worst-case steady-state stress analysis, and a worst-case transient stress analysis (using end-of-life (EOL) part values, loading, and environmental extremes) can easily be an order of magnitude.
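A small worked example shows why a nominal stress number cannot bound the worst case. Everything in this sketch is assumed for illustration: a resistor across a supply rail, a 0.25 W rating, a 60% derating factor, and ±5% spreads on both the rail voltage and the resistance. It covers only the steady-state case; transient stress, which the paragraph above says can be far larger, is not modeled.

```python
def resistor_power_w(volts, ohms):
    """Steady-state power dissipated by a resistor across a voltage."""
    return volts ** 2 / ohms

# Hypothetical part rating and derating policy (illustration only).
RATED_POWER_W = 0.25
DERATING_FACTOR = 0.60
ALLOWED_POWER_W = RATED_POWER_W * DERATING_FACTOR

# Hypothetical nominal values and total fractional spreads.
V_NOMINAL, V_TOLERANCE = 12.0, 0.05
R_NOMINAL, R_TOLERANCE = 1_000.0, 0.05

# Nominal stress: typical voltage across a typical resistance.
nominal_power = resistor_power_w(V_NOMINAL, R_NOMINAL)

# Worst-case stress: highest voltage across the lowest resistance.
worst_power = resistor_power_w(V_NOMINAL * (1 + V_TOLERANCE),
                               R_NOMINAL * (1 - R_TOLERANCE))

nominal_ratio = nominal_power / ALLOWED_POWER_W
worst_ratio = worst_power / ALLOWED_POWER_W

print(f"nominal stress ratio:    {nominal_ratio:.2f}")
print(f"worst-case stress ratio: {worst_ratio:.2f}")
```

With these assumed numbers, the part sits at about 0.96 of its derated limit nominally and passes, but reaches about 1.11 at the worst-case corner and fails. The nominal result gave no hint of the violation.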

Companies believe past success predicts future success, even when the parts, requirements, environment, and designs change. If a unit did fail, it would likely be difficult to trace the issue back to a particular circuit or functional block. They also believe they have done all the homework they need: the little analysis they do, together with test data deemed both accurate and sufficient, is taken to clear any functional concerns or risks.

These entities put 100% stock in reference designs and datasheet information without any hint of pessimism. They are unfamiliar with the tolerance stack-up and the ‘Cracker-Jack’ phenomenon—the surprises that are waiting inside most ICs that you don’t know about until you open the box and look (deeply) inside. They do not understand the role limited, priority-based, and targeted WCCA can play in achieving higher reliability and meeting mission performance goals, and they do not care to learn how it can benefit them.

Another escape is saying “it costs too much” or “we don’t have anyone to do it.” Clearly, these escapes belie reality. As for cost, WCCA doesn’t cost money, it saves money. This may seem misguided at first, but once you understand the direct and ancillary benefits, it’s easy to see WCCA’s value. Below are just a few planned and executed ways to manage costs. As for who can do the work, consultants exist and the work can be targeted.

The return on investment for WCCA is significant. Table 1 lists some of the many reasons to perform WCCA.

Table 1 Reasons to perform WCCA

| Need | Reason |
|---|---|
| Design verification and reliability | To verify circuit operation and quantify the operating margins over part tolerances and operating conditions: Will the circuit perform its functions and meet specifications? |
| | To improve performance: determine the sensitivity of components to certain characteristics or tolerances in order to better optimize/understand a design and what drives performance |
| | To verify that a circuit interfaces with another design properly |
| | To determine the impact of part failures or out-of-tolerance modes |
| Test cost reduction | To evaluate performance aspects that are difficult, expensive, or impossible to measure (e.g., determine the impact of input stimulus and output loading so as not to damage hardware) |
| | To set ATP limits: without analysis, how will you know what you are supposed to see in test? |
| | To verify SATs/SITs and if they are needed/what their limits should be |
| | To reduce the amount and scope of testing |
| Parts assessment | To determine if a part is suitable (too cheap, too expensive) or if a new technology can be used |
| | To support/set critical parameters and SCD requirements/screening definition |
| | To perform single event transient (SET) analyses |
| | To support the switching and transient stress and derating analysis |
| Schedule, cost, or contractual risk reduction | To reduce board spins: determine the impact of late-stage design or part changes |
| | To verify changes to heritage circuits |
| | To obtain better insurance rates and reduce contractual liabilities |
| | To avoid a catastrophic or costly incident |
| Return on investment | To improve future products |
| | To improve the knowledge and capability of your engineering staff |

We test so we don’t need to analyze

This is one of the biggest escapes of all: Can’t electrical testing be used as a less expensive alternative? The answer is generally no.

The beginning-of-life (BOL) vs. end-of-life (EOL) tolerance variances are discussed in the blog “Optimizing Electronics Test/Analysis Ratio.” BOL tolerances dominate. Testing does not usually account for BOL tolerances; initial testing is rarely extensive due to various practical constraints, so test does not retire as much risk/margin as people think relative to the tolerance stack-up.

Usually, testing only determines typical 25°C performance. In many cases, extended testing must be performed with extreme operating conditions such as temperature, voltage, and power to determine aging margins. This can overstress the hardware. Testing is only valid for the measured lot and may vary lot-to-lot and manufacturer-to-manufacturer. It requires the parts to be procured prior to completion of the WCCA, which can be very risky and very costly if sophisticated test equipment is required.

While testing is essential to support the WCCA, testing doesn’t cover EOL analysis and often doesn’t even cover all operating conditions. In addition, testing has the following inherent concerns:

  • Without analysis, how do you know what to expect? One of the most basic rules of testing is to know what you expect to see, and that is often impossible if you have not performed any analysis.
  • Testing isn’t cheap, fast, or easy. Test setups often distort measurement data, and most labs are severely under- or ill-equipped.
  • Testing does not compute margins, risk, or parametric sensitivity—three key outputs of WCCA. Therefore, it is much harder to improve the design with only test data as the guide.
  • Many of the things we need to look at are not even testable.
  • Worst-case test conditions are often not defined, unattainable, or would over-stress the hardware.
  • Testing is often limited to the top-level outputs. If an anomaly isn’t seen, probing to lower levels is often not performed. Key functional blocks are often not tested. For instance, it’s easy for an op amp or power supply with poor stability to hide in a system that appears to be working properly. The poorly performing circuit may be masked and dismissed as increased noise. It is known that the stability margins of control loops can change by 20 to 30 degrees over temperature. So, without knowing where you stand nominally, EOL issues can easily crop up.
  • The differences between engineering model and flight/production parts and layouts are often underestimated.
  • In many cases, the PCB can impact the performance of the circuit. Therefore, it is essential that the final layout be used when testing.
  • One of the places where we find many worst-case issues is in power supplies. Power supplies are often not measured down to the level they should be, and the size of today’s power supplies is often so small that they cannot be measured properly or easily.
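The list above names margins and parametric sensitivity as outputs that testing alone does not provide. The following minimal Python sketch shows one way those two numbers might be computed. It is an assumption-laden illustration, not the author’s procedure: the RC low-pass filter, the part values, the tolerances, the specification limits, and the function names are all hypothetical.

```python
import math

def cutoff_hz(r_ohm, c_farad):
    """Cutoff frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)

# Hypothetical nominal values, total fractional tolerances, and spec limits.
NOMINAL = {"r_ohm": 1_000.0, "c_farad": 100e-9}
TOLERANCE = {"r_ohm": 0.01, "c_farad": 0.10}
SPEC_MIN_HZ, SPEC_MAX_HZ = 1_400.0, 1_800.0

def normalized_sensitivity(func, nominal, name, rel_step=1e-6):
    """Percent change in output per percent change in one parameter."""
    up = dict(nominal, **{name: nominal[name] * (1 + rel_step)})
    down = dict(nominal, **{name: nominal[name] * (1 - rel_step)})
    return (func(**up) - func(**down)) / (2 * rel_step * func(**nominal))

# Parametric sensitivity: which part drives the result, and in which direction.
sens = {name: normalized_sensitivity(cutoff_hz, NOMINAL, name) for name in NOMINAL}

# The cutoff falls as R or C rises, so the highest cutoff occurs with both
# parts at their minimum and the lowest with both at their maximum.
worst_high = cutoff_hz(NOMINAL["r_ohm"] * (1 - TOLERANCE["r_ohm"]),
                       NOMINAL["c_farad"] * (1 - TOLERANCE["c_farad"]))
worst_low = cutoff_hz(NOMINAL["r_ohm"] * (1 + TOLERANCE["r_ohm"]),
                      NOMINAL["c_farad"] * (1 + TOLERANCE["c_farad"]))

# Margin: distance between each worst-case value and its spec limit.
margin_high = SPEC_MAX_HZ - worst_high
margin_low = worst_low - SPEC_MIN_HZ

print(f"sensitivities: {sens}")
print(f"worst case:    {worst_low:.1f} Hz to {worst_high:.1f} Hz")
print(f"margins:       low {margin_low:.1f} Hz, high {margin_high:.1f} Hz")
```

In this assumed case both normalized sensitivities are close to -1, so a 1% change in either part moves the cutoff by about 1%, and the 10% capacitor dominates the spread. The upper margin is only about 14 Hz. A single bench measurement near the 1.59 kHz nominal would pass comfortably and reveal neither fact.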

Eliminating bias, ensuring independence

The project engineer is often under great schedule pressure, program budget pressure, and company political pressure. One of the main tenets of the Aerospace TOR guideline on WCCA (TOR-2012(8960)-4_Rev. A) is that WCCA performed in-house is not independent. Money, politics, and personal feelings all serve to destroy the checks and balances that WCCA is supposed to bring to the design process.

Figure 2 These images show worst-case events happening in real life: (L-R) a rocket explosion, a Samsung Galaxy battery fire, and a Tesla on fire. Source: AEi Systems

This is not to say that designers should not be involved. Certainly, the designer should develop the nominal models and be involved in the WCCA review. But independence is key to avoiding escapes. While some of these biases can influence even the most independent of analysts, this is clearly why companies and design engineers should not do their own worst-case analysis and why it is imperative to use an independent assessment team.

This article was originally published on EDN.

Charles Hymowitz is a technologist, marketer, and business executive with over 30 years of experience in the electrical engineering services and EDA software markets.
