Continuous performance regression 101

By Travis Lazar

With proper planning and execution, continuous performance-regression testing can be a powerful tool for hardware as well as software projects, enabling servers that support data center needs throughout the server life cycle.

Whether they are cloud service providers, co-location data center operators, or enterprises running a private cloud, data center operators have three key demands: reliability, availability, and performance. Developing servers that can support those needs throughout the life cycle of a server product requires more than simple one-time testing. Given the ever-changing nature of the software ecosystem and the breadth of software used in the data center, a multiyear approach to performance-regression testing is required.

This can present major challenges. Every day, thousands of software packages release updates into the data center ecosystem that present a technical burden for hardware developers and data center operators who don’t necessarily know how their hardware or infrastructure will be used in the future. This reality is a win for the development communities, but it requires a big-picture approach for hardware developers in how they tackle continuous performance testing.

The historical process of performance-regression testing has often been static. A typical approach might be to develop a shell script that outputs a performance result, run the test as part of a standard build flow, compare the result at build time with a baseline value, and pass or fail the test based on that comparison. This has two major drawbacks: It only runs the test at build time, and the baseline value used in determining pass/fail is unlikely to change throughout the life cycle of the test. Fixing the problem would require either manual maintenance of the test suite (which is costly and error-prone) or adoption of continuous performance-regression techniques.

Continuous performance-regression testing is a methodology for analyzing system performance throughout the lifetime of the product. It involves the entire software stack, from firmware up to user-space applications, addressing the widest possible range of application configurations. Crucially, it doesn’t just test with the initial configurations but also continues to test changes to the ecosystem as they evolve over time.

The hardware and firmware development worlds have typically lagged behind modern software communities in their approach to automation and continuous development activities. This is often due to the amount of time it takes to develop even a single generation of hardware, coupled with large amounts of legacy process and tooling.

When applied to hardware and firmware testing, continuous performance-regression testing and analysis deliver valuable insights into how systems behave under a wide variety of software deployments. The information is critical to optimizing workloads and maintaining the predictable software environments that data center operators and end users demand. Because bare-metal testing like this is such a complex task, involving thousands of moving parts, we’ll discuss some of the big-picture insights, including the pitfalls and how to avoid them, that we’ve gained along the way. These lessons can be applied to any software project leveraging modern DevOps technologies.

Continuous performance regression 101

We define continuous performance-regression testing as repeatedly evaluating the performance of each workload on a continuous and indefinite cadence. The performance measurements here are done in a fully integrated environment, meaning that we test using the entire hardware and software stack. This means, in turn, that any of several components might change between test runs. These changes can come from the firmware, operating system, kernel, libraries, or other components (see Figure 1).

Core elements of the software stack
Figure 1: Core elements of the software stack used in continuous performance-regression testing encompass firmware up to workloads run using user-space applications.

Each result then provides a point of comparison for each subsequent result. The goal is to provide actionable information with an actionable workflow:

• Has performance changed in a negative way (regressed)?

• Has performance changed in a positive way (improved)?

• Is the change problematic? If so, how can the issue be reproduced to enable the team to debug the issue?

• Is the change beneficial? If so, what can be learned from it and applied in other areas?

The depth of the software stack here means that the amount of change we could see is substantial. Testing every individual commit or version update is not practical or useful. We effectively test “as often as possible” and gather information at specific points in time. This allows us to use a bisection algorithm when doing root-cause analysis. The more often we can collect data, the easier the root-cause process will be. Our tests run roughly six times per day, so the window for a regressing change to occur is roughly four hours.
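The bisection step can be sketched in a few lines of Python. This is a minimal illustration, not the tooling described in this article: `changes` is a hypothetical ordered list of the component updates that landed between the last good run and the first bad run, and `is_bad` is a hypothetical callback that rebuilds the stack with the first changes applied and reruns the workload. It assumes a single change introduced the regression and that the regression persists once introduced.

```python
from typing import Callable, Sequence


def first_bad_change(changes: Sequence[str], is_bad: Callable[[int], bool]) -> str:
    """Return the first change whose inclusion makes the workload regress.

    is_bad(i) must report whether the stack built with changes[0..i] applied
    shows the regression. The last change is assumed to be bad, because the
    run that included every change is the one that flagged the regression.
    """
    if not changes:
        raise ValueError("need at least one candidate change")
    lo, hi = 0, len(changes) - 1  # hi is known to be bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid        # regression already present: look earlier
        else:
            lo = mid + 1    # still good: the culprit is later
    return changes[lo]
```

With N candidate changes, this needs roughly log2(N) rebuild-and-test cycles. A shorter window between runs means fewer candidates, which is why more frequent data collection makes root-cause analysis easier.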

Our test process requires that the system be rebuilt for every test. This ensures completely reproducible results.

Figure 2 shows a typical test process from start to finish.

CIDR Test Process
Figure 2: This procedure generates a rolling set of results that are captured multiple times a day (see Figure 3).

You’ll notice that this is a shotgun approach to validating software performance of the ecosystem. While a more specific or targeted approach might be preferred, there just isn’t enough computing power or manpower in the world to test every permutation of software in a stack. It’s better to think of the testing process as analogous to creating a video: taking a picture at frequent-enough intervals to view performance as a process in motion. If issues arise, they can be revisited with a slow-motion camera to capture additional detail.

effects of specific updates on performance
Figure 3: Results for tests of a specific workload show the effects of specific updates (large dots) on performance. The red dot indicates the point at which the results dropped below the computed regression threshold. (Source: Ampere)

Keys to successful testing

Start with a strategy.

A test is only as effective as its design. Before capturing any data, think through your strategy. Start by understanding exactly what you’re trying to measure and work backward. For example, when testing memory bandwidth, ensure you don’t involve the caches, or you’ll get misleading data.

Figure out which characteristics of the system you’re trying to measure. Identify elements that need to remain static between runs and which can (and should) change between runs. A good plan means good results.

Be careful with data capture.

The purpose of continuous performance-regression testing is not just to catch regressions but also to assist in identifying the root causes. If you don’t collect the right information during test cycles, you can burn a lot of time just trying to reproduce a previous result. It’s a good idea to capture extra git hashes or other versioning information to increase granularity during debug. You won’t use 90% of the data you collect, but bits are cheap — and when you do need them, they can save time and frustration.

We use this technique when regressing the performance of open-source projects. We run our tests every eight to 10 hours instead of against every single commit. This saves quite a few compute cycles. During this testing process, we capture commit hashes and versions of every library on the system so that we can fully reproduce the software stack if needed. The debug process for a regression then becomes executing a git bisect against the software being tested.
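As a rough sketch of this kind of capture, the Python below records a commit hash per repository plus a few system details alongside each result. The function names and the snapshot layout are assumptions made for illustration, and a real capture would go much further (package versions, firmware revisions, kernel configuration). The only external command used is `git rev-parse HEAD`.

```python
import platform
import subprocess
import sys
from datetime import datetime, timezone


def git_commit(repo_dir: str) -> str:
    """Return the checked-out commit hash of repo_dir, or 'unknown'."""
    try:
        out = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not a git checkout
        return "unknown"


def capture_snapshot(repo_dirs: dict) -> dict:
    """Collect versioning details to store next to a test result."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "commits": {name: git_commit(path) for name, path in repo_dirs.items()},
    }
```

Storing the snapshot with every result gives git bisect a known-good and a known-bad commit to start from when a regression appears.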

Make sure you’re measuring what you intend to measure. If you’re capturing data incorrectly, you may end up debugging an unrelated issue. For example, consider a memory bandwidth test (like STREAM). If the block of memory written during the test is small enough to fit in the cache, the test will partially evaluate cache performance rather than raw memory bandwidth. Every workload has important configuration requirements; make sure you’re doing due diligence and being intentional with data capture.
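The sizing check for a memory bandwidth test comes down to simple arithmetic, sketched below. The factor of four reflects my understanding of the STREAM run rules, which ask for each array to be at least four times the size of the available cache. Treat that factor, the 8-byte element size, and the function names as assumptions, and confirm them against the benchmark version you run.

```python
def min_stream_array_elements(llc_bytes: int, element_bytes: int = 8, factor: int = 4) -> int:
    """Smallest per-array element count whose footprint is factor x the cache size."""
    if llc_bytes <= 0 or element_bytes <= 0 or factor <= 0:
        raise ValueError("sizes and factor must be positive")
    needed_bytes = factor * llc_bytes
    return -(-needed_bytes // element_bytes)  # ceiling division


def array_is_large_enough(elements: int, llc_bytes: int,
                          element_bytes: int = 8, factor: int = 4) -> bool:
    """True if an array of `elements` entries is big enough to defeat the cache."""
    return elements >= min_stream_array_elements(llc_bytes, element_bytes, factor)
```

For a hypothetical system with 32 MiB of last-level cache, each array would need at least 16,777,216 eight-byte elements (128 MiB), so an array of 10,000,000 elements would still be partly measuring the cache.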

Decouple data acquisition and analysis.

Design the test strategically, but be sure the data can be leveraged beyond the initial capture. In other words, keep the raw data. Capturing and maintaining raw data will support a richer level of analysis after the fact.

In one workload, we discovered that things ran twice as fast on Ubuntu as on CentOS. We had the complete kernel configuration and software settings available as raw data, pulled from the systems at test time, and developed a series of studies by diffing the configurations of the two OS distributions. We then took those studies to bare-metal systems to verify the hypotheses that the diffs suggested. By moving this analysis off-system and automating it against the raw data, we saved days of engineering effort and system time. Multiply those savings across hundreds of workloads, and you impact your schedule in a very positive way.
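A configuration diff of this kind can run entirely off-system once the raw files are stored. The sketch below assumes the standard Linux kernel config layout (`KEY=value` lines and `# KEY is not set` comments). The function names are hypothetical, and this is not the analysis tooling used for the study described above.

```python
def parse_kernel_config(text: str) -> dict:
    """Parse KEY=VALUE lines; '# KEY is not set' is recorded as 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# ") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


def diff_configs(a: dict, b: dict) -> dict:
    """Map each differing key to (value_in_a, value_in_b); None means absent."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

Each key in the resulting diff is a candidate explanation for the performance gap, which can then be tested one at a time on bare metal.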

Record and control system configurations.

It’s no secret that the hardware configuration is extremely important when measuring performance. A performance result collected on a laptop will be much different from one collected on a top-end server platform. There are even differences among deployments of the same CPU and platform. The memory configuration and speed will impact memory performance, storage technology will impact I/O performance, and many tuning factors will impact compute performance. It’s all relative to your deployed and tested configuration paired with the workload that you’re measuring.

Only regress and compare against “like” systems. What that means will differ from test program to test program, but I would go so far as to require the same exact model/SKU/version of each major hardware component. We typically will not compare regression results between two systems in which any of the major hardware components differ in these key areas.
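One way to enforce the "like systems" rule in code is to key every result by a hardware fingerprint and only compare results that share a key. The sketch below is illustrative only. The `HardwareProfile` fields are hypothetical, and which components count as "major" will depend on your test program.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical fingerprint of the major components of a test system."""
    cpu_sku: str
    platform_model: str
    memory_config: str   # e.g. DIMM count, capacity, and speed
    storage_model: str


def group_like_systems(results):
    """Group (profile, score) pairs so scores are only compared within a profile."""
    groups = defaultdict(list)
    for profile, score in results:
        groups[profile].append(score)
    return dict(groups)
```

Because the dataclass is frozen, two profiles are equal (and hash the same) only when every field matches, so a system with different memory lands in a separate group and never feeds the same regression analysis.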

Create a standardized test format and language.

Of course, you can’t debug an issue if you can’t reproduce the result. This includes not just the test tool and system configuration but also the method of returning and interpreting the results. Methods for parsing results can vary wildly among test libraries. Communicating workload flags can become a game of telephone in which bits of information are lost as the message passes from engineer to engineer.

It’s common to see something like this communicated between engineers doing performance debug:

fio --filename=devicename --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

This command line is specific and will generate reproducible results. It’s cryptic, however, and doesn’t lend itself to database storage or easy comparison.

We devised a solution by creating an extension to the open-source Phoronix Test Suite called Phoronix Test Extensions. These are clearly enumerated and identified Phoronix-compatible tests that never change, can easily be communicated, can be stored in a database, and present output in a standardized format for easy and uniform processing. This type of approach streamlines the process and dramatically improves the quality and reliability of results.

For example, the above FIO command line might be packaged in a Phoronix-compatible test, called ptx-io-fio-randread-4k-libaio-iod256-000001, that gets codified in a source code repository from which it can be referenced and run. Because the test is fully compatible with the Phoronix test runner, it can be run anywhere Phoronix runs, making it extremely portable and flexible. It also outputs a standard composite.xml results format, as defined in the Phoronix test runner — making the results of any test in the library uniform and parsable.
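To illustrate the idea of a codified test definition, the sketch below stores the fio options shown earlier as structured fields and derives both the command line and a descriptive identifier from them. It is a hypothetical Python stand-in, not the Phoronix Test Extensions implementation or the Phoronix test profile format. The class, its fields, and the naming helper are assumptions that simply imitate the naming pattern in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FioTest:
    """Immutable description of one fio workload (hypothetical structure)."""
    rw: str
    bs: str
    ioengine: str
    iodepth: int
    runtime: int
    numjobs: int
    serial: int = 1

    @property
    def name(self) -> str:
        """Descriptive identifier that encodes the key parameters."""
        return (f"ptx-io-fio-{self.rw}-{self.bs}-{self.ioengine}"
                f"-iod{self.iodepth}-{self.serial:06d}")

    def argv(self, filename: str) -> list:
        """Render the fio command line for the given target device or file."""
        return [
            "fio",
            f"--filename={filename}",
            "--direct=1",
            f"--rw={self.rw}",
            f"--bs={self.bs}",
            f"--ioengine={self.ioengine}",
            f"--iodepth={self.iodepth}",
            f"--runtime={self.runtime}",
            f"--numjobs={self.numjobs}",
            "--time_based",
            "--group_reporting",
            "--name=iops-test-job",
            "--eta-newline=1",
            "--readonly",
        ]
```

Because the definition is a frozen record rather than a string passed from engineer to engineer, it can be stored in a database, compared field by field, and rendered into exactly the same command every time.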

Don’t miss mild/moderate performance changes.

Another trap that can be overlooked when dealing with continuous performance-regression activities is the reality that people are often working on performance improvements. This is especially true in silicon development, where performance is one of the highest priorities. This means that a baseline for performance regression needs to shift as work is done in the software stack or ecosystem.

Imagine that you collect a load of data for a workload and have a high level of confidence in a baseline result. Over the course of the year, your teams push performance gradually higher. This is objectively good news, but the potential pitfall is that it creates a gap that can hide regressions that remain above the baseline (see Figure 4).

Measured vs Expected Performance
Measured vs Expected Performance 2
Figure 4: Plots of measured versus expected performance demonstrate how a significant regression (15% bottom) that remains above a baseline could be hidden.

The red line in the figure signifies the performance-regression baseline (failure criteria) set at the beginning of the project. The top chart shows a significant incremental performance improvement throughout the first year of development (into December). The bottom chart shows a large performance regression the following January. The baseline criteria will not flag this as a regression, however, because the criteria do not account for the incremental performance improvements over the year of development.

Manually adjusting performance baseline criteria would be costly and error-prone. Our in-house system automatically adjusts baselines based on every result collected. The more test results it collects, the smarter the system becomes.
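A simple way to picture an automatically adjusting baseline is a rolling window over recent results. The sketch below is not the in-house system described here, whose method is not detailed in this article. It is one common technique, assuming higher scores are better, with a hypothetical class name and arbitrary default values for the window size and the number of standard deviations.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flag results that fall well below the recent history of a workload."""

    def __init__(self, window: int = 30, sigmas: float = 3.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # most recent accepted scores
        self.sigmas = sigmas
        self.min_samples = min_samples

    def threshold(self):
        """Current regression threshold, or None until enough data exists."""
        if len(self.history) < self.min_samples:
            return None
        return mean(self.history) - self.sigmas * stdev(self.history)

    def check(self, score: float) -> bool:
        """Return True if score is a regression; otherwise absorb it."""
        limit = self.threshold()
        regressed = limit is not None and score < limit
        if not regressed:
            self.history.append(score)  # baseline follows improvements
        return regressed
```

In the scenario of Figure 4, a static baseline set at the start of the project would pass a result that is 15% below recent performance as long as it stays above the original value, while a rolling threshold that has followed the year of improvements would flag it. Regressed results are deliberately kept out of the history here so that the baseline does not drift downward, which means a person still has to accept a drop that turns out to be intentional.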

Remember, the test process can affect system performance.

It’s an unfortunate reality of performance testing that the measurement process itself can affect the results. Capturing system data such as clock frequencies, active processes, and CPU utilization can eat up system resources, reducing workload performance in some (but not all) configurations. For the unwary, this can lead to time wasted chasing phantom regressions.

The solution is to separate the hardware monitoring process from the performance measurement process. For example, you could do four test runs for each configuration. Use the first three datasets in the performance-regression analysis. The fourth run would measure hardware behavior. The results of the fourth run would be used strictly to provide system measurement information and would not be used in the regression analysis.
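The four-run scheme might be organized as in the sketch below. The two callables are hypothetical hooks into your own harness: `run_workload` returns a score from an unmonitored run, and `run_with_monitoring` runs the workload with data collection enabled and returns whatever telemetry it gathered. Summarizing the three scored runs with a median is an assumption for illustration, not something prescribed here.

```python
from statistics import median


def run_configuration(run_workload, run_with_monitoring, scored_runs: int = 3) -> dict:
    """Run unmonitored scored runs first, then one monitored run for telemetry."""
    # Scored runs: no monitoring overhead, used for regression analysis.
    scores = [run_workload() for _ in range(scored_runs)]
    # Monitored run: its performance is never fed into the regression analysis.
    telemetry = run_with_monitoring()
    return {"score": median(scores), "scores": scores, "telemetry": telemetry}
```

Keeping the two kinds of output in separate fields makes it hard to mix a monitored run into the regression data by accident.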

Develop effective, standardized reporting.

The best test infrastructure is useless if the results are not presented in an actionable way. Poor data science practices can easily misrepresent performance or obscure patterns. Data plots with inconsistent scales, or with axes that do not start at zero, can prevent easy comparison. Showing single-run changes without also showing variance can also be problematic or misleading. Some tests are hyper-consistent, so a delta of 1% is huge. For others, ±2% would be a normal intra-run deviation. Data presentation must make those differences easy to detect in context.
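One way to put a delta in context is to report it next to the workload's own noise level. The sketch below is a simple illustration under stated assumptions: noise is estimated as the standard deviation of the baseline runs relative to their mean, and a change counts as significant when it exceeds a chosen multiple of that noise. The function name and the default multiple of two are hypothetical.

```python
from statistics import mean, stdev


def describe_change(baseline_runs, new_runs, sigmas: float = 2.0) -> dict:
    """Report the percentage change together with the baseline's noise level."""
    base_mean = mean(baseline_runs)
    delta_pct = 100.0 * (mean(new_runs) - base_mean) / base_mean
    noise_pct = 100.0 * stdev(baseline_runs) / base_mean
    significant = abs(delta_pct) > sigmas * noise_pct
    return {"delta_pct": delta_pct, "noise_pct": noise_pct, "significant": significant}
```

With this framing, the same 1% drop is reported as significant for a hyper-consistent workload and as within noise for a workload whose results normally vary by a couple of percent.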

The sheer volume of data produced by continuous performance-regression testing demands an easy format for visualizing results. We suggest a standardized performance-regression report that everyone consumes. This centralizes data science best practices and creates a consistent visual language that everyone can become familiar with.

Data that isn’t actionable isn’t worth looking at.


Continuous performance-regression testing is well-known among software developers, especially those in the web development domains. Most of the modern development practices that software developers have embraced as mainstays are not yet widely practiced in hardware development, but this one can be a powerful tool for hardware and lower-level software projects as well.

Test results are only as good as the planning, procedures, and execution of the tests themselves. Applying the techniques I’ve described will enable you to remain alert to potential pitfalls.
