Improving test case quality with mutation testing

Article by: Frank Büchner

This article shows how mutation testing can be applied to improve the quality of your test cases, which problems mutation testing raises, and how they can be overcome.

Mutation testing rates the quality of your test cases. It re-executes test cases that have already passed, but on a changed test object, and reveals whether the test cases detect the changes. Standards for the development of safety-critical systems, e.g. IEC 61508, recommend mutation testing. In practice, automating both the generation of mutations and the test execution is indispensable. Automated mutation testing is the most important new functionality in the new main version V4.3 of TESSY, the tool for automated unit, module and integration testing of embedded software.

Mutation Testing Fundamentals

Mutation testing repeats the execution of test cases that a test object, for example a software unit, has already passed. However, when the test cases are repeated, they are not run against the original source code of the test object, but against code that has been changed (“mutated”). The changes can concern minor details, such as replacing a logical AND with a logical OR; they can also be drastic, such as removing the else branch of an if-instruction. Of course, the test object must remain compilable after the change, otherwise the tests cannot be repeated.

When the test is repeated with the mutated source code, the question is whether the existing test cases reveal the mutation (the technical term is “kill”). A mutation is killed if at least one test case fails when the test is repeated. If this does not happen, the test cases do not detect that the source code has been changed, or in other words: the test cases also consider a test object other than the original one as correct. This is worrying and needs to be investigated further. For this investigation, it is helpful if only a single mutation has been made.
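To make the term “kill” concrete, here is a minimal Python sketch. The test object may_start(), its mutant and the test data are hypothetical and do not come from TESSY or from the figures in this article; the sketch only illustrates the AND-to-OR mutation mentioned above.

```python
def may_start(door_closed: bool, belt_fastened: bool) -> bool:
    """Hypothetical test object: both conditions must hold."""
    return door_closed and belt_fastened


def may_start_mutant(door_closed: bool, belt_fastened: bool) -> bool:
    """Same code with a single mutation: 'and' replaced by 'or'."""
    return door_closed or belt_fastened


# Each test case is (input arguments, expected result).
# All of them pass on the original test object.
weak_tests = [((True, True), True), ((False, False), False)]
strong_tests = weak_tests + [((True, False), False)]


def kills(test_object, tests) -> bool:
    """A mutant is killed if at least one test case fails on it."""
    return any(test_object(*args) != expected for args, expected in tests)
```

With weak_tests the mutant survives, because both test cases deliver the same result on the original and on the mutant. The additional test case in strong_tests fails on the mutant and therefore kills it.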

If a surviving mutation is not an equivalent mutation, the quality of the test cases is lacking. An equivalent mutation does not change the externally visible behavior of the test object and thus cannot be killed. An example of an equivalent mutation is given further below. Of course, it matters how radical the mutation was: one subtle mutation is more difficult to detect than several drastic changes. Usually, several test runs with different mutations are carried out, and their results together assess the test case quality.

Figure 1 shows the mutation test process as it is carried out automatically by TESSY. After the original source code passes all existing tests, the actual mutation test process can be started. TESSY performs exactly one mutation and repeats all existing tests, of course recording whether a mutation was killed or not. Then the original test object is restored and another mutation is carried out.
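The process described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not TESSY's implementation: mutants are modelled as hand-written functions instead of mutated source code, and the names run_tests(), mutation_test() and is_negative() are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[tuple, object]  # (input arguments, expected result)


def run_tests(test_object: Callable, tests: List[TestCase]) -> bool:
    """Return True if every test case passes on the given test object."""
    return all(test_object(*args) == expected for args, expected in tests)


def mutation_test(original: Callable,
                  mutants: Dict[str, Callable],
                  tests: List[TestCase]) -> Dict[str, str]:
    """Apply one mutant at a time and record whether the tests kill it."""
    if not run_tests(original, tests):
        raise ValueError("all tests must pass on the original first")
    results = {}
    for name, mutant in mutants.items():
        # Exactly one mutation is active per run. The original function is
        # never modified, which corresponds to restoring the test object.
        results[name] = "survived" if run_tests(mutant, tests) else "killed"
    return results


# Hypothetical test object and two hand-written mutants of it.
def is_negative(x: int) -> bool:
    return x < 0


mutants = {
    "'<' -> '<='": lambda x: x <= 0,
    "'<' -> '>'": lambda x: x > 0,
}
tests = [((-3,), True), ((4,), False)]
report = mutation_test(is_negative, mutants, tests)
```

In this sketch, the drastic mutation to '>' is killed, while the subtle mutation to '<=' survives because no test case uses the input value 0.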


Figure 1: TESSY automates the entire mutation test process. (Source: Hitex)

Figure 2 shows the mutations that TESSY can carry out as of version V4.3. The user can select the mutations to be applied and thereby also influence the number of mutations carried out, which in turn affects the execution time of the entire mutation test process.


Figure 2: The default mutations carried out by TESSY V4.3. (Source: Hitex)

Two Assumptions for Mutation Testing

The mutations performed by TESSY by default are subtle, for example the relational operator ‘<’ becomes ‘<=’. This is based on the assumption of the “competent programmer” [Liggesmeyer], which says that a skilled software developer only makes minor mistakes, for example a loop that runs once too often or once too few (“off-by-one” error). To find out whether the test cases detect such minor errors and are therefore of good quality, the mutations have to be subtle. Radical mutations, such as the removal of one or even several instructions, should be revealed even by low-quality test cases. Another, empirically confirmed assumption [Offut] states that there is a coupling effect: test cases that kill mutants containing a single mutation also kill mutants containing multiple mutations. It is therefore sufficient to carry out only one mutation at a time.
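The idea of subtle mutations that are applied one at a time can be sketched as follows. The mutation table and the function single_mutations() are assumptions made for this illustration; the sketch works on plain text with a regular expression and would need a real parser for actual C code (it would, for instance, mistake the operators ‘->’ or ‘<<’ for relational operators).

```python
import re

# Hypothetical mutation table: each relational operator and its subtle
# replacement. A real tool offers a configurable set of mutations.
REPLACEMENTS = {
    "<": ["<="], "<=": ["<"],
    ">": [">="], ">=": [">"],
    "==": ["!="], "!=": ["=="],
}
# Two-character operators come first so that '<=' is not matched as '<'.
OPERATOR = re.compile(r"<=|>=|==|!=|<|>")


def single_mutations(source: str) -> list:
    """Return every variant of `source` that differs in exactly one operator."""
    variants = []
    for match in OPERATOR.finditer(source):
        for replacement in REPLACEMENTS[match.group()]:
            mutated = source[:match.start()] + replacement + source[match.end():]
            variants.append(mutated)
    return variants
```

For the decision "a < b && c >= d", the function returns two variants, each with exactly one changed operator, so that a surviving mutant can be traced back to a single change.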

Example

We consider a test object that has passed four test cases (Figure 3) and achieved 100% code coverage with these test cases.


Figure 3: The test data of the four test cases, all of which pass when applied to the original test object. (Source: Hitex)

If TESSY carries out the mutation test (with the standard settings for the mutations as in Figure 2), the result is a killed mutant and a surviving mutant (Figure 4).


Figure 4: Outcome of the mutation test in TESSY. (Source: Hitex)

In the upper left part of Figure 4, the killed mutation (“mutation caused test failure”) is shown. This mutation changed the relational operator in the first if-instruction of the test object (highlighted on the top right side of Figure 4) from ‘<’ to ‘<=’. As a result, a test case fails, which is the desired outcome in mutation testing. Therefore, this mutation is marked with a green tick.

In the bottom left part of Figure 4 you can see the surviving mutation (“mutation survived all test cases”); this mutation changed the relational operator in the second if-instruction of the test object (highlighted on the bottom right side of Figure 4) from ‘>’ to ‘>=’. No test case detected this change by failing. This is questionable and needs to be investigated.

Mutation Score and Test Case Quality

The mutation score is the ratio of killed mutations to all mutations.


Figure 5: The mutation score in TESSY. (Source: Hitex)

The figure above (Figure 5) shows the mutation score that TESSY determined for the four test cases (refer again to the test data for the four test cases, numbered 1.1 through 4.1, shown earlier in Figure 3).

Test case 2 killed one of the two mutants, because test case 2 failed due to the mutation in the first if-instruction (v1 < r1.range_start) from ‘<’ to ‘<=’. This results in a mutation score of 50%. Test case 2 is marked with a green check mark in column M because it killed a mutant. The other three test cases did not kill any mutant and are therefore marked with a red cross and have a mutation score of 0%.
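The calculation of the mutation score is simple enough to be sketched in Python. The function name mutation_score() and the dictionary layout are assumptions for this illustration; the two entries mirror the outcome of the example (one killed and one surviving mutant).

```python
def mutation_score(results: dict) -> float:
    """Ratio of killed mutants to all applied mutants, in percent."""
    if not results:
        raise ValueError("no mutants were applied")
    killed = sum(1 for outcome in results.values() if outcome == "killed")
    return 100.0 * killed / len(results)


# Outcome of the example: one mutant killed, one mutant survived.
example = {
    "first if-instruction: '<' -> '<='": "killed",
    "second if-instruction: '>' -> '>='": "survived",
}
```

Calling mutation_score(example) yields 50.0, which matches the score of 50% stated above.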

Test case 2 killed a mutant and is therefore of higher quality than the other test cases, which did not kill any mutant. This is due to the value of the variable v1 in test case 2, for which the outcome of the first if-instruction depends on the exact relational operator. In test case 2, both the variable v1 and the variable r1.range_start have the value 5, so the decision in the first if-instruction is ‘5 < 5’, which evaluates to “false”. In the mutant, the decision is ‘5 <= 5’, which evaluates to “true”. Therefore, the second test case delivers an unexpected result (“no” instead of the correct and expected “yes”), which kills the mutant.

Test case 4 should kill the other mutation (from ‘>’ to ‘>=’) in the decision of the second if-instruction (v1 > r1.range_start + r1.range_len). But it does not, because the value of v1 in test case 4 is unsuitable. The variable v1 has the value 9, and r1.range_start + r1.range_len results in 5 + 2 = 7. Thus, the decision in the second if-instruction is ‘9 > 7’ in the original and ‘9 >= 7’ in the mutant, both of which evaluate to “true”. The original and the mutant therefore both deliver the correct result “no” and both pass the fourth test case, which means the mutant is not killed.

Test case 2 has better quality than test case 4 because test case 2 uses a boundary value and test case 4 does not. Test case 2 uses the boundary value 5 which is the starting value of the range that starts at 5 and has the length 2. With the value 9 for the variable v1, test case 4 does not use a boundary value of the range.
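The effect of boundary values can be reproduced with a small Python model of the test object. The model is an assumed reconstruction from the two decisions quoted in the text; the actual source code is only visible in Figure 4, and the function name is_value_in_range(), the Range class and the additional test value 7 are hypothetical.

```python
import operator
from dataclasses import dataclass


@dataclass
class Range:
    range_start: int
    range_len: int


def is_value_in_range(r1: Range, v1: int, lt=operator.lt, gt=operator.gt) -> str:
    """Assumed model of the test object; lt and gt can be swapped to mutate it."""
    if lt(v1, r1.range_start):
        return "no"
    if gt(v1, r1.range_start + r1.range_len):
        return "no"
    return "yes"


def mutant_first_if(r1: Range, v1: int) -> str:
    """Mutation of the first if-instruction: '<' becomes '<='."""
    return is_value_in_range(r1, v1, lt=operator.le)


def mutant_second_if(r1: Range, v1: int) -> str:
    """Mutation of the second if-instruction: '>' becomes '>='."""
    return is_value_in_range(r1, v1, gt=operator.ge)


r1 = Range(range_start=5, range_len=2)

# Test case 2 (v1 = 5, lower boundary): the first mutant answers differently.
tc2 = (is_value_in_range(r1, 5), mutant_first_if(r1, 5))

# Test case 4 (v1 = 9, no boundary): original and second mutant agree.
tc4 = (is_value_in_range(r1, 9), mutant_second_if(r1, 9))

# Hypothetical extra test case on the upper boundary (v1 = 7).
extra = (is_value_in_range(r1, 7), mutant_second_if(r1, 7))
```

In this model, the value 7 makes the second decision ‘7 > 7’ in the original and ‘7 >= 7’ in the mutant, so a test case with v1 = 7 and the expected result of the original would kill the surviving mutant.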

This demonstrates why boundary values form good test data and why standards for the development of safety-critical software recommend boundary values as test data. For example, IEC 61508 [61508] recommends the “boundary value analysis” method in Tables B.2 and B.3 of Part 3. In both tables, this method is recommended for SIL 1 and highly recommended for SIL 2 to 4. ISO 26262 [26262] also mentions as method 1c in Table 8 of Part 6 “analysis of boundary values” as a procedure for obtaining test data for software unit testing. The method is recommended for ASIL A and highly recommended for ASIL B to D.

Mutation testing can also assess test case sets. A set of test cases is called adequate if it kills all mutants. The smaller an adequate test case set, the better. Mutation testing can also be used to assess test case construction methods, by comparing how many mutants the resulting test cases kill.

Endless Loops and Crashes

Mutations can also result in endless loops, which means that a test never comes to an end. To ensure that such a mutation does not bring the entire process to a standstill, TESSY monitors the execution time. If the execution time with a mutation exceeds the execution time without a mutation by a factor of ten, TESSY aborts the test execution. An endless loop, and hence a timeout, kills the mutant. A mutation can also cause the test object to crash, for instance due to a division by zero. A crash of the mutated test object also kills the mutant. After a timeout or a crash, the mutation test process continues if more mutations are applicable.


Figure 6: An endless loop kills the mutant. (Source: Hitex)

In the example above (Figure 6), the count() function is tested with a test case that has the input value 10 for the parameter x and delivers the correct result by returning the value 1. This test case kills all four mutations depicted on the left-hand side of Figure 6. The third mutation (from ‘<=’ to ‘>=’) is not killed by a failing test case, as usual, but by an endless loop and the timeout triggered by it. TESSY rates this mutation as killed. After that, the fourth mutation is carried out.
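How a timeout can kill a mutant is sketched below in Python. The body of count() is a hypothetical stand-in that merely reproduces the behavior described in the text (input 10, result 1, endless loop after mutating ‘<=’ to ‘>=’); the real source is only shown in Figure 6. As a deterministic substitute for measuring execution time, the sketch uses an iteration budget of ten times the number of loop passes of the original.

```python
import operator


class Timeout(Exception):
    """Raised when the loop exceeds its iteration budget."""


def count(x: int, compare=operator.le, budget: int = 1000) -> int:
    """Hypothetical stand-in for count(): number of loop passes while x <= 10."""
    passes = 0
    while compare(x, 10):
        passes += 1
        x += 1
        if passes > budget:
            raise Timeout("iteration budget exceeded")
    return passes


def outcome(compare, x: int, expected: int, budget: int) -> str:
    """Classify one run of one test case; a timeout counts as a kill."""
    try:
        result = count(x, compare, budget)
    except Timeout:
        return "killed (timeout)"
    return "survived" if result == expected else "killed"


baseline_passes = count(10)      # the original needs one loop pass
budget = 10 * baseline_passes    # abort at ten times the baseline
```

With this budget, outcome(operator.ge, 10, 1, budget) reports a kill by timeout, whereas a mutation to ‘<’ is killed in the usual way because the result differs from the expected value.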


Figure 7: A crash kills the mutant. (Source: Hitex)

In the example above (Figure 7), the crash_by_divide() function is tested with a test case in which both parameters a and b have the same value. This test case kills the mutation of ‘!=’ to ‘==’ in the decision of the if-instruction, because the mutated test object crashes.
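A kill by crash can be sketched in the same way. The body of crash_by_divide() below is a hypothetical stand-in, since the real source is only shown in Figure 7, and a Python ZeroDivisionError stands in for the crash of the mutated test object.

```python
def crash_by_divide(a: int, b: int, differs=lambda a, b: a != b) -> int:
    """Hypothetical stand-in: divides only if a and b differ."""
    if differs(a, b):
        return 100 // (a - b)
    return 0


def mutant(a: int, b: int) -> int:
    """Mutation of the decision: '!=' replaced by '=='."""
    return crash_by_divide(a, b, differs=lambda a, b: a == b)


def run_test(test_object, args: tuple, expected: int) -> str:
    """Classify one test run; an exception stands in for a crash."""
    try:
        result = test_object(*args)
    except ZeroDivisionError:
        return "killed (crash)"
    return "survived" if result == expected else "killed"
```

For a test case with a = b = 3, the original returns 0, while the mutant enters the division with a - b = 0 and is rated as killed.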

Equivalent Mutations Are Problematic

The main problem with mutation testing is equivalent mutations. These mutations do not change the external behavior of the test object.


Figure 8: Example for an equivalent mutation. (Source: Hitex)

In the figure above (Figure 8), an equivalent mutation is shown. The mutation of the relational operator from ‘>’ to ‘>=’ does not have an externally visible effect and therefore cannot be killed by any test case. But the input value 0 definitely causes different internal program behavior in the original and in the mutated source code: the original code executes the else-branch of the if-instruction, whereas the mutated code executes the then-branch.
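The following Python sketch shows an equivalent mutation of the same kind. The function magnitude() is a hypothetical example and not the code of Figure 8; it was chosen because the input value 0 takes different branches in the original and in the mutant but yields the same result.

```python
def magnitude(x: int) -> int:
    """Hypothetical test object: absolute value of x."""
    if x > 0:
        return x    # then-branch
    else:
        return -x   # else-branch, taken for x == 0 in the original


def magnitude_mutant(x: int) -> int:
    """Single mutation: '>' replaced by '>='."""
    if x >= 0:
        return x    # then-branch, taken for x == 0 in the mutant
    else:
        return -x


# The two decisions only differ for x == 0, and both branches return 0
# for this input, so no test case can tell the two versions apart.
differences = [x for x in range(-1000, 1001)
               if magnitude(x) != magnitude_mutant(x)]
```

The list differences stays empty for the sampled inputs, which illustrates why such a mutant survives every test case, however good the test cases are.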

Because equivalent mutations cannot be killed by any test case, every surviving mutation must be checked manually (by a human) to determine whether or not it is an equivalent mutation. That can be time-consuming. However, it helps if, as with TESSY, only one mutation is made at a time. Furthermore, the test objects at hand are software units, which are small in comparison to the entire software. This reduces the effort of checking for equivalent mutations.

In addition, we can assume that safety-critical software that undergoes unit testing has better test cases than other software, because it is required to achieve a high percentage of code coverage. This means that only a small part (if any) of safety-critical software is not executed by any test case. By contrast, software that is not tested as thoroughly as safety-critical software might have large parts of code that are not executed by any test case. A mutation in a part of the software that is not executed by any test case obviously cannot be killed. This means a higher number of surviving mutants and consequently a higher effort to distinguish between insufficient test cases and equivalent mutants.

Equivalent mutations can be viewed as killed mutations; they do not indicate low-quality test cases.

Avoid Unnecessary Mutations

During unit testing of safety-critical software, a surviving mutation (which is not an equivalent mutation) should lead to changed or additional test cases. Due to the required high integrity of the software, the final goal is that all applied mutations are killed (again, excluding equivalent mutations). This might not be the goal for integration testing. The main objective for integration testing is to test the correct interaction of the units. Thus, test cases for integration testing check the interaction of the units and not if each single unit reacts correctly to each error condition (e.g. an unexpected NULL pointer) that might be possible.

Provoking error conditions during integration testing might be technically difficult and might therefore be neglected. This is justifiable because one can assume that the reaction to error conditions was tested during unit testing. Therefore, reaching 100% code coverage is not paramount during integration testing; in particular, parts of the code that represent defensive programming (e.g. the reaction to an unexpected NULL pointer) might stay uncovered. Mutations in code that is not executed by any test case obviously cannot be killed. Applying such mutations causes human effort for their manual investigation, because it is not obvious whether a mutation has survived because it is an equivalent mutation or because of low-quality test cases. Furthermore, applying such mutations increases the execution time of mutation testing.

If code coverage information is available in TESSY for mutation testing, TESSY avoids mutations in parts of the code that are not executed by any test case. This feature is especially useful during integration testing, because of potentially large parts of uncovered code that are not and will never be executed by any test case. Although it is less useful there, TESSY also suppresses mutations in uncovered code during unit testing.


Figure 9: Two mutations are suppressed, because they cannot be killed. (Source: Hitex)

In the figure above (Figure 9), the functions push() and pop() of the abstract data type “stack” are tested together in an integration test. On the right-hand side of Figure 9, the source code of push() is displayed. The first if-instruction in line 15 checks if the stack pointer (the variable next_free_element) has reached the top of the stack, indicating a stack overflow. The then-part of the first if-instruction is shaded in red, indicating it was not executed by any test case. In consequence, a mutation in the decision of the second if-instruction (in line 17) is undetectable and will survive.

Leveraging code coverage information, TESSY suppresses two mutations of the relational operator ‘>’ in the decision of the second if-instruction (error_report_level > 0), shaded in grey on the right-hand side of Figure 9. On the left-hand side of Figure 9, the same decision is shaded in grey, and below it the two possible mutations are displayed (from ‘>’ to ‘<’ and from ‘>’ to ‘>=’). Neither mutation was applied. This is indicated by the dash (‘-’) in the column “Result”.
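The suppression of mutations in uncovered code amounts to a filter over the candidate mutations. The Python sketch below illustrates the principle only; it is not TESSY's implementation. The data structure, the function split_by_coverage() and the candidate mutation in line 15 are assumptions, while the line numbers 15 and 17 and the two suppressed mutations follow the description of Figure 9.

```python
from typing import List, NamedTuple, Set, Tuple


class Mutation(NamedTuple):
    line: int          # source line of the mutated decision
    description: str


def split_by_coverage(mutations: List[Mutation],
                      covered_lines: Set[int]
                      ) -> Tuple[List[Mutation], List[Mutation]]:
    """Separate mutations worth applying from those in unexecuted code."""
    applicable = [m for m in mutations if m.line in covered_lines]
    suppressed = [m for m in mutations if m.line not in covered_lines]
    return applicable, suppressed


# Assumed data modelled on Figure 9: the decision in line 15 was executed,
# the decision in line 17 (inside the unexecuted then-part) was not.
mutations = [
    Mutation(15, "hypothetical mutation of the first if-instruction"),
    Mutation(17, "'>' -> '<' in error_report_level > 0"),
    Mutation(17, "'>' -> '>=' in error_report_level > 0"),
]
applicable, suppressed = split_by_coverage(mutations, covered_lines={15})
```

Only the mutation in line 15 remains applicable here; the two mutations in line 17 are suppressed and would show up with a dash instead of a result.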

If TESSY performs mutation testing without prior code coverage measurement, TESSY applies the two mutations to the decision of the second if-instruction. Of course, neither of them is killed. In contrast to unit testing, it might not be necessary in integration testing to add a test case that checks if the error condition (stack overflow in our case) is handled correctly. By avoiding these mutations, TESSY saves a lot of time, both human effort and computing time.

Mutation Testing in Standards

IEC 61508 describes mutation testing as “test case execution from error seeding” and recommends it for Safety Integrity Levels (SIL) 2 to 4 (in Table B.2 of Part 3). IEC 61508 also states (in Section C.5.6 of Part 7) that mutation testing can be used predictively: the total number of errors can be estimated from the number of errors that a test suite discovers in the original test object and the number of mutations that this test suite kills. The underlying assumption is that the ratio of killed mutants to the total number of mutants equals the ratio of the errors found in the original test object to the total number of errors in the original test object. This estimate naturally assumes the same statistical distribution of the types and positions of the mutations and the actual errors; for example, if the actual errors are erroneous calculations but no arithmetic mutations are used, the estimate will hardly be accurate.
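The estimation can be written down as a short Python function. The function name and the numbers in the example are hypothetical; the formula merely rearranges the ratio described above (killed mutants / all mutants = errors found / total errors).

```python
def estimate_total_errors(errors_found: int,
                          mutants_killed: int,
                          mutants_total: int) -> float:
    """Estimate the total number of real errors from the mutant kill ratio.

    Rearranged from: mutants_killed / mutants_total == errors_found / total.
    """
    if mutants_killed <= 0 or mutants_killed > mutants_total:
        raise ValueError("need 0 < mutants_killed <= mutants_total")
    return errors_found * mutants_total / mutants_killed


# Hypothetical numbers: the test suite found 6 real errors
# and killed 15 of 20 mutants.
estimate = estimate_total_errors(errors_found=6, mutants_killed=15, mutants_total=20)
```

With these numbers, the estimate is 8 errors in total, which would suggest that about 2 errors are still undiscovered. The estimate is only as good as the assumption that mutations and real errors are distributed alike.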

ISO 26262 only mentions “code mutations” in a note to the “Fault Injection Test” method (method 1l) in Table 7 of Part 6, which lists methods for software unit verification.

Conclusion

Mutation testing can reveal insufficient test cases. Improving them increases the chance of finding errors in the tested software. Therefore, mutation testing does not only assess the quality of the test cases, but can also contribute to a better quality of the tested software. The execution of the mutation test is automated in TESSY, so that the execution does not entail any greater effort.

However, even without TESSY, anyone who has a test project at hand can manually apply some mutations, re-execute the tests, and see if the test cases kill the mutations.

Terminology

Error seeding

The term that IEC 61508 uses for mutation testing (refer to Section C.5.6 of Part 7).

Coupling effect

If a mutant with a single mutation is discovered by a set of test cases, multiple mutations are also discovered.

Strong mutation test

The mutant is only considered from the outside (black box) and the mutant is only discovered through a test case, which externally delivers a different result than the original.

Weak mutation test

A test case ensures a different behavior inside the mutant than in the original. However, this other behavior is not externally visible.

Adequate test case / adequate test case set

A test case is called adequate if it kills a non-equivalent mutant. A set of test cases is called adequate if it kills all non-equivalent mutants.

Mutation score

The ratio of mutants killed to the number of mutants. Usually given in percent.

Fault injection

External errors are injected into the non-mutated test object in order to test its robustness. ISO 26262 mentions “code mutations” as an example of a fault injection test.

References

[26262] ISO 26262, International Standard, Road vehicles – Functional Safety, second edition, 2018.
[61508] IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related systems, second edition, 2010.
[Liggesmeyer] Peter Liggesmeyer: Software-Qualität: Testen, Analysieren und Verifizieren von Software. 2nd edition, Spektrum Akademischer Verlag, Heidelberg, Berlin, 2009.
[Hoffmann] Dirk W. Hoffmann: Software-Qualität. Springer-Verlag, Berlin, Heidelberg, 2008.
[Offut] A. Jefferson Offutt, Clemson University: Investigations of the software testing coupling effect. ACM Transactions on Software Engineering and Methodology, New York, Volume 1, Issue 1, Jan. 1992.

This article was originally published on Embedded.

Frank Büchner studied Computer Science at the Technical University of Karlsruhe, today the Karlsruhe Institute of Technology (KIT). Since graduating, he has spent more than thirty years working in different positions in the area of embedded systems. Over the years, he specialised in testing and software quality of embedded software. Currently he works at Hitex GmbH, Karlsruhe, Germany, as Principal Engineer Software Quality.
