Testing the test: How to use coverage metrics for more effective embedded software testing
November 13, 2015
Timing is everything in software development. Just think about the timing of when a software defect is found. Found after a release, it can be a disaster, compromising people’s safety and costing millions. Found before a release, the defect can be reduced to merely an annoyance.
This is why software testing has become an integral part of the development life cycle. Indeed, according to a 2012 survey by the Fraunhofer ESK Institute, testing is not only an important part of the software development process; for most embedded systems developers it is also the most difficult part.
Testing can be difficult for many reasons, but one of the most fundamental challenges is measuring progress. Failing to track progress with reliable metrics wastes time and can lower software quality. It is tempting to ignore metrics and aim to test everything, yet this approach is dangerous because software testing can become an endless undertaking. Glenford Myers demonstrated this in 1976 in his book “Software Reliability: Principles and Practices,” when he showed that a 100-line program could have as many as 10^18 unique paths. Modern software can grow to millions of lines of code, which makes completely exhaustive testing infeasible.
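The scale of the problem is easy to reproduce with a little arithmetic. The sketch below is not Myers's program. It assumes a hypothetical routine built from a chain of independent two-way decisions, so that each decision doubles the number of end-to-end paths. The function names and the rate of one million tests per second are illustrative assumptions.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def count_paths(independent_decisions: int) -> int:
    """Paths through a chain of independent if/else decisions: 2 ** n."""
    return 2 ** independent_decisions


def years_to_test_all(paths: int, tests_per_second: int = 1_000_000) -> float:
    """Time needed to run one test per path at an assumed, fixed test rate."""
    return paths / tests_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 30, 60):
        paths = count_paths(n)
        print(f"{n} decisions: {paths:.2e} paths, "
              f"{years_to_test_all(paths):.2e} years at 1e6 tests/s")
```

Under these assumptions, 60 sequential decisions already yield more than 10^18 paths, which would take tens of thousands of years to run at a million tests per second.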
In addition, it is often not just the software development team that needs to be satisfied with the level of testing. Customers may ask for evidence that the code was properly tested, and regulatory authorities for safety-critical industries, such as avionics, automotive, and medical fields, will want proof that there was sufficient checking for defects. Therefore, it is necessary to define a way of measuring “sufficient” testing, and it needs to be done in a way that can be measured objectively to satisfy all stakeholders in the development process.
For effective software testing, developers need to address how to measure the testing process, how to determine how much testing is enough, and how development teams can most strategically ensure that the software application they developed has been adequately tested.
What is code coverage?
Structural code coverage analysis is a way of looking at what parts of the logical structure of a program have been exercised, or “covered,” during test execution. The logical structure depends on the code coverage metric being used. For example, “Entry Point” coverage looks at which function calls or “entry points” have been exercised in a test. Likewise, “Dynamic Dataflow” coverage looks at what parts of the data flow have been exercised. While different structural coverage metrics examine code from different angles, they all share a common purpose: to give meaningful insight into the testing process by showing how much of the code is tested and which parts of the code have been exercised (Figure 1).
[Figure 1 | LDRA's TBvision code coverage results are displayed inline with system/file/function name to give a detailed overview of which aspects of the system meet the expected code coverage levels or metrics.]
Specialized structural coverage metrics can serve special use-cases for testing, such as analyzing data and control coupling. However, for measuring general test effectiveness, three code coverage metrics have found wide industry usage:
- Statement coverage (SC) – How many statements of a program have been exercised
- Decision coverage (DC) – How many branches of a decision have been exercised; this subsumes statement coverage, since covering all branches of all decisions means all statements must also be covered
- Modified condition/decision coverage (MC/DC) – This builds on decision coverage by making sure each of the subconditions of a complex decision is independently exercised in both its true and false states
These metrics have been widely recognized as ways to measure the thoroughness of testing. In particular, industries such as automotive, avionics, and industrial software have embraced these metrics in their software safety standards.
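The difference between the three metrics is easiest to see on a single decision with several subconditions. The following is a minimal sketch, not the output of any coverage tool: `valve_should_open` is a hypothetical unit, the `hit` set is hand-written instrumentation, and `has_mcdc` checks the unique-cause form of MC/DC (for each subcondition, two tests that differ only in that subcondition and produce different outcomes).

```python
def valve_should_open(a, b, c, hit):
    """Hypothetical unit under test; `hit` records the statements that ran."""
    hit.add("init")
    opened = False
    if a and (b or c):
        hit.add("open")
        opened = True
    hit.add("return")
    return opened


def statement_coverage(tests):
    hit = set()
    for test in tests:
        valve_should_open(*test, hit)
    return len(hit) / 3  # three labelled statements in the unit


def decision_coverage(tests):
    outcomes = {valve_should_open(*test, set()) for test in tests}
    return len(outcomes) / 2  # the decision can evaluate to True or False


def flip(test, index):
    """Return `test` with only the subcondition at `index` inverted."""
    return tuple(not v if i == index else v for i, v in enumerate(test))


def has_mcdc(tests):
    outcome = {test: valve_should_open(*test, set()) for test in tests}
    return all(
        any(flip(t, i) in outcome and outcome[flip(t, i)] != outcome[t]
            for t in outcome)
        for i in range(3)
    )


ONE = [(True, True, False)]
TWO = ONE + [(False, True, False)]
FOUR = TWO + [(True, False, False), (True, False, True)]

if __name__ == "__main__":
    for name, tests in (("ONE", ONE), ("TWO", TWO), ("FOUR", FOUR)):
        print(name, statement_coverage(tests), decision_coverage(tests),
              has_mcdc(tests))
```

In this sketch one test reaches every statement, a second test is needed to cover both branches of the decision, and four tests (one more than the number of subconditions) are needed before each subcondition is shown to affect the outcome independently.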
Higher criticality requires more thorough testing
Notably, these software safety standards do not mandate using statement, decision, and MC/DC coverage uniformly on all projects. Instead, each of the major industry software safety standards recommends using different levels of structural coverage depending on how critical the code is, although the level of criticality is often determined in an industry-specific way. For instance, DO-178C, the software safety standard for the avionics industry, uses the concept of software safety levels and mandates different levels of structural coverage analysis for each.
IEC 61508, a general industrial software safety standard, defines safety integrity levels (SIL) and recommends different structural coverage metrics based on each level.
In all of these standards, a common philosophy can be seen: the “safer” the code must be, the greater the thoroughness of required testing. The exact definition of what software safety means depends on the concerns, experiences, and regulatory pressures for the particular industry, but the general principle of pairing a higher required level of safety with a higher required level of structural coverage remains constant across the standards.
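As an illustration of this principle, DO-178C asks for statement coverage at software Level C, adds decision coverage at Level B, and adds MC/DC at Level A, with no structural coverage objective at Levels D and E. The sketch below encodes that mapping as a lookup table. It is a simplification for illustration only: the function name, the report format, and the flat 100 percent threshold are assumptions, and a real project would work from the standard itself, which also allows uncovered code to be resolved by analysis.

```python
# Structural coverage objectives by DO-178C software level (simplified).
REQUIRED_COVERAGE = {
    "A": ("statement", "decision", "mcdc"),
    "B": ("statement", "decision"),
    "C": ("statement",),
    "D": (),
    "E": (),
}


def unmet_objectives(level, measured):
    """Return the metrics required at `level` that are below 100 percent.

    `measured` maps a metric name to a percentage; a missing metric
    is treated as 0 percent (an assumption of this sketch).
    """
    return [metric for metric in REQUIRED_COVERAGE[level]
            if measured.get(metric, 0.0) < 100.0]


if __name__ == "__main__":
    report = {"statement": 100.0, "decision": 100.0, "mcdc": 92.5}
    for level in "ABC":
        print(level, unmet_objectives(level, report))
```

The same hypothetical coverage report satisfies Levels B and C but leaves the MC/DC objective open at Level A, which is the "more critical, more thorough" pattern described above.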
Testing should rise from requirements
Another commonality across industries in software safety standards is the belief that tests should emerge from requirements. Software requirements should determine the desired inputs and outputs of a test. If they don’t, the tests can become a parallel set of requirements and this leads to confusion and software errors. Structural coverage cannot replace requirements as the basis of testing, since coverage metrics cannot dictate how code should behave – only that it be reachable during execution (and, given the abilities of debuggers, reachable during execution can be a flexible concept).
Although complementary, testing the effectiveness of executing code and testing the completeness of requirements are two different things. Test effectiveness, as measured in structural coverage analysis, looks at what parts of the code are exercised. Test completeness, which is sometimes called “requirements coverage,” looks at whether or not the code has been tested for proper behavior for all requirements. If a software program is built according to its requirements, and if it contains no code unrelated to its requirements, then complete testing of the requirements should cause the tests to effectively exercise all of the code. If there is code not exercised by the tests, this may be code that can be eliminated, or it may be a missing requirement, or it may be a flaw in the test. In this way, structural coverage analysis can provide feedback into the test design, software implementation, and requirements specification processes.
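The two views can be put side by side with a small traceability check. This is a minimal sketch under stated assumptions: the requirement identifiers, function names, and the shape of the test records are all hypothetical, and the "exercises" sets would in practice come from a coverage tool rather than being written by hand.

```python
def analyse(requirements, code_units, tests):
    """Compare requirements coverage with structural coverage.

    Each test is a dict with 'verifies' (requirement ids it checks)
    and 'exercises' (code units it was observed to execute).
    """
    verified = set().union(*(t["verifies"] for t in tests))
    exercised = set().union(*(t["exercises"] for t in tests))
    return {
        "untested_requirements": sorted(set(requirements) - verified),
        "unexercised_code": sorted(set(code_units) - exercised),
    }


# Hypothetical project data, for illustration only.
REQUIREMENTS = ["REQ-1", "REQ-2", "REQ-3"]
CODE_UNITS = ["read_sensor", "filter_sample", "log_fault", "legacy_calibrate"]
TESTS = [
    {"verifies": {"REQ-1"}, "exercises": {"read_sensor"}},
    {"verifies": {"REQ-2"}, "exercises": {"read_sensor", "filter_sample"}},
]

if __name__ == "__main__":
    print(analyse(REQUIREMENTS, CODE_UNITS, TESTS))
```

Each entry in `unexercised_code` prompts the three-way question from the paragraph above: is it code that can be removed, a sign of a missing requirement, or a flaw in the tests?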
This relationship between exercising code and testing requirements also exists at the level of the individual requirement. From an evidence-gathering perspective, the high-level totals of how many requirements and how much code have been tested are more interesting. Yet it is more often in the testing of an individual requirement, and in the structural coverage analysis of that testing, that the most defects are identified and fixed.
Structural coverage analysis is often thought of as simply a target of achieving 100 percent of a metric, but it is essential to examine individual tests and the structural coverage resulting from them. This is especially true when the code being exercised is based on the requirement being tested. By examining the structural coverage of the code, it is possible to determine the exact behavior of the code under test and compare it to the expected behavior based on the requirement being tested. This approach reduces false negatives due to environmental factors or other parts of the code compensating for the incorrect code. In addition, if there is an incorrect behavior, structural coverage analysis often provides insight into the cause of the incorrect behavior as well.
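A test can pass for the wrong reason, and the coverage of that single test is what exposes it. The sketch below is a contrived illustration: `command_speed`, the `ACTUATOR_MAX` ceiling, and the hand-written `trace` list are hypothetical stand-ins for a unit under test and for the per-test coverage a tool would report. The limit check contains a deliberate defect that a later saturation stage happens to mask.

```python
ACTUATOR_MAX = 100  # hypothetical hardware ceiling


def command_speed(requested, limit, trace):
    """Hypothetical unit: apply the requested limit, then saturate."""
    value = requested
    if requested > 120:  # deliberate defect: should compare against `limit`
        trace.append("limit_applied")
        value = limit
    if value > ACTUATOR_MAX:  # later stage that compensates for the defect
        trace.append("saturated")
        value = ACTUATOR_MAX
    return value


def run_test(requested, limit, expected_output, expected_trace):
    """Check the output and, separately, the path the code took."""
    trace = []
    output = command_speed(requested, limit, trace)
    return {
        "output_ok": output == expected_output,
        "path_ok": trace == expected_trace,
    }


if __name__ == "__main__":
    # Requirement under test: a request above the limit is reduced to the limit.
    print(run_test(110, 100, expected_output=100,
                   expected_trace=["limit_applied"]))
```

Judged on output alone this test passes, because the saturation stage returns 100. The recorded path shows that the limit branch never ran, which points directly at the faulty comparison.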
When using structural coverage analysis to understand code behavior in this detailed manner, it is vital to be able to overlay the structural coverage analysis results on top of the analysis of the structure of the code. This overlay helps transform raw structural coverage information into a meaningful understanding of what is going on in the code (Figure 2).
[Figure 2 | LDRA's TBvision gives an interactive flowgraph view of the individual procedures, so developers can focus on which procedures provide coverage and identify aspects of the code that may need further testing.]
Set coverage goals at unit and systems levels
Structural coverage analysis goals are often set at both the unit and system levels. Unit-level structural coverage is achieved through tests at the unit level based on requirements for that unit. System-level coverage goals, on the other hand, will often start with coverage from tests on higher-level requirements. Yet if only high-level tests are used for the system-level coverage analysis, there are frequently holes in the coverage. The causes of these holes can vary. In some cases, holes in the coverage may be due to defensive programming practices required by a coding standard, but these coverage holes can be based on important functionality implemented from requirements as well.
In particular, structural coverage holes may appear when the code is based on requirements that can only be tested through conditions that are difficult or impossible to create on a high level. An example of this type of scenario is a function-level check for file-system failure. While inducing file-system failure in general may be possible, it can be highly challenging to time the file-system failure so that it occurs during that function’s execution. Moreover, doing this in a repeatable way for future regression testing can be even more difficult. In situations like this, using a lower-level test that examines the code in isolation may be necessary. For this reason, structural coverage measured from higher-level tests is usually combined with structural coverage from lower-level tests when gathering metrics for achieving testing goals.
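The file-system scenario can be sketched as follows. This is an illustration under assumptions, not a prescribed technique: `save_reading` is a hypothetical unit that receives its file-opening function as a parameter, the `hit` sets are hand-written stand-ins for tool-reported coverage, and the "system-level" run is simulated with an in-memory file so that nothing touches the disk.

```python
import io

ALL_BRANCHES = {"write_ok", "write_failed"}


def save_reading(value, open_file, hit):
    """Hypothetical unit: returns True on success, False on file-system failure."""
    try:
        with open_file("reading.log", "a") as handle:
            handle.write(f"{value}\n")
        hit.add("write_ok")
        return True
    except OSError:
        hit.add("write_failed")  # hard to reach on demand from a high-level test
        return False


def memory_open(path, mode):
    """Stand-in for a healthy file system."""
    return io.StringIO()


def failing_open(path, mode):
    """Unit-level stub that injects the failure repeatably."""
    raise OSError("injected file-system failure")


system_hits, unit_hits = set(), set()
save_reading(21.5, memory_open, system_hits)   # high-level test, normal path
save_reading(21.5, failing_open, unit_hits)    # isolated test, failure path

combined = system_hits | unit_hits
print(sorted(ALL_BRANCHES - system_hits))  # hole left by the high-level test
print(sorted(ALL_BRANCHES - combined))     # empty once unit coverage is merged
```

Because the stub raises the error every time it is called, the failure path becomes a repeatable regression test, and taking the union of the two coverage sets closes the hole left by the high-level run.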
Metrics such as statement, decision, or MC/DC coverage do not guarantee that software is defect-free. As mentioned before, truly exhaustive testing can be impossible or at least infeasible. Structural coverage metrics can, however, provide a greater sense of the reliability of code and greater confidence in testing.
Structural coverage analysis gives insight into testing activities by showing how much of the code is tested and which parts of the code have been exercised. It can be performed at the system, module, or unit level, and the results can be accumulated toward a testing goal. Code coverage should not be treated in isolation from requirements-based testing. Furthermore, there may be tests that need to be performed beyond structural coverage analysis. For instance, testing race conditions and integer-limit edge conditions can be valuable for detecting defects, but they may not contribute to your structural coverage goals. Structural coverage analysis is designed to gauge the testing you have done and to guide your test planning, but it should not be taken as a goal unto itself.
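An integer-limit defect shows how a test can be valuable without adding any coverage. In the sketch below, `midpoint` is a hypothetical unit with no branches, and `wrap_int16` is an assumed model of two's-complement wraparound in a 16-bit signed accumulator (Python integers do not overflow by themselves).

```python
INT16_MIN = -32768


def wrap_int16(value):
    """Model two's-complement wraparound of a 16-bit signed value."""
    return (value - INT16_MIN) % 65536 + INT16_MIN


def midpoint(lo, hi):
    """Hypothetical unit: midpoint computed in a 16-bit accumulator."""
    total = wrap_int16(lo + hi)
    return total // 2


if __name__ == "__main__":
    print(midpoint(10, 20))        # nominal test, executes every statement
    print(midpoint(30000, 30000))  # edge test, same statements, wrong answer
```

The nominal test alone reaches full statement and decision coverage of `midpoint`. The edge-condition test executes exactly the same statements, so it adds nothing to the coverage figures, yet it is the only one of the two that reveals the overflow.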
Accumulating structural coverage without understanding the tests can provide a false sense of security that can be more dangerous than inadequate testing. Structural coverage analysis is not a magic bullet, but a tool that needs to be used with intelligence and care. However, it is a tool that, when properly used, can make tests more useful and more effective and provide evidence of the testing process.