Thomas A. Alspaugh
Software Testing

(Under Construction)

There is a vast body of research and practical technique on software testing; this page sketches an overview.

Interface, Behavior, and Testing

Ideally, a piece of software would do everything that its interface requires and nothing it forbids, and the tests for the software would test that it does everything it should and nothing it shouldn't.

permitted / required

Figure 1. Everything required
is permitted

permitted / required

Figure 2. A different selection of
permitted behaviors is required

permitted / required

Figure 3. Everything permitted is
required

permitted / required

Figure 4. Inconsistent interface requiring some behaviors that are not permitted.

Even in our non-ideal world, an interface needs to not forbid anything it requires. Figure 1 illustrates an interface that permits everything it requires. In this and later figures, the green area represents all the behaviors that are permitted (= not forbidden), and the orange area represents all the behaviors that are required. Figure 1's interface is implementable in this sense.

The interface of Figure 1 gives the implementer some slack: there are green areas (permitted behaviors) that aren't required, so the implementer can choose to make the software do them or not as he/she pleases. Figure 2 shows a slightly different interface, also implementable (in this sense), with the same boundary between permitted and forbidden but a different selection of required behaviors. Here the developer has less slack than in Figure 1.

Figure 3 shows a completely strict interface. Everything that is permitted is required; there is no green area showing. The interface is still implementable (in this sense), in that there is no behavior that it both requires and forbids.

Figure 4 shows an inconsistent interface, one that requires behaviors that are outside the boundary of permitted behaviors. This interface is not implementable (in this sense or any other).

behavior

Figure 5. The situation we aim for

behavior

Figure 6. A module not meeting its interface

Now, how to effectively test a module against its interface? Figures 5 and 6 illustrate the two situations we hope to distinguish by testing. Figure 5 represents the module's behavior (magenta) in the set of all conceivable behaviors (light blue). The module does everything its interface requires, plus some extra things that aren't required but that are still permitted. The module does nothing that its interface does not permit. Everyone is happy; the users of the module are satisfied. It is an article of faith in the software development world that except perhaps for the very simplest software, this never happens — or if by chance it did happen, there is no way we could detect that it had. There are too many behaviors a module can enact, and too many things that could go wrong during development, so some always do. We typically get something like Figure 6, in which the module does most of what is required and only a little of what is not permitted.

Because there are so many possible things a module can do, usually an infinite number of sequences of behaviors, that it is impossible to test whether the module is like Figure 5 or like Figure 6. Instead we hope to use testing to identify problems and fix them — getting the situation closer to Figure 5 asymptotically — and verify that the most important things the module should do actually happen, and the most important things the module should never do don't happen in the most frequent situations.

behavior

Figure 7. Allocating the finite number of tests

Figure 7 suggests the choices and challenges of testing. Only a finite number of tests can be set up, performed, and interpreted — an infinite number would take forever. Thus the available number of tests must be allocated to the cases where they will do the most good. In the figure, five test cases have been allocated to verify required behavior (the Ts in the orange area), and five more to verify that specific forbidden behaviors do not occur (the Ts outside the green area).

One U has been placed in the green area but outside the orange area, illustrating that there is no benefit to testing some behaviors. The U (for Unnecessary) case can only tell us whether the behavior it tests occurs or doesn't occur. Since the green area outside the orange area is permitted behavior, there is no benefit to showing it did not occur (if that's how the test turned out); and since that area is not required behavior there is no benefit to showing it didn't occur (if that is how the case turned out). A test case in this area is not helpful either way.

Note that of the five tests of forbidden behavior, one (the one in the magenta area representing what the module does) has very valuably identified a problem, and the other four have confirmed that those four behaviors do not occur. There is no hope of testing that the module never does any of the (almost always) infinite number of forbidden behaviors. Similarly, of the five tests of required behaviors, one has very valuably identified a problem (which one?), and the other four have confirmed that those required behaviors are present. But there is no hope of testing that all the (typically infinite number of) required behaviors will occur in the (almost always infinite number of) contexts they are required in.

Fault-error-failure

The connection between a failed test and the problem(s) in the code that caused it are not necessarily simple or easy to detect. The concepts needed to think about this are:

fault
A thing that's wrong in the code is a fault. A fault is present whether the program is running or not.
error
An error is an undesired or incorrect program state. The error is embodied in the values of attributes and/or variables, and only exists while the program is running and indeed only in a particular situation then.
failure
A failure is a behavior or output of the program that is incorrect.

A program contains faults. When the program runs, a fault may cause an error. This error may eventually result in a failure.

Ordinarily, only failures are visible. Testing is geared towards identifying as many failures as possible. Once a failure has been found, the developer has to work backwards to try to identify the error(s) that resulted in the failure, and then the fault(s) that enabled the error to occur.

A little thought should show you that

Some faults are in code that is not reachable (due to mistakes in the developers' design of the program), even though if the program evolved so that the faulty code became reachable, they would cause errors. Other faults are in code that is reachable, but because the testing was insufficiently thorough and the users of the program don't exercise it in all ways, that code never runs.

Some errors are bumped back to correct states later in the program's execution before resulting in any failure. The code could be fortuitously set up so that this has to happen every time, or it may be (as above) that the particular test cases and patterns of usage that the program actually undergoes happen to bump it back, so that no failure occurs and the error is never made visible.

There are wide range of opinions on how frequently these things happen. One well-argued estimate based on careful analysis of code bases states that on the average only about 10% of faults ever cause errors, and only about 10% of those errors ever cause failures. But other researchers' estimates range as high as 90%.

Terminology

There is no one standard definition of the terms and concepts of testing. The following terms are relatively standard:

test case
A test case is the smallest unit of testing. Each case consists of
  1. the context in which this test takes place,
  2. the inputs required for ir, and
  3. the expected results.
test script
A test script is a statement of the sequence of actions that set up the context for a test case and to perform the test for the case. In practice a single script is often written to set up and test several cases.
test suite
A test suite comprises a group of test scripts, and thus their test cases. All the tests for a system can be grouped into one suite, or organized into several suites.

Coverage

Given that it impossible to test all the behaviors of a system, how can we intelligently allocate the possible testing to the cases that will give the greatest benefit? There are a number of ways of considering this. Most work from a specific set of possibilities and strive to cover that set economically and effectively although incompletely, so in general the allocation of test cases and resources is called coverage.

Coverage of user priorities

An obvious approach is to list the behaviors that are most important to the users, either the actual users or a hypothetical population of users with specified characteristics. First one identifies the behaviors with the highest priority; then one sets up test cases for those behaviors. If there is not enough time or resources to do all the tests, the higher priority behaviors are tested and the lower priority ones are not.

Coverage of requirements

A second approach is to cover the requirements to some degree. The criteria of coverage vary depending on the form in which the requirements are stated. If the requirements are a set of goals or properties, then an appropriate degree of coverage might be at least one test case per goal or requirement; if the requirements are in the form of a model, then coverage might be calculated in terms of the nodes and edges of the model that are covered by at least one test case.

Coverage of Code

A given degree of coverage of code is achieved by a test suite based on how much of the tested program is exercised by the suite. Test coverage of code is appealing for a number of reasons. Since it is the code that is being tested, applying test cases that cover the code in some sense is compelling (although not exclusively so). Because the theory of programming languages is relatively advanced, it is possible to formally state many coverage criteria and reason about them formally. Since software development is now typically done using IDEs such as Eclipse, there is the possibility of integrating the measurement of code coverage into the IDE with which the code is developed. Finally, some of the useful code coverage criteria are fairly intuitive for developers (although not all of them are).

It may not be immediately obvious that there could be more than one criterion for code coverage; one might think either it's covered or it isn't. But since the strongest criterion (All-Paths) typically involves an infinite number of test cases, researchers have examined many weaker but more easily achieved criteria. These criteria have been proved to be related in a subsumption hierarchy, shown in Figure 8 [Clarke et al. 1989], in which the various criteria form a partial order.

The partial ordering means that some criteria are stronger than or equivalent to others, but not all are related in this way. Each criterion that is connected by downward lines to another criterion D lower in the figure is stronger than or equivalent to D, or subsumes D. But there are pairs of criteria C and D that are incomparable, that is, C is neither stronger than D, weaker than D, nor equivalent to D. For example, All-Uses subsumes All-Defs, because there is a path from All-Uses through All-C-Uses/Some-P-Uses down to All-Defs (there is a second path as well). But All-Defs and All-Edges are not comparable (even though All-Edges is lower in the diagram than All-Defs) because there is no downward path from All-Defs. Thus it is not the case that All-Defs subsumes All-Edges, nor that All-Edges subsumes All-Defs; they are incomparable. See Ordered Sets for more explanation and examples of partial orders and incomparability.

Clarke et al. Fig 10

Figure 8. Clarke et al. Fig. 10

Coverage subsumption

Figure 9. Subsumption
of some simple criteria

In terms of code coverage, a test suite that achieves a specific degree of code coverage is considered to be more complete (and thus better) than another suite that does not achieve that degree of coverage. The more-complete suite will test all parts of the program that the less-complete one, plus possibly some more in addition.

Some of the more intuitive criteria are shown in Figure 9 and discussed below.

All-Paths

All-Paths coverage is achieved when all possible paths through the code have been covered. All-Paths coverage is the strongest coverage criterion.

All-Paths is impractical for all but simple programs, since it requires coverage of all possible numbers of iterations of each loop (can be unbounded) and all possible combinations of branches (can be 2n for n branches).

All-Nodes

All-Nodes coverage is achieved when all reachable nodes of the program's control flow graph have been covered, in other words when every reachable statement of the program has been exercised. All-Nodes is the weakest coverage criterion; there is no standard coverage criterion that is subsumed by All-Nodes.

All-Edges

All-Edges coverage is achieved when all edges of the program's control flow graph have been covered, in other words when every single-step way to get from one statement of the program to another has been exercised. As seen in Figure 9, All-Edges is subsumed by All-Uses but is incomparable with All-Defs.

All-Uses, All-Defs, and All-DU-Paths

All-Uses, All-Defs, and All-DU-Paths coverage are defined in terms of data flow analysis, in which a def of variable x is a statement that assigns (or may assign) a value to x, and a use of x is a statement that uses (or may use) a value of x. Each def of x corresponds to a (possibly empty) set of possible uses of x in which the value used is the one assigned by that def.

All-Defs coverage is achieved when all defs of any variable have been covered. Defs whose sets of uses are empty are typically not counted for this criterion.

All-Uses coverage is more complex. All-Uses coverage is achieved when a path from each def to each use of that def has been exercised. If a use is in the condition of a conditional or loop, then in addition All-Uses coverage requires paths for which both values of the condition are covered. A little thought will show that All-Uses subsumes All-Defs: covering all paths from each def to one of its uses will necessarily cover all defs that have uses.

All-DU-Paths coverage is similar to All-Uses, but instead of at least one path from each def to each use of that def, includes all paths from each def to each use of that def.

Rule-of-thumb coverage

Practitioners have accumulated a number of informal rules of thumb for the minimum degree of coverage for a test suite.

  1. At least one test case for every variable value that is treated uniquely.  For example, null is often treated differently than all other values, so each such variable should have a test case that results in null as its value.
  2. At least one test case for every subrange of a variable that is treated differently than other subranges.  For example, positive numbers might be treated differently than negative ones, indicating at least one test case resulting in a positive value for the variable and at least one with a negative value.
  3. At least one test case for every boundary in a variable's range.  For example, when testing code that involves a 0-based string index, there should be a test case for index 0, the lower boundary between valid indexes and invalid ones, and a test case for index (length-1), the upper boundary. Good practice would also indicate at least one value nearby on each side, such as -1 or -2 and 1 or 2 for the boundary at 0.

Organization of Testing

Experience has shown that testing should be done from the bottom up, so that the smaller units of a system are tested before any larger units that contain them, and so on up recursively. Otherwise it is too difficult to identify and locate the problems that cause test failures.

In terms of the usual units of software development, this means that each class should be tested, and pass all its tests, before the module or unit of work (consisting of several classes) that contains it; each module before the subsystem (consisting of several modules) that contains it; and each subsystem before the entire system. (If the system is big enough to have several levels of subsystems, each smaller subsystem is tested before the subsystem that contains it, for as many levels as exist.) 

Figure 10 illustrates this organization of the testing process. This is often called the V model or V diagram, for obvious reasons. It illustrates how each level of testing, from unit tests up to system tests, tests the piece of code in question against its specification, accomplishing the verification of the code against the specification. At the top of the diagram are the stakeholder needs and the deployed system, running in the social and organizational context in which it is used; in this context the analogue of testing is validation, the process of identifying the extent to which the deployed system meets the stakeholder needs. Validation tests, in a sense, that the system specification is appropriate for the goals of the stakeholders. However the validation can't be done by running tests, since the measure of validity is the degree to which the stakeholders are satisfied.

v-model

Figure 10. The V model, relating specifications, implementations, and tests

References

L. A. Clarke, A. Podgurski, D. J. Richardson, and S. J. Zeil. A formal evaluation of data flow path selection criteria. IEEE Transactions on Software Engineering, 15(11):1318–1332, 1989.
Abstract
doi:10.1109/32.41326

flip bgunflip