(Under Construction)
There is a vast body of research and practical technique on software testing; this page sketches an overview.
Ideally, a piece of software would do everything that its interface requires and nothing it forbids, and the tests for the software would test that it does everything it should and nothing it shouldn't.
Figure 1. Everything required is permitted
Figure 2. A different selection of permitted behaviors is required
Figure 3. Everything permitted is required
Figure 4. Inconsistent interface requiring some behaviors that are not permitted.
Even in our non-ideal world, an interface needs to not forbid anything it requires. Figure 1 illustrates an interface that permits everything it requires. In this and later figures, the green area represents all the behaviors that are permitted (= not forbidden), and the orange area represents all the behaviors that are required. Figure 1's interface is implementable in this sense.
The interface of Figure 1 gives the implementer some slack: there are green areas (permitted behaviors) that aren't required, so the implementer can choose whether or not to make the software do them.

Figure 2 shows a slightly different interface, also implementable (in this sense), with the same boundary between permitted and forbidden but a different selection of required behaviors. Here the developer has less slack than in Figure 1.
Figure 3 shows a completely strict interface. Everything that is permitted is required; there is no green area showing. The interface is still implementable (in this sense), in that there is no behavior that it both requires and forbids.
Figure 4 shows an inconsistent interface, one that requires behaviors that are outside the boundary of permitted behaviors. This interface is not implementable (in this sense or any other).
Figure 5. The situation we aim for
Figure 6. A module not meeting its interface
Now, how to test a module effectively against its interface? Figures 5 and 6 illustrate the two situations we hope to distinguish by testing. Figure 5 represents the module's behavior (magenta) within the set of all conceivable behaviors (light blue). The module does everything its interface requires, plus some extra things that aren't required but are still permitted. The module does nothing that its interface does not permit. Everyone is happy; the users of the module are satisfied.
It is an article of faith in the software development world that, except perhaps for the very simplest software, this never happens; and if by chance it did happen, there would be no way to detect that it had. There are too many behaviors a module can enact, and too many things that can go wrong during development, so some always do.
We typically get something like Figure 6, in which the module does most of what is required and only a little of what is not permitted.
Because there are so many possible things a module can do, usually an infinite number of sequences of behaviors, it is impossible to test whether the module is like Figure 5 or like Figure 6.
Instead we hope to use testing to identify problems and fix them, bringing the situation asymptotically closer to Figure 5, and to verify that the most important things the module should do actually happen, and that the most important things the module should never do don't happen in the most frequent situations.
Figure 7. Allocating the finite number of tests
Figure 7 suggests the choices and challenges of testing. Only a finite number of tests can be set up, performed, and interpreted — an infinite number would take forever. Thus the available number of tests must be allocated to the cases where they will do the most good. In the figure, five test cases have been allocated to verify required behavior (the Ts in the orange area), and five more to verify that specific forbidden behaviors do not occur (the Ts outside the green area).
One U has been placed in the green area but outside the orange area, illustrating that there is no benefit to testing some behaviors. The U (for Unnecessary) case can only tell us whether the behavior it tests occurs or doesn't occur. Since the green area outside the orange area is permitted behavior, there is no benefit to showing that the behavior did occur (if that is how the test turned out); and since that area is not required behavior, there is no benefit to showing that it did not occur (if that is how the test turned out). A test case in this area is not helpful either way.
Note that of the five tests of forbidden behavior, one (the one in the magenta area representing what the module does) has very valuably identified a problem, and the other four have confirmed that those four behaviors do not occur. There is no hope of testing that the module never does any of the (almost always) infinite number of forbidden behaviors. Similarly, of the five tests of required behaviors, one has very valuably identified a problem (which one?), and the other four have confirmed that those required behaviors are present. But there is no hope of testing that all the (typically infinite number of) required behaviors will occur in the (almost always infinite number of) contexts they are required in.
The connection between a failed test and the problem(s) in the code that caused it is not necessarily simple or easy to detect. The concepts needed to think about this are:
A program contains faults. When the program runs, a fault may cause an error. This error may eventually result in a failure.
Ordinarily, only failures are visible. Testing is geared towards identifying as many failures as possible. Once a failure has been found, the developer has to work backwards to try to identify the error(s) that resulted in the failure, and then the fault(s) that enabled the error to occur.
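To make the chain concrete, here is a minimal sketch in Python; the function and its fault are hypothetical, invented for illustration:

```python
def average(values):
    """Return the mean of a non-empty list of numbers."""
    total = 0
    for v in values:
        total += v
    # FAULT: the divisor is off by one; it should be len(values).
    return total / (len(values) - 1)

# The fault produces an ERROR (a wrong internal value) when the
# function runs, and here the wrong value propagates to the caller,
# becoming a visible FAILURE.
print(average([2, 4, 6]))   # prints 6.0 instead of the correct 4.0
```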
A little thought should show you that
Some faults are in code that is not reachable (due to mistakes in the developers' design of the program); if the program evolved so that the faulty code became reachable, those faults would cause errors. Other faults are in code that is reachable, but because the testing was insufficiently thorough and the users of the program don't exercise it in all ways, that code never runs.
Some errors are bumped back to correct states later in the program's execution before resulting in any failure. The code could be fortuitously set up so that this has to happen every time, or it may be (as above) that the particular test cases and patterns of usage that the program actually undergoes happen to bump it back, so that no failure occurs and the error is never made visible.
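A minimal sketch of such masking, again with a hypothetical function:

```python
def clamp_percent(x):
    """Convert a fraction to a percentage, capped at 100."""
    # FAULT: the scale factor should be 100, not 1000.
    p = x * 1000
    # ERROR: at this point p is ten times too large for any x.
    if p > 100:
        p = 100   # for x >= 1.0 the cap restores the correct answer,
                  # bumping the error back before it becomes a failure
    return p

print(clamp_percent(2.0))   # 100, same as the correct code: error masked
print(clamp_percent(0.5))   # 100 instead of 50.0: the error becomes a FAILURE
```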
There is a wide range of opinions on how frequently these things happen. One well-argued estimate, based on careful analysis of code bases, states that on average only about 10% of faults ever cause errors, and only about 10% of those errors ever cause failures. But other researchers' estimates range as high as 90%.
There is no one standard definition of the terms and concepts of testing. The following terms are relatively standard:
Given that it is impossible to test all the behaviors of a system, how can we intelligently allocate the available tests to the cases that will give the greatest benefit? There are a number of ways of considering this. Most work from a specific set of possibilities and strive to cover that set economically and effectively, although incompletely, so in general the allocation of test cases and resources is called coverage.
An obvious approach is to list the behaviors that are most important to the users, either the actual users or a hypothetical population of users with specified characteristics. First one identifies the behaviors with the highest priority; then one sets up test cases for those behaviors. If there is not enough time or resources to do all the tests, the higher priority behaviors are tested and the lower priority ones are not.
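As a sketch, with hypothetical behaviors, priorities, and budget, this allocation amounts to sorting by priority and spending the test budget from the top:

```python
# Hypothetical behaviors, each with a priority (1 = most important).
behaviors = [
    ("login succeeds with valid credentials", 1),
    ("saved file can be reopened",            1),
    ("report totals are correct",             2),
    ("window remembers its last position",    3),
]

budget = 3  # only three test cases can be set up and run
for name, priority in sorted(behaviors, key=lambda b: b[1])[:budget]:
    print(f"test (priority {priority}): {name}")
# the lowest-priority behavior goes untested
```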
A second approach is to cover the requirements to some degree. The criteria of coverage vary depending on the form in which the requirements are stated. If the requirements are a set of goals or properties, then an appropriate degree of coverage might be at least one test case per goal or requirement; if the requirements are in the form of a model, then coverage might be calculated in terms of the nodes and edges of the model that are covered by at least one test case.
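For the model-based case, here is a sketch of measuring edge coverage; the model and the test paths are hypothetical:

```python
# Edges of a (hypothetical) requirements model; each test case is a
# path through the model, and coverage counts the model edges that
# at least one test path traverses.
model_edges = {("idle", "running"), ("running", "paused"),
               ("paused", "running"), ("running", "idle")}

test_paths = [
    ["idle", "running", "idle"],
    ["idle", "running", "paused"],
]

covered = set()
for path in test_paths:
    covered.update(zip(path, path[1:]))

print(f"edge coverage: {len(covered & model_edges)}/{len(model_edges)}")
# edge coverage: 3/4 -- the ('paused', 'running') edge is untested
```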
Code coverage is measured by how much of the tested program is exercised by the test suite. Test coverage of code is appealing for a number of reasons. Since it is the code that is being tested, applying test cases that cover the code in some sense is compelling (although code coverage is not the only consideration). Because the theory of programming languages is relatively advanced, it is possible to state many coverage criteria formally and to reason about them formally. Since software development is now typically done using IDEs such as Eclipse, the measurement of code coverage can be integrated into the IDE with which the code is developed. Finally, some of the useful code coverage criteria are fairly intuitive for developers (although not all of them are).
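As an illustration of the mechanism, here is a minimal sketch of recording which lines execute, using Python's sys.settrace; a trace hook of this kind is the classic mechanism behind Python coverage tools, though real tools are far more elaborate:

```python
import sys

executed = set()

def tracer(frame, event, arg):
    # Record the line number of every 'line' event in traced frames.
    if event == "line":
        executed.add(frame.f_lineno)
    return tracer

def triage(x):
    if x < 0:
        return "negative"
    return "non-negative"

sys.settrace(tracer)
triage(5)            # exercises only the non-negative branch
sys.settrace(None)

# The line containing 'return "negative"' never appears in the set,
# so this one test does not achieve full statement coverage.
print(sorted(executed))
```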
It may not be immediately obvious that there could be more than one criterion for code coverage; one might think either it's covered or it isn't. But since the strongest criterion (All-Paths) typically involves an infinite number of test cases, researchers have examined many weaker but more easily achieved criteria. These criteria have been proved to be related in a subsumption hierarchy, shown in Figure 8 [Clarke et al. 1989], in which the various criteria form a partial order.
The partial ordering means that some criteria are stronger than or equivalent to others, but not all pairs are related in this way. Each criterion that is connected by downward lines to another criterion D lower in the figure is stronger than or equivalent to D, or subsumes D. But there are pairs of criteria C and D that are incomparable: C is neither stronger than D, weaker than D, nor equivalent to D. For example, All-Uses subsumes All-Defs, because there is a path from All-Uses through All-C-Uses/Some-P-Uses down to All-Defs (there is a second path as well). But All-Defs and All-Edges are not comparable (even though All-Edges is lower in the diagram than All-Defs) because there is no downward path from All-Defs to All-Edges. Thus it is not the case that All-Defs subsumes All-Edges, nor that All-Edges subsumes All-Defs; they are incomparable. See Ordered Sets for more explanation and examples of partial orders and incomparability.
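Since subsumption is reachability by downward paths, it can be sketched as a small graph search. The edge list below is an approximate, partial rendering of Figure 8's hierarchy, not a faithful transcription:

```python
# "Directly subsumes" edges (approximate rendering of Figure 8).
subsumes_directly = {
    "All-Paths":              ["All-DU-Paths"],
    "All-DU-Paths":           ["All-Uses"],
    "All-Uses":               ["All-C-Uses/Some-P-Uses",
                               "All-P-Uses/Some-C-Uses"],
    "All-C-Uses/Some-P-Uses": ["All-Defs"],
    "All-P-Uses/Some-C-Uses": ["All-Defs", "All-P-Uses"],
    "All-P-Uses":             ["All-Edges"],
    "All-Edges":              ["All-Nodes"],
}

def subsumes(stronger, weaker):
    """True if 'stronger' subsumes 'weaker' via some downward path."""
    if stronger == weaker:
        return True
    return any(subsumes(nxt, weaker)
               for nxt in subsumes_directly.get(stronger, []))

print(subsumes("All-Uses", "All-Defs"))    # True (two downward paths)
print(subsumes("All-Defs", "All-Edges"))   # False } incomparable:
print(subsumes("All-Edges", "All-Defs"))   # False } neither subsumes
```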
Figure 8. Clarke et al. Fig. 10
Figure 9. Subsumption of some simple criteria
In terms of code coverage, a test suite that achieves a specific degree of code coverage is considered more complete (and thus better) than another suite that does not achieve that degree of coverage. The more-complete suite will test all the parts of the program that the less-complete one does, plus possibly some more.
Some of the more intuitive criteria are shown in Figure 9 and discussed below.
All-Paths coverage is achieved when all possible paths through the code have been covered. All-Paths coverage is the strongest coverage criterion.
All-Paths is impractical for all but simple programs, since it requires coverage of all possible numbers of iterations of each loop (which can be unbounded) and all possible combinations of branches (of which there can be 2^n for n branches).
All-Nodes coverage is achieved when all reachable nodes of the program's control flow graph have been covered, in other words when every reachable statement of the program has been exercised. All-Nodes is the weakest coverage criterion; there is no standard coverage criterion that is subsumed by All-Nodes.
All-Edges coverage is achieved when all edges of the program's control flow graph have been covered, in other words when every single-step way to get from one statement of the program to another has been exercised. As seen in Figure 9, All-Edges is subsumed by All-Uses but is incomparable with All-Defs.
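A small worked example (the function is hypothetical) of why All-Edges is stronger than All-Nodes: a single test can execute every statement yet leave a control-flow edge unexercised.

```python
def withdraw(balance, amount):
    if amount > balance:        # node A
        amount = balance        # node B
    return balance - amount     # node C

# One test covers all nodes A, B, and C ...
withdraw(10, 50)    # takes the A->B edge, then B->C

# ... but the A->C edge (the 'if' not taken) is still unexercised.
# All-Edges coverage additionally requires something like:
withdraw(50, 10)    # takes the A->C edge
```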
All-Uses, All-Defs, and All-DU-Paths coverage are defined in terms of data flow analysis, in which a def of variable x is a statement that assigns (or may assign) a value to x, and a use of x is a statement that uses (or may use) a value of x. Each def of x corresponds to a (possibly empty) set of possible uses of x in which the value used is the one assigned by that def.
All-Defs coverage is achieved when all defs of any variable have been covered. Defs whose sets of uses are empty are typically not counted for this criterion.
All-Uses coverage is more complex. All-Uses coverage is achieved when at least one path from each def to each use of that def has been exercised. If a use is in the condition of a conditional or loop, then All-Uses coverage additionally requires paths on which both values of the condition are covered. A little thought will show that All-Uses subsumes All-Defs: covering a path from each def to each of its uses necessarily covers, for every def that has any uses, a path to at least one of them.
All-DU-Paths coverage is similar to All-Uses, but instead of at least one path from each def to each use of that def, it includes all paths from each def to each use of that def.
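A sketch of these definitions on a hypothetical function, marking the defs and uses of one variable and the test cases the criteria demand:

```python
def checkout(prices, member):
    total = sum(prices)        # def d1 of 'total'
    if member:                 # (a use of 'member')
        total = total * 0.9    # a use of d1, then def d2 of 'total'
    return round(total, 2)     # a use of d1 (member false) or of d2

# All-Defs: each def must reach at least one use.
#   checkout([10.0], True) covers d1 (used in the discount line)
#   and d2 (used in the return), so it alone achieves All-Defs.
# All-Uses: each def must reach each of its uses, so we also need
#   checkout([10.0], False) to exercise the d1 -> return pair.
```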
Practitioners have accumulated a number of informal rules of thumb for the minimum degree of coverage for a test suite. One such rule concerns boundaries in the input domain: test values at each boundary and nearby on each side, such as -1 or -2 and 1 or 2 for the boundary at 0.
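For instance, a boundary at 0 might be exercised like this (the classify function is hypothetical):

```python
def classify(n):
    return "negative" if n < 0 else "non-negative"

# Test at the boundary and nearby on each side of it.
for n, expected in [(-2, "negative"), (-1, "negative"),
                    (0, "non-negative"), (1, "non-negative"),
                    (2, "non-negative")]:
    assert classify(n) == expected, f"classify({n})"
print("boundary tests passed")
```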
Experience has shown that testing should be done from the bottom up, so that the smaller units of a system are tested before any larger units that contain them, and so on up recursively. Otherwise it is too difficult to identify and locate the problems that cause test failures.
In terms of the usual units of software development, this means that each class should be tested, and pass all its tests, before the module or unit of work (consisting of several classes) that contains it; each module before the subsystem (consisting of several modules) that contains it; and each subsystem before the entire system. (If the system is big enough to have several levels of subsystems, each smaller subsystem is tested before the subsystem that contains it, for as many levels as exist.)
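In Python terms, a minimal sketch: the smallest unit (a hypothetical Stack class) gets its own unittest suite, which must pass before any larger unit that uses Stack is tested.

```python
import unittest

class Stack:
    """A tiny unit to be tested before anything that contains it."""
    def __init__(self):
        self._items = []
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop()

class TestStack(unittest.TestCase):
    def test_push_then_pop_returns_last_item(self):
        s = Stack()
        s.push(1)
        s.push(2)
        self.assertEqual(s.pop(), 2)

    def test_pop_on_empty_stack_raises(self):
        with self.assertRaises(IndexError):
            Stack().pop()

if __name__ == "__main__":
    unittest.main()
```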
Figure 10 illustrates this organization of the testing process. It is often called the V model or V diagram, for obvious reasons. It shows how each level of testing, from unit tests up to system tests, tests the piece of code in question against its specification, accomplishing the verification of the code against the specification.

At the top of the diagram are the stakeholder needs and the deployed system, running in the social and organizational context in which it is used. In this context the analogue of testing is validation, the process of identifying the extent to which the deployed system meets the stakeholder needs. Validation checks, in a sense, that the system specification is appropriate for the goals of the stakeholders. However, validation cannot be done simply by running tests, since the measure of validity is the degree to which the stakeholders are satisfied.
Figure 10. The V model, relating specifications, implementations, and tests