This year, the International Advanced School on ESE (IASESE'08) focuses on two of these aspects, which are closely related: Replication and Aggregation. That is, how to combine the findings of replications by aggregating their results. The School will consist of three parts:
For example, one set of replications might have found that testing technique A is more effective at detecting defects than technique B, whereas another set might have found no difference between the two techniques. What evidence can be gathered from such results? There are many reasons why two replications can produce contradictory results. The experimental methodology (measurement errors, factor control, randomization, masking, etc.) could be behind some, whereas contextual factors (project type, participants involved, application domain, etc.) could cause others.
Aggregation has been studied primarily in established experimental disciplines such as medicine, where the solution is to use meta-analysis [Cooper 94]. The introduction of meta-analysis as an aggregation method completely revolutionized medical methodology: it prompted the appearance of "evidence-based medicine", which transformed medical practice [Straus 05]. In view of this success, it has been proposed that the procedures used in medicine be transferred to ESE [Kitchenham 04]. These proposals (where aggregation is part of a systematic review process) have been applied successfully in the steps that review the characteristics of a set of experiments (as in [Dyba 06] or [Hannay 07]).
Unfortunately, meta-analysis has two requirements that many SE experiments hardly meet: it needs a large number of replications, and the experiment reports must describe certain parameters without which aggregation is impossible. Experimental practice in SE (a relatively young and therefore immature discipline) is characterized by both a small number of replications and deficient experiment reporting. This currently prevents a broad use of meta-analysis as the main method for aggregating experiments in ESE. In fact, most attempts to combine the results of SE experiments using meta-analysis (for instance, [Miller 00] or [Pickard 98]) turned out to be impracticable, due either to differences in how the replications were run or to information missing because of poor experiment reporting. In other words, the attempts at aggregation carried out using meta-analysis confirm the need to devise other, better suited techniques.
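To make concrete which parameters a report must provide, classical fixed-effect meta-analysis can be sketched as follows. This is a minimal illustration, not a method prescribed by the School; the effect sizes and variances are invented, and aggregation fails precisely when a report omits these numbers:

```python
import math

def fixed_effect_meta(effects, variances):
    """Pool effect sizes, weighting each study by the inverse of its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled effect
    return pooled, se

# Three hypothetical replications comparing techniques A and B.
# Each report must state an effect size (e.g., a standardized mean
# difference) and its sampling variance; without them, no pooling.
effects = [0.40, 0.55, 0.10]
variances = [0.04, 0.09, 0.16]

pooled, se = fixed_effect_meta(effects, variances)
ci = (pooled - 1.96 * se, pooled + 1.96 * se)  # approx. 95% confidence interval
print(pooled, se, ci)
```

The sketch also shows why a small number of replications hurts: with only a handful of studies, the pooled standard error stays wide and the confidence interval may not exclude zero.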
Fortunately, SE is not the only discipline where the meta-analysis concepts used in medicine are inapplicable. Disciplines like ecology or the social sciences have developed their own aggregation procedures, which are less stringent than meta-analysis. However, because these procedures require less information, their findings are also less precise: using less sophisticated techniques than meta-analysis has a cost in terms of the reliability of the findings. Even so, for decision making it is better to have a finding, albeit a less reliable one, than to have nothing to go on at all.
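One such less stringent procedure from the social sciences is vote counting, sketched below as an assumed illustration (the per-replication verdicts are invented): each replication contributes only the direction of its result, so no effect sizes or variances are needed, at the cost of discarding magnitude information.

```python
from collections import Counter

def vote_count(outcomes):
    """Tally per-replication verdicts: 'A', 'B', or 'none' (no difference)."""
    tally = Counter(outcomes)
    winner, votes = tally.most_common(1)[0]
    return winner, votes, len(outcomes)

# Invented verdicts from five replications comparing techniques A and B:
outcomes = ["A", "A", "none", "A", "B"]
winner, votes, total = vote_count(outcomes)
print(f"{winner} favoured in {votes} of {total} replications")
```

The trade-off described above is visible here: the tally works even with sparse reporting, but it cannot say how much better the winning technique is, only how often it won.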
Replication is still not deeply understood in ESE. The context of a SE experiment is extremely complex, because so many variables are involved in the phenomenon under study and because software development (and, hence, software experimentation) involves human beings. To replicate an ESE experiment, you need to know not only what techniques it examined, but also how the subjects applied each technique, how the subjects were taught, what previous knowledge the subjects had, and a host of other details. Because of these difficulties, it is still almost impossible to achieve an exact replication of a SE experiment run in another setting by other researchers. This lack of exact replications makes traditional aggregation methods hard to apply, and such inexact replications have very often been considered failed and useless.
It is difficult to predict beforehand the many factors that can alter the results when an exact replication is attempted. For example, the effectiveness of a testing technique could depend on the time elapsed between when it was taught and when the experiment was run. Many other a priori unidentifiable variables like this can alter the experimental context, leading to inconsistent results from one replication to another.
The difficulty of controlling the context is not confined to SE; other experimental disciplines face similar problems. To replicate ESE experiments successfully, it is necessary to move away from the concept of replication applied in the natural sciences and adopt approaches from other, less strict experimental disciplines. In those disciplines, replication is considered to be the repetition of an experiment by other researchers, in other environments, with other samples, in an attempt to reproduce the experiment as closely as possible [Judd 91].
Replications do not necessarily have to be exact copies of the original experiment; fairly accurate adaptations suffice. But the results will be aggregated differently depending on how close the replications turn out to be: inconsistent results can provide an understanding of how robust the findings are, as well as insight into which variables are behind the differences. If the results of the replications are inconsistent, the reasons for the inconsistencies should be analysed during aggregation to discover new biasing variables. This will generate pieces of knowledge that specify the circumstances under which that knowledge is applicable.
Going back to the testing techniques example: identify that the time between when the technique was taught and when it was applied differed from one replication to another; conjecture why the lapse between teaching and application might bias technique effectiveness; and run new experiments (or use other existing replications where that condition differs) to understand what impact it has.
So, by aggregating replications that were intended to be exact but turned out to differ, it is possible to discover new knowledge. This requires a thorough analysis of all the conditions that changed, in an endeavour to discover new biasing variables. Traditional aggregation methods do not support this kind of analysis; other types of methods for combining replication results are needed.