International Advanced School of Empirical Software Engineering


Replication and Aggregation of Software Engineering Experiments

Chair: Professor Natalia Juristo, Universidad Politécnica de Madrid, Spain

1.- Background

In its early days, Empirical SE (ESE) focused on studying the application of the principles of the laboratory and experiment to SE. Twenty years later, running lab experiments in ESE is a fairly well understood task. But running isolated experiments is just one step of the experimental paradigm. Other principles of experimentalism remain to be analysed and adapted to SE: Experiment reporting; Replication; Systematic reviews; Aggregation, etc.

This year, the International Advanced School on ESE (IASESE'08) focuses on two of these aspects, which are closely related: Replication and Aggregation. That is, how to aggregate results in order to combine the findings of replications. The School will consist of three parts:

  • Part I. Aggregation Methods for ESE
  • Part II. Different Types of Replications, Different Goals of Aggregation
  • Part III. Student Exercises

2.- Part I: Aggregation techniques for ESE

Aggregation should be taken to mean the combination of the results of more than one experiment to generate pieces of knowledge that can be used in practice to develop software. The fact that one experiment yields certain results should not be taken as evidence enough to consider these results as a proven fact. Taken separately, experiments provide partial results, while conclusive results or evidence can be gained by accumulating partial results.

For example, one set of replications might have found that testing technique A is more effective at detecting defects than technique B, whereas another set of replications might have found no differences between the two techniques. What evidence can be gathered from these results? There are many reasons for two replications producing contradictory results. The experimental methodology (measurement errors, factor control, randomization, masking, etc.) could be behind some, but contextual factors (project type, participants involved, application domain, etc.) could cause others.

Aggregation has been studied primarily in established experimental disciplines like medicine. The solution in these cases is to use meta-analysis [Cooper 94]. The introduction of meta-analysis as a method of aggregation completely revolutionized medical methodology. It prompted the appearance of "evidence-based medicine", which transformed medical practice [Straus 05]. In view of its success, it has been proposed that the procedures used in medicine should be transferred to ESE [Kitchenham 04]. These proposals (where aggregation is part of a process of systematic review) have been used successfully in the steps focusing on reviewing the characteristics of a set of experiments (as is the case of [Dyba 06] or [Hannay 07]).

Unfortunately, meta-analysis imposes two conditions that many SE experiments can hardly meet: it requires a large number of replications, and the experiment reports must describe certain parameters without which aggregation is impossible. Experimental practice in SE (being a relatively new and so immature discipline) is characterized by both a small number of replications and deficient experiment reporting. At present, this reality prevents the broad use of meta-analysis as the main method for aggregating experiments in ESE. In fact, most attempts to combine the results of SE experiments using meta-analysis (for instance, [Miller 00] or [Pickard 98]) turned out to be impracticable, due either to differences in how the replications were run or to missing information because of poor experiment reporting. In other words, the attempts at aggregation carried out using meta-analysis confirm the need to envisage other, better suited techniques.
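To make the reporting requirement concrete, the following is a minimal sketch of a fixed-effect, inverse-variance meta-analysis of standardized mean differences. It is one common formulation, not necessarily the one taught at the School. The choice of effect size (Cohen's d), the function names and all the numbers are assumptions made for illustration. The point it illustrates is that each replication must report group means, standard deviations and sample sizes; if any of these is missing, the replication cannot be included.

```python
import math


def standardized_mean_difference(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Return Cohen's d for technique A vs. B and its approximate variance.

    Every argument must come from the experiment report; this is the
    information that is often missing in practice.
    """
    pooled_sd = math.sqrt(
        ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    )
    d = (mean_a - mean_b) / pooled_sd
    # Common large-sample approximation of the sampling variance of d.
    var_d = (n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b))
    return d, var_d


def fixed_effect_pool(effects):
    """Combine (effect, variance) pairs using inverse-variance weights.

    Returns the pooled effect and its standard error.
    """
    weights = [1.0 / variance for _, variance in effects]
    total_weight = sum(weights)
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return pooled, standard_error


# Hypothetical replications: defects found with technique A vs. technique B.
replications = [
    standardized_mean_difference(10.0, 8.0, 2.0, 2.0, 10, 10),
    standardized_mean_difference(9.0, 8.5, 2.5, 2.0, 12, 12),
    standardized_mean_difference(11.0, 8.0, 3.0, 3.0, 8, 9),
]
pooled_effect, pooled_se = fixed_effect_pool(replications)
print(f"pooled d = {pooled_effect:.2f} (standard error {pooled_se:.2f})")
```

With only three small replications the standard error stays large, which illustrates the first condition: a handful of studies rarely yields a precise pooled estimate.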

Fortunately, SE is not the only discipline where the concepts of meta-analysis used in medicine are not applicable. Disciplines like ecology or the social sciences have developed their own procedures of aggregation that are less stringent than meta-analysis. However, because these procedures require less information, their findings are also less precise. In other words, using techniques less sophisticated than meta-analysis has a cost in terms of the reliability of the findings. Even so, for decision making it is better to have a finding, even a less reliable one, than to have nothing to go on at all.
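As an illustration of what a less demanding procedure can look like, the sketch below implements vote counting with a sign test. The text above does not name specific techniques, so choosing vote counting here is an assumption, as are the function name and the sample outcomes. It needs only the direction of each replication's result, which explains both why it tolerates poor reporting and why its findings are less precise: effect magnitudes and sample sizes are ignored.

```python
from math import comb


def vote_count(outcomes):
    """Count replication outcomes and run a two-sided sign test.

    outcomes: list of '+' (A better), '-' (B better) or '0' (no difference).
    Ties ('0') are dropped, as is usual for a sign test.
    Returns (wins for A, wins for B, two-sided p-value).
    """
    wins_a = outcomes.count("+")
    wins_b = outcomes.count("-")
    n = wins_a + wins_b
    if n == 0:
        return wins_a, wins_b, 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * upper_tail)
    return wins_a, wins_b, p_value


# Hypothetical directions reported by six replications.
wins_a, wins_b, p_value = vote_count(["+", "+", "+", "+", "+", "0"])
print(wins_a, wins_b, p_value)
```

Five results in the same direction still do not reach the conventional 0.05 threshold in this sketch, which shows the price paid in reliability when only directions are available.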

3.- Part II. Different Types of Replications, Different Goals of Aggregation

Aggregation is closely related to replication. Replication means other researchers repeating an experiment in order to check earlier results. Replications are necessary so that results from several experiments are available to be aggregated and to generate evidence.

Replication is still not deeply understood in ESE. The context of an SE experiment is extremely complex because there are so many variables involved in the phenomenon under study and because software development (and, hence, software experiments) involves human beings. To replicate ESE experiments, not only do you need to know what techniques they examined, but also how the subjects applied the technique, how the subjects were taught, what previous knowledge the subjects had, as well as a host of other details. Because of these difficulties it is still almost impossible to get an exact replication of an SE experiment if it is run in another setting by other researchers. This lack of exact replications makes traditional aggregation methods hard to apply, and very often these inexact replications have been regarded as failed and useless.

It is difficult to predict beforehand the many factors that can alter the experiment results in a trial of an exact replication. For example, the effectiveness of a testing technique could depend on the time between when it was taught and when the experiment was run. Many other a priori unidentifiable variables like this can alter the experimental context, leading to inconsistent results from one replication to another.

The difficulty of controlling the context is not confined to SE, and other experimental disciplines face similar problems. To successfully replicate ESE experiments, it is necessary to move away from the concept of replication applied in the natural sciences and envisage approaches from other less strict experimental disciplines. In those disciplines, replication is considered to be the repetition of an experiment by other researchers in other environments with other samples in an attempt to reproduce an experiment as closely as possible [Judd 91].

Replications do not necessarily have to be exact copies of the original experiment; fairly accurate adaptations suffice. However, the results will be aggregated differently depending on how close the replications turn out to be: inconsistent results can provide an understanding of the robustness of the results as well as insight into what variables are behind the differences. If the results of the replications are inconsistent, the reasons for the inconsistencies should be analysed during aggregation to discover new biasing variables. This will generate pieces of knowledge that specify the circumstances under which that knowledge is applicable.

Going back to the testing techniques example, the steps would be to: identify that the time between when the technique was taught and when it was applied differed from one replication to the other; conjecture why the time between teaching and application might bias technique effectiveness; and run new experiments (or use other existing replications where that condition differs) to understand what impact it has.

So, by aggregating replications that were intended to be exact but turned out to differ, it is possible to discover new knowledge. But a thorough analysis should be conducted of the details of all conditions that changed in an endeavour to discover new biasing variables. The traditional aggregation methods do not support this kind of analysis. Other types of methods for combining the results of replications have to be used.
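One very simple starting point for such an analysis is to group replication results by a context condition that turned out to differ and compare the groups. The sketch below does this for the teaching-to-application gap from the testing example. The record layout, the variable name `teaching_gap`, the function name and all values are hypothetical. It is an exploratory aid for forming conjectures, not one of the aggregation methods referred to above.

```python
from collections import defaultdict
from statistics import mean


def effects_by_context(replications, variable):
    """Average the observed effect for each level of a context variable.

    replications: list of dicts with an 'effect' key plus context variables.
    Returns a dict mapping each level of `variable` to the mean effect.
    """
    groups = defaultdict(list)
    for replication in replications:
        groups[replication[variable]].append(replication["effect"])
    return {level: mean(values) for level, values in groups.items()}


# Hypothetical replications of the testing-technique experiment.
# 'effect' is the advantage of technique A over B in each replication.
replications = [
    {"site": "S1", "teaching_gap": "short", "effect": 1.0},
    {"site": "S2", "teaching_gap": "short", "effect": 0.5},
    {"site": "S3", "teaching_gap": "long", "effect": 0.25},
    {"site": "S4", "teaching_gap": "long", "effect": 0.0},
]
print(effects_by_context(replications, "teaching_gap"))
```

A visible gap between the groups only suggests a candidate biasing variable. As the example above notes, new experiments or further replications are still needed to understand its impact.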

4.- Part III. Student Exercises

The School will end with a participatory exercise to apply the concepts and techniques taught in Parts I and II. Each attendee will be able to choose between two topics for the exercise: Aggregation techniques or Generation of variables from replications. Attendees will be divided into groups, each led by one tutor. The number of tutors will allow us to break up into groups small enough to give personal attention to each participant's experiences and questions. Tutors will support participants in applying the material taught. Groups will report back to the larger group.

5. References

5.1. Bibliography referenced in the text

Cooper, Hedges (1994) The Handbook of Research Synthesis Russell Sage Foundation
Dyba, Kampenes, Sjoberg (2006) A systematic review of statistical power in SE experiments Information and Software Technology 48(8)
Hannay, Sjøberg, Dybå (2007) A systematic review of theory use in SE experiments IEEE Trans. on SE 33(2)
Jørgensen (2004) A review of studies on expert estimation of software development effort J. of Systems and Software 70(1-2)
Judd, Smith, Kidder (1991) Research Methods in Social Relations Jovanovich College Publishers
Kitchenham (2004) Procedures for performing systematic reviews. Keele University TR/SE-0401
Miller (2000) Applying meta-analytical procedures to SE experiments J. of Systems and Software 54(1)
Pickard, Kitchenham, Jones (1998) Combining empirical results in SE Information and Software Technology 40(14)
Straus, Richardson, Glasziou, Haynes (2005) Evidence-based Medicine. How to practice and teach EBM Elsevier

5.2. Other bibliography related to the topic

Basili, Shull, Lanubile (1999) Building knowledge through families of experiments IEEE Trans. on SE 25 (4)
Brooks, Roper, Wood, Daly, Miller (2007) Replication's Role in SE. In Shull, Singer, Sjoberg (Eds.) Guide to Advanced Empirical SE Springer-Science
Davis, Dieste, Hickey, Juristo, Moreno (2006) Effectiveness of requirements elicitation techniques: Empirical results derived from a systematic review Proc. of the IEEE Int. Conf. on Requirements Eng.
Jedlitschka, Ciolkowski (2004) Towards evidence in SE Proc. of ACM/IEEE Int. Symp. on Empirical SE
Juristo, Moreno (2001) Basics of Software Engineering Experimentation Kluwer
Juristo, Moreno, Vegas (2002a) A survey on testing technique empirical studies: How limited is our knowledge Proc. of the ACM/IEEE Int. Symp. on Empirical SE
Juristo, Moreno (2002b) Reliable knowledge for software development IEEE Software 19(5)
Kitchenham, Pfleeger, Pickard, Jones, Hoaglin, El Emam, Rosenberg (2002) Preliminary guidelines for empirical research in SE. IEEE Transactions on SE 28(8)
Miller (1999) Can results from SE experiments be safely combined? Int. Software Metrics Symp.
Miller (2005) Replicating SE experiments: A poisoned chalice or the holy grail Information and Software Technology 47
Pfleeger (1999) Albert Einstein and empirical SE IEEE Computer 32(10)
Runeson, Thelin (2004) Prospects and limitations for cross-study analyses: A study on an experiment series Proc. of the Int. Workshop on Empirical SE
Shull, Basili, Carver, Maldonado, Travassos, Mendonca, Fabbri (2002) Replicating SE experiments: Addressing the tacit knowledge problem Proc. of the ACM/IEEE Int. Symp. on Empirical SE
Shull, Carver, Travassos, Maldonado, Conradi, Basili (2003) Replicated studies: Building a body of knowledge about software reading techniques In Juristo and Moreno (Eds.) Lecture Notes on Empirical SE World Scientific
Vegas, Juristo, Moreno, Solari, Letelier (2006) Analysis of the influence of communication between researchers on experiment replication Proc. of the ACM/IEEE Int. Symp. on Empirical SE
Wohlin, Runeson, Höst, Ohlsson, Regnell, Wesslén (2000) Experimentation in SE: An Introduction Kluwer
Yin, Heald (1975) Using the case survey method to analyze policy studies Administrative Science Quarterly 20(3)
International Advanced School of Empirical Software Engineering 2008 - Kaiserslautern