Chapter 6. Evaluation Research
Ian Norman and Charlotte Humphrey
▪ What is evaluation?
▪ Origins
▪ Evaluation and research
▪ Evaluating complex interventions
▪ Realistic evaluation
▪ Conclusion
What is evaluation?
Summative evaluation
The most familiar and celebrated mode of evaluation in the health arena is the experimental study, ideally in the form of the randomised controlled trial (Chapter 18), whose purpose is to answer questions about effectiveness in respect of a specific intervention. The question at issue in such a study is, does it work? The criteria for effectiveness clearly vary according to what the intervention is intended to achieve. Evaluation may focus, for example, on the success of a treatment for head lice, or the acceptability to patients and staff of a new appointments system.
Sometimes interventions are compared with one another in respect of several different parameters of effectiveness at the same time. The question then is: which works better? For example, an economic evaluation study may assess the relative costs and benefits (in terms of efficacy or other desirable features) of two different modes of treatment of the same condition. A third form of evaluative question is, does it work well enough? For example, decisions about whether to offer a new screening test to a particular at-risk population group may depend on assessment of the likely health gain, taking into account the consequences of false positive and false negative results, likely level of take-up and risks associated with the test itself. In other cases, findings from evaluation studies may be used to inform decisions about how to distribute resources between different areas of care. The question then is whether the benefits gained from investing in one particular activity are sufficient to justify the opportunity costs.
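As a purely hypothetical illustration of the ‘which works better?’ question, suppose treatment A costs £400 per patient and clears the condition in 80% of cases, while treatment B costs £250 and clears it in 70% of cases (the figures are invented for the sake of the example). A standard way of combining cost and effect in economic evaluation is the incremental cost-effectiveness ratio: ICER = (cost of A − cost of B) / (effectiveness of A − effectiveness of B) = (£400 − £250) / (0.80 − 0.70) = £1500 per additional case successfully treated. Decision makers must then judge whether an extra case treated is worth £1500 of resources that could otherwise be spent elsewhere, which is precisely the opportunity-cost question raised above.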
In experimental studies, the outcomes of interest are specified in advance, reflecting the objectives of the intervention. However, it is increasingly recognised that interventions at any level are likely to have both intended and unanticipated effects. In some cases, the latter may prove decisive in judging whether the intervention is worthwhile or acceptable. Where such a possibility is anticipated, the first question may be a much more open one: simply, what are the effects of the intervention? Evaluation studies that set out to identify all significant outcomes of an intervention are necessarily more exploratory and descriptive, since they must collect information across a wide range of relevant parameters, some of which may not lend themselves to formal measurement. However, all investigations designed to assess or identify impacts may be grouped in the category of summative or outcome evaluation studies. They share the aim of informing decisions about, for example, whether an intervention or new way of working should be repeated, continued, extended or applied elsewhere. We consider the experimental approach to evaluation in more detail later in this chapter.
Illuminative evaluation
In traditional experiments the focus of attention is on specifying what is to be done and then measuring the effects. Relatively little attention is given to monitoring the actual process of implementation, since the procedures are presumed to be carried out as specified and any observed effects are assumed to be attributable to the intervention itself. This approach, in which the process of implementation remains largely unexamined, has been called the ‘black box’ model of evaluation. However, in a context like health care, where interventions to improve services are often relatively complex and take place in an uncontrolled environment, such assumptions cannot so easily be made.
In these circumstances it may be unclear how or why results came about, or to what they may be attributed. In recognition of these problems, there is increasing appreciation of the benefits of tracking the implementation process in detail, checking whether what is supposed to happen at each stage actually occurs and asking, if so, why, and if not, why not, so that the eventual outcomes can be better understood. This type of study is known as illuminative evaluation – it opens up the black box, casting light on how, why and in what circumstances particular outcomes are achieved. Such studies help to specify which elements in the process are essential to achieving success and what contextual features in the environment may facilitate or obstruct progress.
Formative evaluation
Elements of both the evaluative approaches just outlined are also frequently employed during the development of an intervention or programme of work, but for a different purpose: not to make judgements about its overall value or applicability elsewhere, but to improve and refine the ongoing implementation process. This monitoring and scrutiny of progress is called formative evaluation. Its role is to inform those involved in delivering and developing the intervention about how things are going and to identify which aspects need to be adjusted or more fundamentally reconsidered. While questions about the effectiveness of particular elements of the intervention may be asked, formal experiments are unlikely to be undertaken in formative evaluation, since the objective is not to determine whether something works in general, but how well it is working in the present case. For such purposes, formative evaluation is more likely to make use of reflective cycles for quality improvement such as plan–do–study–act (Langley et al 1996). Methods used in illuminative evaluation, such as asking participants to make notes on their experiences of particular events, may also be used in formative evaluation to provide information and explanation about what took place and to identify problems.
Origins
The origins of evaluation owe much to Popper (1945), who advocated ‘piecemeal social engineering’, that is, the introduction of modest interventions to deal with social problems, together with procedures to check that these interventions are having their intended effects, and to detect any unintended or unwanted effects (Tilley 2000). Popper’s approach stood in contrast to the utopian social engineering advocated by the neo-Marxists and revolutionaries of his time, of which he was deeply suspicious. The danger of utopian social engineering, in Popper’s view, was that it was likely to produce unintended, uncontrollable consequences that would do more harm than good. In contrast, piecemeal social engineering proceeds through a process of trial-and-error learning whereby the theories built into well-targeted social reforms, designed to tackle particular social problems, are checked, refined and improved through a series of social experiments. Such experiments would, according to Popper, produce measurable social benefits and contribute to the development of social science. As Popper put it:
The only course for the social sciences is to … tackle the practical problems of our time with the help of the theoretical methods which are fundamentally the same in all sciences. I mean the methods of trial and error, of inventing hypotheses which can be practically tested, and of submitting these to practical tests. A social technology is needed whose results can be tested by piecemeal social engineering. (Popper 1945, p. 222)
Like Popper before him, Campbell saw evaluation and social reform as going hand in hand. In his influential paper ‘Reforms as experiments’, Campbell (1969) set out the role of experimentation in evaluation studies, which was to evaluate the effectiveness of social reforms and thereby contribute to our understanding of ‘what works’. This knowledge might then be drawn on to inform future social programmes designed to address specific problems.
These ideas were picked up wholesale by policy makers from the 1960s onwards, helped by the US government setting aside for evaluation a proportion of the budgets of the many social programmes initiated at the time. Thus evaluation research became an industry.
Evaluation and research
Some of the research designs discussed elsewhere in this book are methods of evaluation. The great majority of the other designs, data collection techniques and methods of analysis covered in other chapters have also been used in evaluation studies, and many, such as interviews and surveys, are regularly employed. This is unsurprising, since evaluation studies need to obtain valid, reliable and trustworthy data and to draw convincing, accurate and defensible conclusions. However, the fact that evaluation studies do use research methods does not mean that all evaluation is research.
A basic definition of research is that it uses systematic procedures to generate new knowledge or build theory that has generalisable applicability. Properly conducted experimental studies are often seen as the paradigm case of research as defined in these terms. However, generalisability is often not a priority for those commissioning or undertaking evaluation. Hence many evaluation studies are not designed to deliver in this respect. One of the commonest reasons to undertake evaluation is to provide external stakeholders with evidence as to whether the initiative or programme is proceeding according to expectations. Whether the achievements are repeatable elsewhere may be of secondary concern. Another frequent purpose of evaluative work is to collect information for use in a formative way to develop the capacity of the initiative to meet its own objectives. Again, the preoccupation of the participants is likely to be merely with immediate local application of the results.
Nevertheless, as already indicated, some forms of evaluation, such as randomised controlled trials, are specifically designed to support generalisations about effectiveness. Other approaches, such as realist evaluation (a variant of what we have referred to as illuminative evaluation), which is discussed in more detail below, are designed to build theoretical understanding of what works, for whom and in what circumstances, thereby generating theories about causative mechanisms that have the potential for broader application. Finally, evaluation studies sometimes produce findings of more general interest because they involve analysis of unusual data sets, or generate investigations that would not otherwise be undertaken.
The debate between advocates of summative evaluation and illuminative evaluation is particularly relevant to the evaluation of nursing services. A feature of these is that they are often ‘complex’ in that they comprise a number of components that act both independently and interdependently. Our aim in the remainder of this chapter is to consider the limitations of summative evaluation with respect to the evaluation of complex interventions.
Evaluating complex interventions
As described above, the ‘gold-standard’ approach to summative evaluation, that is, establishing the effects of an ‘intervention’, involves experimentation, whereby the evaluator constructs equivalent experimental and control groups, applies the intervention to the experimental group alone and compares the changes that have taken place in the two groups. Ideally, there should be random allocation and sufficient numbers of subjects to ensure that there are no systematic differences between the two groups before the intervention is applied.
Where random allocation is not practicable, quasi-experimental designs are acceptable. In such cases experimental and control communities are selected so as to be as similar as possible. In both experimental and quasi-experimental designs, if the expected change occurs in the experimental condition but not in the control condition, and there are no unwanted side effects associated with the experimental group, then it can be concluded that the intervention is effective and that there are grounds for its more general application.
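For readers who find it helpful to see the logic of random allocation and between-group comparison written out explicitly, the short sketch below (in Python, with entirely invented participant numbers and outcome scores) illustrates the basic steps. It is an informal illustration of the principle only, not a recipe for analysing a real trial.

import random
from statistics import mean

random.seed(1)  # for a reproducible illustration

# 40 hypothetical participants, identified simply by number
participants = list(range(40))
random.shuffle(participants)              # random allocation
experimental = set(participants[:20])     # receive the intervention
control = set(participants[20:])          # do not

# Invented post-intervention outcome scores; in a real study these
# would be measurements taken after the intervention period
outcome = {p: random.gauss(60 if p in experimental else 50, 10)
           for p in participants}

# Compare the two groups on the chosen outcome measure
difference = (mean(outcome[p] for p in experimental)
              - mean(outcome[p] for p in control))
print(f"Mean outcome difference (experimental minus control): {difference:.1f}")

The same comparative logic underlies quasi-experimental designs, except that group membership is determined by the selection of similar sites or communities rather than by random allocation.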
Whilst there are a variety of experimental designs, some of considerable complexity (see, for example, Portney & Watkins 1993), the basic principles described above apply.