Reliability and validity
Geri LoBiondo-Wood and Judith Haber
After reading this chapter, you should be able to do the following:
• Discuss how measurement error can affect the outcomes of a research study.
• Discuss the purposes of reliability and validity.
• Discuss the concepts of stability, equivalence, and homogeneity as they relate to reliability.
• Compare and contrast the estimates of reliability.
• Compare and contrast content, criterion-related, and construct validity.
• Identify the criteria for critiquing the reliability and validity of measurement tools.
• Use the critiquing criteria to evaluate the reliability and validity of measurement tools.
• Discuss how evidence related to reliability and validity contributes to the strength and quality of evidence provided by the findings of a research study and applicability to practice.
Go to Evolve at http://evolve.elsevier.com/LoBiondo/ for review questions, critiquing exercises, and additional research articles for practice in reviewing and critiquing.
Nurse investigators use instruments that have been developed by researchers in nursing and other disciplines. When reading studies, you must assess the reliability and validity of the instruments to determine the soundness of these selections in relation to the concepts (concepts are often called constructs in instrument development studies) or variables under study. The appropriateness of instruments and the extent to which reliability and validity are demonstrated have a profound influence on the strength of the findings and the extent to which bias is present. Invalid measures produce invalid estimates of the relationships between variables, thus introducing bias, which affects the study’s internal and external validity. As such, the assessment of reliability and validity is an extremely important critical appraisal skill for assessing the strength and quality of evidence provided by the design and findings of a study and its applicability to practice.
Regardless of whether a new or already developed instrument is used in a study, evidence of reliability and validity is of crucial importance. This chapter examines the major types of reliability and validity and demonstrates the applicability of these concepts to the evaluation of instruments in nursing research and evidence-based practice.
Reliability, validity, and measurement error
Reliability is the ability of an instrument to measure the attributes of a variable or construct consistently. Validity is the extent to which an instrument measures the attributes of a concept accurately. Each of these properties will be discussed later in the chapter. To understand reliability and validity, you need to understand potential errors related to instruments. Researchers may be concerned about whether the scores that were obtained for a sample of subjects were consistent, true measures of the behaviors and thus an accurate reflection of the differences between individuals. The extent of variability in test scores that is attributable to error rather than a true measure of the behaviors is the error variance. Error in measurement can occur in multiple ways.
An observed test score that is derived from a set of items actually consists of the true score plus error (Figure 15-1). The error may be chance (random) error, or it may be systematic (constant) error. Validity is concerned with systematic error, whereas reliability is concerned with random error. Chance or random errors are errors that are difficult to control (e.g., a respondent’s anxiety level at the time of testing). Random errors are unsystematic in nature; they result from a transient state in the subject, the context of the study, or the administration of an instrument. For example, perceptions or behaviors that occur at a specific point in time (e.g., anxiety) are known as a state or transient characteristic and are often beyond the awareness and control of the examiner. Another example of random error occurs in a study that measures blood pressure. Different blood pressure readings could result from misplacement of the cuff, failure to wait a specified period before taking the blood pressure, or inconsistent positioning of the arm relative to the heart during measurement.
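In classical test theory, this relationship is commonly written as X = T + E, where X is the observed score, T is the true score, and E is the error component. Reliability is then defined as the proportion of observed-score variance that is attributable to true-score variance, Var(T)/Var(X); the smaller the error variance, the closer this ratio comes to 1.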
Systematic or constant error is measurement error that is attributable to relatively stable characteristics of the study sample that may bias their behavior and/or cause incorrect instrument calibration. Such error has a systematic biasing influence on the subjects’ responses and thereby influences the validity of the instruments. For instance, level of education, socioeconomic status, social desirability, response set, or other characteristics may influence the validity of the instrument by altering measurement of the “true” responses in a systematic way. For example, suppose a subject is completing a survey examining attitudes about caring for elderly patients. If the subject wants to please the investigator, items may consistently be answered in a socially desirable way rather than according to how the individual actually feels, thus making the estimate of validity inaccurate. Systematic error also occurs when an instrument is improperly calibrated. Consider a scale that consistently gives a person’s weight as 2 pounds less than the actual body weight. The scale could be quite reliable (i.e., capable of reproducing the precise measurement), but the result is consistently invalid.
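To make the miscalibrated-scale example concrete, the following minimal Python sketch (with hypothetical weights and a hypothetical 2-pound calibration error, not data from any actual study) shows that repeated readings can agree almost perfectly with one another, indicating high reliability, while every reading is systematically wrong, indicating poor validity:

import random
from statistics import correlation, mean  # statistics.correlation requires Python 3.10+

random.seed(1)

# Hypothetical "true" body weights in pounds for 50 people.
true_weights = [random.uniform(120.0, 200.0) for _ in range(50)]

def miscalibrated_scale(weight):
    # Reads 2 pounds low every time (systematic error), plus a tiny random error.
    return weight - 2.0 + random.gauss(0.0, 0.1)

first_readings = [miscalibrated_scale(w) for w in true_weights]
second_readings = [miscalibrated_scale(w) for w in true_weights]

# Reliability: the two sets of readings correlate almost perfectly (~1.0).
print(round(correlation(first_readings, second_readings), 3))

# Validity: on average, every reading is about 2 pounds below the true weight.
print(round(mean(f - t for f, t in zip(first_readings, true_weights)), 1))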
The concept of error is important when appraising instruments in a study. The information regarding the instruments’ reliability and validity is found in the instrument or measures section of a study, which can be separately titled or appear as a subsection of the methods section of a research report, unless the study is a psychometric or instrument development study (see Chapter 10).
Validity
Validity is the extent to which an instrument measures the attributes of a concept accurately. When an instrument is valid, it truly reflects the concept it is supposed to measure. A valid instrument that is supposed to measure anxiety does so; it does not measure some other concept, such as stress. A measure can be reliable but not valid. Let us say that a researcher wanted to measure anxiety in patients by measuring their body temperatures. The researcher could obtain highly accurate, consistent, and precise temperature recordings, but such a measure may not be a valid indicator of anxiety. Thus the high reliability of an instrument is not necessarily congruent with evidence of validity. A valid instrument, however, is reliable. An instrument cannot validly measure the attribute of interest if it is erratic, inconsistent, or inaccurate. There are three types of validity that vary according to the kind of information provided and the purpose of the instrument (i.e., content, criterion-related, and construct validity). As you appraise research articles, you will want to evaluate whether sufficient evidence of validity is present and whether the type of validity is appropriate to the study’s design and the instruments used.
As you read the instruments or measures sections of studies, you will notice that validity data are reported much less frequently than reliability data. DeVon and colleagues (2007) note that adequate validity is frequently claimed, but the method is rarely specified. This lack of reporting, largely a consequence of publication space constraints, underscores the importance of critiquing the quality of the instruments and the conclusions drawn from them (see Chapters 14 and 17).
Content validity
Content validity concerns the degree to which an instrument represents the universe of content, or the domain, of a given variable/construct. The universe of content provides the basis for developing the items that will adequately represent that content. When an investigator is developing an instrument and issues of content validity arise, the concern is whether the measurement instrument and the items it contains are representative of the content domain that the researcher intends to measure. The researcher begins by defining the concept and identifying the attributes or dimensions of the concept, and then develops the items that reflect the concept and its domain.
When the researcher has completed this task, the items are submitted to a panel of judges considered to be experts about the concept. For example, researchers typically request that the judges indicate their agreement with the scope of the items and the extent to which the items reflect the concept under consideration. Box 15-1 provides an example of content validity.
Another method used to establish content validity is the content validity index (CVI). The CVI moves beyond the level of agreement of a panel of expert judges and calculates an index of interrater agreement or relevance. This calculation gives a researcher more confidence, or evidence, that the instrument truly reflects the concept or construct. When reading the instrument section of a research article, note that the authors will state whether a CVI was used to assess the content validity of an instrument. When reading a psychometric study that reports the development of an instrument, you will find a much more detailed account of exactly how the researchers calculated the CVI and the item cut-offs they considered acceptable. In the scientific literature there has been discussion of accepting a CVI of .78 to 1.0, depending on the number of experts (DeVon et al., 2007; Lynn, 1986). An example from a study that used the CVI is presented in Box 15-1. A subtype of content validity is face validity, a rudimentary type of validity that basically verifies that the instrument gives the appearance of measuring the concept. It is an intuitive type of validity in which colleagues or subjects are asked to read the instrument and evaluate the content in terms of whether it appears to reflect the concept the researcher intends to measure.
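As an illustration of the usual calculation, here is a minimal Python sketch using hypothetical expert ratings on the common 4-point relevance scale (1 = not relevant to 4 = highly relevant); the item-level CVI (I-CVI) is the proportion of experts rating an item 3 or 4, and the scale-level CVI is often reported as the average of the item CVIs:

# Hypothetical ratings: one row per item, one column per expert,
# on a 4-point relevance scale (1 = not relevant ... 4 = highly relevant).
ratings = [
    [4, 4, 3, 4, 4],  # item 1
    [3, 4, 4, 4, 3],  # item 2
    [2, 3, 4, 3, 4],  # item 3
]

def item_cvi(item_ratings):
    # I-CVI: proportion of experts who rated the item 3 or 4 (relevant).
    relevant = sum(1 for r in item_ratings if r >= 3)
    return relevant / len(item_ratings)

i_cvis = [item_cvi(row) for row in ratings]
scale_cvi = sum(i_cvis) / len(i_cvis)  # S-CVI/Ave: mean of the item-level CVIs

print([round(v, 2) for v in i_cvis])  # [1.0, 1.0, 0.8]
print(round(scale_cvi, 2))            # 0.93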
Criterion-related validity
Criterion-related validity indicates the degree to which the subject’s performance on the instrument and the subject’s actual behavior are related. The criterion is usually a second measure that assesses the same concept under study. For example, in a study by Sherman and colleagues (2012) investigating the effects of psychoeducation and telephone counseling on the adjustment of women with early-stage breast cancer, criterion-related validity was supported by correlating amount of distress experienced (ADE) scores measured by the Breast Cancer Treatment Response Inventory (BCTRI) with total scores from the Symptom Distress Scale (r = .86; p < .001). Two forms of criterion-related validity are concurrent and predictive.
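In practice, this kind of evidence is usually a Pearson correlation between paired scores. Here is a minimal Python sketch of the calculation, using hypothetical scores (not the Sherman data):

from statistics import correlation  # requires Python 3.10+

# Hypothetical paired scores from the same subjects: a new instrument
# and an established criterion measure of the same concept.
instrument_scores = [12, 18, 25, 31, 40, 44, 52, 60]
criterion_scores = [10, 20, 27, 29, 38, 47, 50, 63]

r = correlation(instrument_scores, criterion_scores)  # Pearson r
print(round(r, 2))  # a high positive r supports criterion-related validity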
Concurrent validity refers to the degree of correlation between two measures of the same concept administered at the same time. Predictive validity refers to the degree of correlation between a measure of the concept and some future measure of the same concept. Because of the passage of time, correlation coefficients are likely to be lower for predictive validity studies. Examples of concurrent and predictive validity as they appear in research articles are illustrated in Box 15-2.
Construct validity
Construct validity is based on the extent to which a test measures a theoretical construct, attribute, or trait. It attempts to validate the theory underlying the measurement by testing the hypothesized relationships. Such testing confirms or fails to confirm the relationships predicted between and/or among concepts and, as such, provides more or less support for the construct validity of the instruments measuring those concepts. The establishment of construct validity is complex, often involving several studies and approaches. The hypothesis-testing, factor analytical, convergent and divergent, and contrasted-groups approaches are discussed below. Box 15-3 provides examples of the different types of construct validity as they are reported in published research articles.
Hypothesis-testing approach
When the hypothesis-testing approach is used, the investigator uses the theory or concept underlying the measurement instrument to validate the instrument. The investigator does this by developing hypotheses about how individuals with varying scores on the instrument should behave, collecting data to test those hypotheses, and then inferring from the findings whether the rationale underlying the instrument’s construction adequately explains them, thereby providing evidence of construct validity.
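For example, suppose the theory underlying a new anxiety instrument predicts that preoperative patients should score higher than community-dwelling adults. A minimal sketch of how such a hypothesis could be tested, using hypothetical scores and SciPy’s independent-samples t-test (SciPy assumed to be available):

from scipy import stats  # assumes SciPy is installed

# Hypothetical total scores on the new instrument for two groups that
# theory predicts should differ in anxiety.
preop_scores = [62, 58, 71, 65, 69, 60, 74, 67]
community_scores = [41, 49, 38, 52, 45, 40, 47, 43]

t_stat, p_value = stats.ttest_ind(preop_scores, community_scores)

# A significant difference in the predicted direction supports the
# construct validity of the instrument; failure to find it does not.
print(round(t_stat, 2), round(p_value, 4))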