Reliability and validity



Geri LoBiondo-Wood and Judith Haber


Go to Evolve for review questions, critiquing exercises, and additional research articles for practice in reviewing and critiquing.

Measurement of nursing phenomena is a major concern of nursing researchers. Unless measurement instruments validly (accurately) and reliably (consistently) reflect the concepts of the theory being tested, conclusions drawn from a study will be invalid or biased and will not advance the development of evidence-based practice. Issues of reliability and validity are of central concern to researchers, as well as to an appraiser of research. From either perspective, the instruments that are used in a study must be evaluated. Researchers often face the challenge of developing new instruments and, as part of that process, establishing the reliability and validity of those instruments. The growing importance of measurement issues, instrument development, and related issues (e.g., reliability and validity) is evident in the Journal of Nursing Measurement and other nursing research journals.

Nurse investigators use instruments that have been developed by researchers in nursing and other disciplines. When reading studies, you must assess the reliability and validity of the instruments to determine the soundness of these selections in relation to the concepts (concepts are often called constructs in instrument development studies) or variables under study. The appropriateness of instruments and the extent to which reliability and validity are demonstrated have a profound influence on the strength of the findings and the extent to which bias is present. Invalid measures produce invalid estimates of the relationships between variables, thus introducing bias, which affects the study’s internal and external validity. As such, the assessment of reliability and validity is an extremely important critical appraisal skill for assessing the strength and quality of evidence provided by the design and findings of a study and its applicability to practice.

Regardless of whether a new or already developed instrument is used in a study, evidence of reliability and validity is of crucial importance. This chapter examines the major types of reliability and validity and demonstrates the applicability of these concepts to the evaluation of instruments in nursing research and evidence-based practice.

Reliability, validity, and measurement error

Reliability is the ability of an instrument to measure the attributes of a variable or construct consistently. Validity is the extent to which an instrument measures the attributes of a concept accurately. Each of these properties will be discussed later in the chapter. To understand reliability and validity, you need to understand potential errors related to instruments. Researchers may be concerned about whether the scores that were obtained for a sample of subjects were consistent, true measures of the behaviors and thus an accurate reflection of the differences between individuals. The extent of variability in test scores that is attributable to error rather than a true measure of the behaviors is the error variance. Error in measurement can occur in multiple ways.

An observed test score that is derived from a set of items actually consists of the true score plus error (Figure 15-1). The error may be either chance (random) error or systematic (constant) error. Validity is concerned with systematic error, whereas reliability is concerned with random error. Chance or random errors are difficult to control (e.g., a respondent’s anxiety level at the time of testing). Random errors are unsystematic in nature; they result from a transient state in the subject, the context of the study, or the administration of an instrument. For example, perceptions or behaviors that occur at a specific point in time (e.g., anxiety) are known as state or transient characteristics and are often beyond the awareness and control of the examiner. Another example of random error arises in a study that measures blood pressure: different readings could result from misplacing the cuff, not waiting a specified time period before taking the blood pressure, or positioning the arm inconsistently in relation to the heart.
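The behavior of random error can be illustrated with a minimal simulation. The sketch below is hypothetical (the true value, the noise level, and the number of readings are all illustrative): each reading equals the true value plus zero-mean random noise, so any single reading may be off, but the errors tend to cancel out across repeated measurements.

```python
import random

random.seed(42)

TRUE_SYSTOLIC = 120.0  # hypothetical true blood pressure (mm Hg)

def noisy_reading():
    # Random (chance) error: zero-mean noise from cuff placement,
    # timing, arm position, subject state, and so on.
    return TRUE_SYSTOLIC + random.gauss(0, 5)

one_reading = noisy_reading()
mean_of_100 = sum(noisy_reading() for _ in range(100)) / 100

print(round(one_reading, 1))  # a single reading may miss by several mm Hg
print(round(mean_of_100, 1))  # the mean of many readings lands near 120
```

Because random error is unsystematic, it adds noise (reducing reliability) without pushing all scores in one direction.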

Systematic or constant error is measurement error attributable to relatively stable characteristics of the study sample that may bias subjects’ behavior, and/or to incorrect instrument calibration. Such error has a systematic biasing influence on subjects’ responses and thereby influences the validity of the instrument. For instance, level of education, socioeconomic status, social desirability, response set, or other characteristics may influence validity by altering measurement of the “true” responses in a systematic way. For example, suppose a subject is completing a survey examining attitudes about caring for elderly patients. If the subject wants to please the investigator, items may consistently be answered in a socially desirable way rather than according to how the individual actually feels, making the estimate of validity inaccurate. Systematic error also occurs when an instrument is improperly calibrated. Consider a scale that consistently reports a person’s weight as 2 pounds less than the actual body weight. The scale could be quite reliable (i.e., capable of reproducing the precise measurement), but the result is consistently invalid.
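The miscalibrated-scale example can be sketched the same way. All numbers below are hypothetical: the scale is reliable (its readings agree closely with one another) yet invalid (every reading is biased by the same constant amount).

```python
import random

random.seed(1)

TRUE_WEIGHT = 150.0  # hypothetical actual body weight (pounds)
BIAS = -2.0          # miscalibrated scale reads 2 pounds low

# Tiny random jitter, dominated by the constant calibration error.
readings = [TRUE_WEIGHT + BIAS + random.gauss(0, 0.1) for _ in range(10)]

mean_reading = sum(readings) / len(readings)
spread = max(readings) - min(readings)

# Reliable: the readings agree closely with one another (small spread)...
print(round(spread, 2))
# ...but invalid: every reading sits about 2 pounds below the true weight.
print(round(mean_reading, 1))
```

Unlike random error, averaging more readings does not remove the bias: the mean stays about 2 pounds low no matter how many readings are taken.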

The concept of error is important when appraising instruments in a study. The information regarding the instruments’ reliability and validity is found in the instrument or measures section of a study, which can be separately titled or appear as a subsection of the methods section of a research report, unless the study is a psychometric or instrument development study (see Chapter 10).


Validity

Validity is the extent to which an instrument measures the attributes of a concept accurately. When an instrument is valid, it truly reflects the concept it is supposed to measure. A valid instrument that is supposed to measure anxiety does so; it does not measure some other concept, such as stress. A measure can be reliable but not valid. Let us say that a researcher wanted to measure anxiety in patients by measuring their body temperatures. The researcher could obtain highly accurate, consistent, and precise temperature recordings, but such a measure may not be a valid indicator of anxiety. Thus the high reliability of an instrument is not necessarily congruent with evidence of validity. A valid instrument, however, is reliable. An instrument cannot validly measure the attribute of interest if it is erratic, inconsistent, or inaccurate. There are three types of validity that vary according to the kind of information provided and the purpose of the instrument (i.e., content, criterion-related, and construct validity). As you appraise research articles you will want to evaluate whether sufficient evidence of validity is present and whether the type of validity is appropriate to the study’s design and instruments used in the study.

As you read the instruments or measures sections of studies, you will notice that validity data are reported much less frequently than reliability data. DeVon and colleagues (2007) note that adequate validity is frequently claimed, but the method is rarely specified. This lack of reporting, largely due to publication space constraints, underscores the importance of critiquing the quality of the instruments and the conclusions drawn from them (see Chapters 14 and 17).

Content validity

Content validity refers to the degree to which the items of an instrument represent the universe of content, or the domain, of a given variable/construct. The universe of content provides the basis for developing the items that will adequately represent that content. When an investigator is developing an instrument and issues of content validity arise, the concern is whether the measurement instrument and the items it contains are representative of the content domain that the researcher intends to measure. The researcher begins by defining the concept and identifying its attributes or dimensions; the items that reflect the concept and its domain are then developed.

When the researcher has completed this task, the items are submitted to a panel of judges considered to be experts about the concept. For example, researchers typically request that the judges indicate their agreement with the scope of the items and the extent to which the items reflect the concept under consideration. Box 15-1 provides an example of content validity.


Box 15-1 Examples of content validity and the content validity index

The following text from various articles describes how content validity and a content validity index can be determined in an article:

Content validity

“A panel of 13 experts evaluated content validity for the Perceived Self-Efficacy for Fatigue Self-Management. Experts were selected using selection criteria established by Grant and Davis and had experience in fatigue, clinical oncology, chronic illness, self-efficacy theory, research methods, statistics, or a combination of these. The expert panel were provided conceptual definitions, the measurement model, a description of the population and setting in which the instrument would be used. The panel identified items that were not stated clearly and commented on the items’ representativeness and the instrument’s comprehensiveness. Panel feedback was incorporated and concurrence achieved that the items were appropriate and relevant for persons with a chronic illness who were experiencing fatigue” (Hoffman et al., 2011, p. 169).

Content validity index

“For the original Relationships with Health Care Provider Scale (RHCPS), the Item-level Content Validity Index (I-CVI) was calculated by a panel of five content experts rating each scale’s item for its relevance to the construct of health care relationships. Experts were nurses with masters degrees or PhDs who were clinical providers or clinical researchers with experience with instrument development. The ratings were on a 4-point scale with a response format of 1 = not relevant to 4 = highly relevant. The I-CVI for each item was computed based on the percentage of experts giving a rating of 3 or 4, indicating item relevance …. The content validity index for the total scale (S-CVI), calculated by averaging the I-CVI responses from the five experts and dividing by the number of items, was equal to .96. A rating of .90 is considered to be an acceptable standard for an S-CVI” (Anderson, et al., 2011, p. 7).

Another method used to establish content validity is the content validity index (CVI). The CVI moves beyond the level of agreement of a panel of expert judges and calculates an index of interrater agreement or relevance. This calculation gives a researcher more confidence, or evidence, that the instrument truly reflects the concept or construct. When reading the instrument section of a research article, note that the authors will state whether a CVI was used to assess the content validity of an instrument. When reading a psychometric study that reports the development of an instrument, you will find much greater detail about exactly how the researchers calculated the CVI and the acceptable item cut-offs. In the scientific literature there has been discussion of accepting a CVI of .78 to 1.0, depending on the number of experts (DeVon et al., 2007; Lynn, 1986). An example from a study that used the CVI is presented in Box 15-1.

A subtype of content validity is face validity, a rudimentary type of validity that basically verifies that the instrument gives the appearance of measuring the concept. It is an intuitive type of validity in which colleagues or subjects are asked to read the instrument and evaluate the content in terms of whether it appears to reflect the concept the researcher intends to measure.
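The CVI arithmetic described in Box 15-1 can be sketched directly. The ratings matrix below is hypothetical, following the 4-point relevance scale from the Anderson et al. excerpt: the I-CVI for each item is the proportion of experts rating it 3 or 4, and the S-CVI/Ave is the mean of the item-level values.

```python
# Hypothetical relevance ratings (1 = not relevant ... 4 = highly relevant);
# rows = instrument items, columns = five expert judges.
ratings = [
    [4, 4, 3, 4, 4],  # item 1
    [3, 4, 4, 4, 3],  # item 2
    [4, 3, 4, 2, 4],  # item 3: one expert rated it "not relevant"
    [4, 4, 4, 4, 4],  # item 4
]

def i_cvi(item_ratings):
    # Item-level CVI: proportion of experts rating the item 3 or 4.
    relevant = sum(1 for r in item_ratings if r >= 3)
    return relevant / len(item_ratings)

item_cvis = [i_cvi(item) for item in ratings]
s_cvi_ave = sum(item_cvis) / len(item_cvis)  # scale-level CVI (average method)

print(item_cvis)            # [1.0, 1.0, 0.8, 1.0]
print(round(s_cvi_ave, 2))  # 0.95
```

With cut-offs in the .78 to 1.0 range discussed above, item 3 (I-CVI = .80) would prompt a closer look, while the scale-level value of .95 would generally be considered acceptable.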

Criterion-related validity

Criterion-related validity indicates the degree to which the subject’s performance on the instrument and the subject’s actual behavior are related. The criterion is usually a second measure that assesses the same concept under study. For example, in a study by Sherman and colleagues (2012) investigating the effects of psychoeducation and telephone counseling on the adjustment of women with early-stage breast cancer, criterion-related validity was supported by correlating amount of distress experienced (ADE) scores measured by the Breast Cancer Treatment Response Inventory (BCTRI) with total scores from the Symptom Distress Scale (r = .86; p < .000). Two forms of criterion-related validity are concurrent and predictive.

Concurrent validity refers to the degree of correlation of one test with the scores of another more established instrument of the same concept when both are administered at the same time. A high correlation coefficient indicates agreement between the two measures and evidence of concurrent validity.
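The correlation behind a concurrent validity claim can be computed directly. The scores below are hypothetical: a new scale and an established instrument for the same concept, administered to the same subjects at the same time, with the Pearson correlation coefficient calculated from its definition.

```python
from math import sqrt

# Hypothetical scores from the same eight subjects on a new anxiety scale
# and an established anxiety instrument, administered at the same time.
new_scale   = [12, 18, 25, 31, 9, 22, 27, 15]
established = [14, 20, 27, 33, 11, 21, 30, 16]

def pearson_r(x, y):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations (computed here via sums of squared deviations).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r(new_scale, established)
print(round(r, 2))  # a high r (close to 1) supports concurrent validity
```

A coefficient near 1 indicates that subjects are ranked almost identically by the two instruments, which is the evidence a concurrent validity claim rests on.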

Predictive validity refers to the degree of correlation between the measure of the concept and some future measure of the same concept. Because of the passage of time, the correlation coefficients are likely to be lower for predictive validity studies. Examples of concurrent and predictive validity as they appear in research articles are illustrated in Box 15-2.

Construct validity

Construct validity is based on the extent to which a test measures a theoretical construct, attribute, or trait. It attempts to validate the theory underlying the measurement by testing the hypothesized relationships. Such testing confirms or fails to confirm the relationships predicted between and/or among concepts and, as such, provides more or less support for the construct validity of the instruments measuring those concepts. The establishment of construct validity is complex, often involving several studies and approaches. The hypothesis-testing, factor analytic, convergent and divergent, and contrasted-groups approaches are discussed below. Box 15-3 provides examples of different types of construct validity as they are reported in published research articles.


Box 15-3 Examples of construct validity

The following examples from various articles describe how construct validity can be presented in an article.

Contrasted groups (known groups)

Melvin and colleagues (2012; Appendix D) used the Revised Dyadic Adjustment Scale (RDAS) and reported that the RDAS showed a .97 correlation with the original scale and good discrimination between distressed and nondistressed couples in a civilian population. In their study the RDAS was to be used for the first time in a population of military couples, so the contrasted-groups approach could be used to assess whether the RDAS discriminated between distressed and nondistressed couples in a military population.
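The logic of the contrasted-groups (known-groups) approach can be sketched with hypothetical data. The scores below are illustrative only (they are not from the Melvin study): if the instrument is valid, the two groups known to differ on the construct should score very differently, which is summarized here as a standardized mean difference (Cohen's d); published studies more often report a t-test for the same comparison.

```python
from math import sqrt

# Hypothetical dyadic adjustment scores (higher = better adjustment).
non_distressed = [52, 55, 58, 54, 57, 53, 56]
distressed     = [41, 44, 39, 43, 40, 42, 38]

def mean(xs):
    return sum(xs) / len(xs)

def pooled_sd(a, b):
    # Pooled standard deviation across the two groups.
    ssa = sum((x - mean(a)) ** 2 for x in a)
    ssb = sum((x - mean(b)) ** 2 for x in b)
    return sqrt((ssa + ssb) / (len(a) + len(b) - 2))

# Cohen's d: the group mean difference in pooled-SD units.
d = (mean(non_distressed) - mean(distressed)) / pooled_sd(non_distressed, distressed)
print(round(d, 1))  # a large d means the scale separates the known groups
```

A large, clearly separated difference between the known groups supports construct validity; overlapping group scores would argue against it.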


Feb 15, 2017 | Posted in NURSING
