2
QUALITIES OF EFFECTIVE ASSESSMENT PROCEDURES: VALIDITY, RELIABILITY, AND USABILITY
How does a teacher know whether a test or another assessment instrument is good? If assessment results will be used to make important educational decisions, such as assigning grades and determining whether students are eligible for graduation, teachers must have confidence in their interpretations of test scores. Some high-stakes educational decisions have consequences for faculty members and administrators as well as students. Good assessments produce results that can be used to make appropriate inferences about learners’ knowledge and abilities and thus facilitate effective decision-making. In addition, assessment tools should be practical and easy to use.
Two important questions have been posed to guide the process of constructing or selecting tests and other assessments:
1. To what extent will the interpretation of the scores be appropriate, meaningful, and useful for the intended application of the results?
2. What are the consequences of the particular uses and interpretations that are made of the results (Miller, Linn, & Gronlund, 2013, p. 70)?
This chapter explains the concept of assessment validity, the role of reliability, and their effects on the interpretive quality of assessment results. It also discusses important practical considerations that might affect the choice or development of tests and other instruments.
Assessment Validity
Definitions of validity have changed over time. Early definitions, formed in the 1940s and early 1950s, emphasized the validity of an assessment tool itself. Tests were characterized as valid or not, apart from consideration of how they were used. It was common in that era to support a claim of validity with evidence that a test correlated well with another “true” criterion. The concept of validity changed, however, in the 1950s through the 1970s to focus on evidence that an assessment tool is valid for a specific purpose. Most measurement textbooks of that era classified validity by three types—content, criterion-related, and construct—and suggested that validation of a test should include more than one approach. In the 1980s, the understanding of validity shifted again, to an emphasis on providing evidence to support the particular inferences that teachers make from assessment results. Validity was defined in terms of the appropriateness and usefulness of the inferences made from assessments, and assessment validation was seen as a process of collecting evidence to support those inferences. The usefulness of the validity “triad” also was questioned; increasingly, measurement experts recognized that construct validity was the key element and unifying concept of validity (Goodwin, 1997; Goodwin & Goodwin, 1999).
The current philosophy of validity continues to focus not on assessment tools themselves or on the appropriateness of using a test for a specific purpose, but on the meaningfulness of the interpretations that teachers make of assessment results. Tests and other assessment instruments yield scores that teachers use to make inferences about how much learners know or what they can do. Validity refers to the adequacy and appropriateness of those interpretations and inferences and how the assessment results are used (Miller et al., 2013). The emphasis is on the consequences of measurement: Does the teacher make accurate interpretations about learners’ knowledge or ability based on their assessment scores? Assessment experts increasingly suggest that in addition to collecting evidence to support the accuracy of inferences made, evidence also should be collected about the intended and unintended consequences of the use of a test (Brookhart & Nitko, 2019; Goodwin, 1997; Goodwin & Goodwin, 1999).
Validity does not exist on an all-or-none basis (Miller et al., 2013); there are degrees of validity depending on the purpose of the assessment and how the results are to be used. A given assessment may be used for many different purposes, and inferences about the results may have greater validity for one purpose than for another. For example, a test designed to measure knowledge of perioperative nursing guidelines may produce results that have high validity for the purpose of determining certification for perioperative staff nurses, but the results may have low validity for assigning grades to students in a perioperative nursing elective course. In addition, validity evidence may change over time, so validation of inferences must not be considered a one-time event.
No one assessment will produce results that are perfectly valid for a given purpose. Combining results from several different types of assessments, such as tests, written assignments, and class participation, improves the validity of the decisions made about students’ attainments. In addition, weighing one assessment outcome too heavily in relation to others, such as basing course grades almost exclusively on test scores, results in lowered validity (Brookhart & Nitko, 2019).
Validity now is considered a unitary concept (Brookhart & Nitko, 2019; Miller et al., 2013). The concept of validity in testing is described in the Standards for Educational and Psychological Testing prepared by a joint committee of the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). The most recent Standards (2014) no longer includes the view that there are different types of validity—for example, construct, criterion-related, and content.
Instead, there is a variety of sources of evidence to support the validity of the interpretation and use of assessment results. The strongest case for validity can be made when evidence is collected regarding four major considerations for validation:
1. Content
2. Construct
3. Assessment–criterion relationships
4. Consequences (Miller et al., 2013, p. 74)
Each of these considerations is discussed as to how it can be used in nursing education settings.
Content Considerations
The goal of content validation is to determine the degree to which a sample of assessment tasks accurately represents the domain of content or abilities about which the teacher wants to interpret assessment results. Tests and other assessment measures usually contain only a sample of all possible items or tasks that could be used to assess the domain of interest. However, interpretations of assessment results are based on what the teacher believes to be the universe of items that could have been generated. In other words, when a student correctly answers 83% of the items on a women’s health nursing final examination, the teacher usually infers that the student probably would answer correctly 83% of all items in the universe of women’s health nursing content. The test score thus serves as an indicator of the student’s true standing in the larger domain. Although this type of generalization is commonly made, it should be noted that the domains of achievement in nursing education involve complex understandings and integrated performances, about which it is difficult to judge the representativeness of a sample of assessment tasks (Miller et al., 2013).
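The logic of this generalization can be illustrated with a brief simulation. The sketch below is hypothetical: it invents a domain of possible items and a student whose true proficiency is 83%, then draws random samples of items to show that the observed percent-correct score is only an estimate of the student's true standing, with shorter tests generally giving less precise estimates.

```python
# A minimal, hypothetical simulation of domain sampling: a student who truly
# "knows" 83% of a large content universe is tested with random samples of
# items, and the observed percent-correct varies from sample to sample.
import random

random.seed(1)

DOMAIN_SIZE = 5000        # hypothetical universe of possible items
TRUE_PROFICIENCY = 0.83   # proportion of the domain the student truly knows

# Mark which items in the domain the student "knows."
known = [random.random() < TRUE_PROFICIENCY for _ in range(DOMAIN_SIZE)]

for test_length in (25, 50, 100):
    sample = random.sample(range(DOMAIN_SIZE), test_length)
    observed = sum(known[i] for i in sample) / test_length
    print(f"{test_length}-item test: observed score = {observed:.0%}")

# Longer tests tend to yield observed scores closer to the 83% true standing,
# which is why sampling adequacy matters for content-based interpretations.
```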
A superficial conclusion could be made about the match between a test’s appearance and its intended use by asking a panel of experts to judge whether the test appears to be based on appropriate content. This type of judgment, sometimes referred to as face validity, is not sufficient evidence of content representativeness and should not be used as a substitute for rigorous appraisal of sampling adequacy (Miller et al., 2013).
Efforts to include suitable content on an assessment can and should be made during its development. This process begins with defining the universe of content. The content definition should be related to the purpose for which the test will be used. For example, if a test is supposed to measure a new staff nurse’s understanding of hospital safety policies and procedures presented during orientation, the teacher first defines the universe of content by outlining the knowledge about policies that the staff nurse needs to function satisfactorily. The teacher then uses professional judgment to write or select test items that satisfactorily represent this desired content domain. A system for documenting this process, the construction of a test blueprint or table of specifications, is described in Chapter 3, Planning for Testing.
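Although the table of specifications itself is described in Chapter 3, a simple sketch may clarify how a defined content domain can be documented during test development. The blueprint below is hypothetical; the content areas and item counts are invented for illustration and would, in practice, be derived from the actual orientation content.

```python
# A hypothetical table of specifications for the orientation safety test
# described above (content areas and item counts are illustrative only).
blueprint = {
    # content area: number of items planned
    "Fire and electrical safety policies": 10,
    "Infection prevention procedures": 15,
    "Medication safety and high-alert drugs": 15,
    "Patient identification and fall prevention": 10,
}

total_items = sum(blueprint.values())
for area, n_items in blueprint.items():
    print(f"{area}: {n_items} items ({n_items / total_items:.0%} of the test)")
print(f"Total: {total_items} items")
```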
If the teacher needs to select an appropriate assessment for a particular use, for example, choosing a standardized achievement test, content validation is also of concern. A published test may or may not be suitable for the intended use in a particular nursing education program or with a specific group of learners. The ultimate responsibility for appropriate use of an assessment and interpretation of results lies with the teacher (AERA, APA, & NCME, 2014; Miller et al., 2013). To determine the extent to which an existing test is suitable, experts in the domain review the assessment, item by item, to judge whether the items or tasks are relevant to and adequately represent the defined domain (as documented in the table of specifications) and the desired learning outcomes. Because these judgments admittedly are subjective, the trustworthiness of this evidence depends on clear instructions to the experts and an estimate of rater reliability.
Construct Considerations
Construct validity has been proposed as the “umbrella” under which all types of assessment validation belong (Goodwin, 1997; Goodwin & Goodwin, 1999). Content validation determines how well test scores represent a given domain and is important in evaluating assessments of achievement. When teachers need to make inferences from assessment results to more general abilities and characteristics, however, such as clinical reasoning or communication ability, a critical consideration is the construct that the assessment is intended to measure (Miller et al., 2013).
A construct is an individual characteristic that is assumed to exist because it explains some observed behavior. As a theoretical construction, it cannot be observed directly, but it can be inferred from performance on an assessment. Construct validation is the process of determining the extent to which assessment results can be interpreted in terms of a given construct or set of constructs. Two questions, applicable to both teacher-constructed and published assessments, are central to the process of construct validation:
1. How adequately does the assessment represent the construct of interest (construct representation)?
2. Is the observed performance influenced by any irrelevant or ancillary factors (construct relevance)? (Miller et al., 2013, p. 81)
Assessment validity is reduced to the extent that important elements of the construct are underrepresented in the assessment. For example, if the construct of interest is clinical problem-solving ability, the validity of a clinical performance assessment would be weakened if it focused entirely on problems defined by the teacher, because the learner’s ability to recognize and define clinical problems is an important aspect of clinical problem-solving (Gaberson & Oermann, 2018).
The influence of factors that are unrelated or irrelevant to the construct of interest also reduces assessment validity (Brookhart & Nitko, 2019). For example, students who are non-native English speakers may perform poorly on an assessment of clinical problem-solving, not because of limited ability to recognize, identify, and solve problems, but because of unfamiliarity with language or cultural colloquialisms used by patients or teachers (Bosher, 2009; Bosher & Bowles, 2008). Another potential construct-irrelevant factor is writing skill. For example, the ability to communicate clearly and accurately in writing may be an important outcome of a nursing education program, but the construct of interest for a course writing assignment is clinical problem-solving. To the extent that student scores on that assignment are affected by spelling or grammatical errors, construct-irrelevant variance is introduced and the validity of the assessment is reduced. Testwiseness, performance anxiety, and learner motivation are additional examples of possible construct-irrelevant factors that may undermine assessment validity (Miller et al., 2013).
Construct validation for a teacher-made assessment occurs primarily during its development by collecting evidence of construct representation and construct relevance from a variety of sources. Test manuals for published tests should include evidence that these methods were used to generate evidence of construct validity. Methods used in construct validation include:
1. Defining the domain to be measured. The assessment specifications should clearly define the meaning of the construct so that it is possible to judge whether the assessment includes relevant and representative tasks.
2. Analyzing the process of responding to tasks required by the assessment. The teacher can administer an assessment task to the learners (e.g., a multiple-choice item that purportedly assesses clinical reasoning) and ask them to think aloud while they perform the test (e.g., explain how they arrived at the answer they chose). This method may reveal that students were able to identify the correct answer because the same example was used in class or in an assigned reading, not because they were able to analyze the situation critically.
3. Comparing assessment results of known groups. Sometimes it is reasonable to expect that scores on a particular measure will differ from one group to another because members of those groups are known to possess different levels of the ability being measured. For example, if the purpose of a test is to measure students’ ability to solve pediatric clinical problems, students who achieve high scores on this test would be assumed to be better problem solvers than students who achieve low scores. To collect evidence in support of this assumption, the teacher might design a study to determine whether student scores on the test are correlated with their scores on a standardized test of clinical problem-solving in nursing. The teacher could divide the sample of students into two groups based on their standardized test scores: those who scored high on the standardized test in one group and those whose standardized test scores were low in the other group. Then the teacher would compare the teacher-made test scores of the students in both groups. If the teacher’s hypothesis is confirmed (i.e., if the students with high-standardized test scores obtained high scores on the teacher-made test), this evidence could be used as partial support for construct validation (Miller et al., 2013).
Group-comparison techniques also have been used in studies of test bias or test fairness. Approaches to detection of test bias have looked for differential item functioning (DIF) related to test-takers’ race, gender, or culture. If test items function differently for members of groups with characteristics that do not directly relate to the variable of interest, differential validity of inferences from the test scores may result. Issues related to test bias are discussed more fully in Chapter 16, Social, Ethical, and Legal Issues.
4. Comparing assessment results before and after a learning activity. It is reasonable to expect that assessments of student performance would improve during instruction, whether in the classroom or in the clinical area, but assessment results should not be affected by other variables such as anxiety or memory of the preinstruction assessment content. For example, evidence that assessment scores improve following instruction but are unaffected by an intervention designed to reduce students’ test anxiety would support the assessment’s construct validity (Miller et al., 2013).
5. Correlating assessment results with other measures. Scores produced by a particular assessment should correlate well with scores of other measures of the same construct but show poor correlation with measures of a different construct. For example, teachers’ ratings of students’ performance in pediatric clinical settings should correlate highly with scores on a final exam testing knowledge of nursing care of children, but may not correlate satisfactorily with their classroom or clinical performance in a women’s health course. These correlations may be used to support the claim that a test measures the construct of interest (Miller et al., 2013).
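Methods 3 and 5 both rest on correlational evidence, which can be computed with little effort once scores are available. The following sketch uses invented scores for 10 students and assumes Python 3.10 or later (for the statistics.correlation function). It compares the correlation of a teacher-made test with a measure of the same construct against its correlation with a measure of a different construct, then performs a simple known-groups comparison.

```python
# A minimal sketch (hypothetical data) of the correlational evidence described
# in methods 3 and 5: scores on a teacher-made test should correlate more
# strongly with a measure of the same construct than with a different one.
from statistics import correlation, mean  # correlation requires Python 3.10+

teacher_test   = [72, 85, 90, 64, 78, 88, 70, 95, 60, 82]   # teacher-made problem-solving test
standardized   = [70, 82, 94, 60, 75, 90, 68, 97, 58, 80]   # standardized problem-solving test
unrelated_test = [88, 70, 65, 90, 72, 68, 85, 66, 80, 74]   # measure of a different construct

print("Same construct:      r =", round(correlation(teacher_test, standardized), 2))
print("Different construct: r =", round(correlation(teacher_test, unrelated_test), 2))

# Known-groups comparison (method 3): split on the standardized test and
# compare mean teacher-made scores of the high and low groups.
cut = sorted(standardized)[len(standardized) // 2]
high = [t for t, s in zip(teacher_test, standardized) if s >= cut]
low  = [t for t, s in zip(teacher_test, standardized) if s < cut]
print("Mean teacher-made score, high group:", round(mean(high), 1))
print("Mean teacher-made score, low group: ", round(mean(low), 1))
```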
Assessment–Criterion Relationship Considerations
This approach to obtaining validity evidence focuses on predicting future performance (the criterion) based on current assessment results. For example, nursing faculties often use scores from a standardized comprehensive exam given in the final academic semester or quarter to predict whether prelicensure students are likely to be successful on the NCLEX® (National Council Licensure Examination; the criterion measure). Obtaining this type of evidence involves a predictive validation study (Miller et al., 2013).
If teachers want to use assessment results to estimate students’ performance on another assessment (the criterion measure) at the same time, the validity evidence is concurrent, and obtaining this type of evidence requires a concurrent validation study. This type of evidence may be desirable for making a decision about whether one test or measurement instrument may be substituted for another, more resource-intensive one. For example, a staff development educator may want to collect concurrent validity evidence to determine whether a checklist with a rating scale can be substituted for a less efficient narrative appraisal of a staff nurse’s competence.
Teachers rarely conduct formal studies of the extent to which the scores on assessments that they have constructed are correlated with criterion measures. In some cases, adequate criterion measures are not available; the test in use is considered to be the best instrument that has been devised to measure the ability in question. If better measures were available, they might be used instead of the test being validated. However, for tests with high-stakes outcomes, such as licensure and certification, this type of validity evidence is crucial. Multiple criterion measures often are used so that the strengths of one measure may offset the weaknesses of others.
The relationship between assessment scores and those obtained on the criterion measure usually is expressed as a correlation coefficient. A single desired level of correlation between the two measures cannot be recommended, because the correlation may be influenced by a number of factors, including test length, the variability of scores in the distribution, and the amount of time between measures. The teacher who uses the test must exercise professional judgment to determine what magnitude of correlation is adequate for the intended use of the assessment for which criterion-related evidence is sought.
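As a simple illustration of a predictive validation study, the sketch below correlates invented comprehensive exam scores with an invented pass/fail criterion coded as 1 or 0 (a point-biserial correlation, computed here as an ordinary Pearson correlation). The data and the resulting coefficient are hypothetical, and the sketch assumes Python 3.10 or later for statistics.correlation.

```python
# A minimal sketch (hypothetical data) of a predictive validation study:
# the correlation between comprehensive exam scores and a later pass/fail
# criterion (coded 1 = pass, 0 = fail) serves as the validity coefficient.
from statistics import correlation  # requires Python 3.10+

comprehensive_exam = [68, 72, 75, 80, 83, 85, 88, 90, 93, 96]  # predictor scores
nclex_pass         = [0,  0,  1,  0,  1,  1,  1,  1,  1,  1]   # criterion outcome

validity_coefficient = correlation(comprehensive_exam, nclex_pass)
print(f"Predictive validity coefficient: {validity_coefficient:.2f}")

# Whether this magnitude is "high enough" is a judgment call that depends on
# how the scores will be used, as noted above.
```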
Consideration of Consequences
Incorporating concern about the consequences of assessment into the concept of validity is a relatively recent trend. Assessment has both intended and unintended consequences, and teachers and administrators must consider those consequences when judging whether or not they are using the results validly (Brookhart & Nitko, 2019). For example, the faculties of many undergraduate nursing programs have adopted programs of achievement testing that are designed to assess student performance throughout the nursing curriculum. The intended positive consequence of such testing is to identify students at risk of failure on the NCLEX, and to use this information to design remediation programs to increase student learning. Unintended negative consequences, however, may include increased student anxiety, decreased time for instruction relative to increased time allotted for testing, and tailoring instruction to more closely match the content of the tests while focusing less intently on other important aspects of the curriculum that will not be tested on the NCLEX. The intended consequence of using standardized comprehensive exam scores to predict success on the NCLEX may be to motivate students whose assessment results predict failure to remediate and prepare more thoroughly for the licensure exam. But an unintended consequence might be that students whose comprehensive exam scores predict NCLEX success may decide not to prepare further for that important exam, risking a negative outcome.
Ultimately, assessment validity requires an evaluation of interpretations and use of assessment results. The concept of validity thus has expanded to include consideration of the consequences of assessment use and how results are interpreted to students, teachers, and other stakeholders. An adequate consideration of consequences must include both intended and unintended effects of assessment, particularly when assessment results are used to make high-stakes decisions (Miller et al., 2013).
Influences on Validity
A number of factors affect the validity of assessment results, including characteristics of the assessment itself, the administration and scoring procedures, and the test-takers. Teachers should be alert to these factors when constructing assessments or choosing published ones (Miller et al., 2013).
Characteristics of the Assessment
Many factors can prevent the assessment items or tasks from functioning as intended, thereby decreasing the validity of the interpretations of the assessment results. Such factors include unclear directions, ambiguous statements, oversampling of easy-to-assess aspects, too few assessment items, poor arrangement of assessment items, an obvious pattern of correct answers, and clerical errors in test construction (Miller et al., 2013). Ways to prevent test-construction errors such as these are addressed in Chapters 3, 4, 5, 6, 7, and 10.
Assessment Administration and Scoring Factors
On teacher-made assessments, factors such as insufficient time, inconsistency in giving aid to students who ask questions during the assessment, cheating, and scoring errors may lower validity. On published assessments, an additional factor may be failure to follow the standard directions, including time limits (Miller et al., 2013).
Student Characteristics
Some invalid interpretations of assessment results are the result of personal factors that influence a student’s performance on the assessment. For example, a student may have had an emotionally upsetting event such as an auto accident or death in the family just prior to the assessment, test anxiety may prevent the student from performing according to true ability level, or the student may not be motivated to exert maximum effort on the assessment. These and similar factors may modify student responses on the assessment and distort the results, leading to lower validity (Miller et al., 2013).
Reliability
Reliability refers to the consistency of assessment results. If an assessment produces reliable scores, the same group of students would achieve approximately the same scores if the same assessment were given on another occasion, assuming that no further learning had taken place during the time interval. Each assessment produces a limited measure of performance at a specific time. If this measurement is reasonably consistent over time, with different raters, or with different samples of the same domain, teachers can be more confident in the assessment results.
Perfect consistency is indicated by a reliability coefficient of 1.00. However, this value is virtually never obtained in real educational settings. Standardized achievement tests usually have reliability coefficients in the .85 to .95 range (Brookhart & Nitko, 2019), but teacher-made tests rarely demonstrate this level of consistency because many extraneous factors may influence the measurement of performance.
Assessment results may be inconsistent because:
1. The behavior being measured is unstable over time because of fluctuations in memory, attention, and effort; intervening learning experiences; or varying emotional or health status.
2. The sample of tasks varies from one assessment to another, and some students find one assessment to be easier than another because it contains tasks related to topics they know well.
3. Assessment conditions vary significantly between assessments.
4. Scoring procedures are inconsistent (the same rater may use different criteria on different assessments, or different raters may not reach perfect agreement on the same assessment).
These and other factors introduce a certain but unknown amount of error into every measurement. Methods of determining assessment reliability, therefore, are means of estimating how much measurement error is present under varying assessment conditions. When assessment results are reasonably consistent, there is less measurement error and greater reliability (Miller et al., 2013).
For purposes of understanding sources of inconsistency, it is helpful to view an assessment score as having two components, a true score and an error score, represented by the following equation:

X = T + E
A student’s actual assessment score (X) is also known as the observed or obtained score. That student’s hypothetical true score (T) cannot be measured directly because it is the average of all scores the student would obtain if tested on many occasions with the same test. The observed score contains a certain amount of measurement error (E), which may be a positive or a negative value. This error of measurement, representing the difference between the observed score and the true score, results in a student’s obtained score being higher or lower than his or her true score (Brookhart & Nitko, 2019). If it were possible to measure directly the amount of measurement error that occurred on each testing occasion, two of the values in this equation would be known (X and E), and we would be able to calculate the true score (T). However, we can only estimate indirectly the amount of measurement error, leaving us with a hypothetical true score. Therefore, teachers need to recognize that the obtained score on any test is only an estimate of what the student really knows about the domain being tested.
For example, Matt may obtain a higher score than Kelly on a community health nursing unit test because Matt truly knows more about the content than Kelly does. Test scores should reflect this kind of difference, and if the difference in knowledge is the only explanation for the score difference, no error is involved. However, there may be other potential explanations for the difference between Kelly’s and Matt’s test scores. Matt may have behaved dishonestly to obtain a copy of the test in advance; knowing which items would be included, he had the opportunity to use this unauthorized resource to determine the correct answers to those items. In his case, measurement error would have increased Matt’s obtained score. Kelly may have worked overtime the night before the test and may not have gotten enough sleep to allow her to feel alert during the test. Thus, her performance may have been affected by her fatigue and her decreased ability to concentrate, resulting in an obtained score lower than her true score. One goal of assessment designers, therefore, is to maximize the amount of score variance that explains real differences in ability and to minimize the amount of random error variance of scores.
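The effect of measurement error on score consistency can be demonstrated with a short simulation. In the hypothetical sketch below, each student's observed score on two testing occasions is a fixed true score plus random error; as the error component grows, the correlation between the two sets of observed scores (a test-retest reliability estimate) falls. All values are invented, and the sketch assumes Python 3.10 or later for statistics.correlation.

```python
# A minimal, hypothetical simulation of the X = T + E model: observed scores
# on two occasions are a fixed true score plus random measurement error.
import random
from statistics import correlation  # requires Python 3.10+

random.seed(7)
true_scores = [random.gauss(75, 10) for _ in range(200)]  # hypothetical true scores

for error_sd in (2, 8):
    occasion_1 = [t + random.gauss(0, error_sd) for t in true_scores]
    occasion_2 = [t + random.gauss(0, error_sd) for t in true_scores]
    r = correlation(occasion_1, occasion_2)
    print(f"Error SD = {error_sd}: test-retest reliability = {r:.2f}")

# Small error: observed scores mostly reflect true differences (high r).
# Large error: more of the score variance is random noise (lower r).
```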
The following points further explain the concept of assessment reliability (Brookhart & Nitko, 2019; Miller et al., 2013):
1. Reliability pertains to assessment results, not to the assessment instrument itself. The reliability of results produced by a given instrument will vary depending on the characteristics of the students being assessed and the circumstances under which it is used. Reliability should be estimated with each use of an assessment instrument.
2. A reliability estimate always refers to a particular type of consistency. Assessment results may be consistent over different periods of time, or different samples of the domain, or different raters or observers. It is possible for assessment results to be reliable in one or more of these respects but not in others. The desired type of reliability evidence depends on the intended use of the assessment results. For example, if the faculty wants to assess students’ ability to make sound clinical decisions in a variety of settings, a measure of consistency over time would not be appropriate. Instead, an estimate of consistency of performance across different tasks would be more useful.
3. A reliability estimate always is calculated with statistical indices. Consistency of assessment scores over time, among raters, or across different assessment measures involves determining the relationship between two or more sets of scores. The extent of consistency is expressed in terms of a reliability coefficient (a form of correlation coefficient) or a standard error of measurement (SEM); a brief sketch of both calculations follows this list. A reliability coefficient differs from a validity coefficient (described earlier) in that it is based on agreement between two sets of assessment results from the same procedure instead of agreement with an external criterion.
4. Reliability is an essential but insufficient condition for validity. Teachers cannot make valid inferences from inconsistent assessment results. Conversely, highly consistent results may indicate only that the assessment measured the wrong construct (although doing it very reliably). Thus, low reliability always produces a low degree of validity, but a high reliability estimate does not guarantee a high degree of validity. “In short, reliability merely provides the consistency that makes validity possible” (Miller et al., 2013, p. 110).
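As a brief sketch of the statistical indices mentioned in point 3, the example below computes a test-retest reliability coefficient from two invented sets of scores and then estimates the standard error of measurement with the common formula SEM = SD * sqrt(1 - reliability). The data are hypothetical, and the sketch assumes Python 3.10 or later for statistics.correlation.

```python
# A minimal sketch (hypothetical data): a test-retest reliability coefficient
# and the standard error of measurement (SEM) estimated from it.
from math import sqrt
from statistics import correlation, pstdev  # correlation requires Python 3.10+

first_administration  = [62, 70, 74, 78, 81, 84, 88, 91, 94, 97]
second_administration = [65, 68, 76, 75, 83, 82, 90, 89, 96, 95]

reliability = correlation(first_administration, second_administration)
sem = pstdev(first_administration) * sqrt(1 - reliability)

print(f"Test-retest reliability coefficient: {reliability:.2f}")
print(f"Standard error of measurement: {sem:.1f} points")

# Under the usual assumptions, a student's true score falls within about
# +/- 1 SEM of the obtained score roughly 68% of the time.
```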
An example may help to illustrate the relationship between validity and reliability. Suppose that the author of this chapter was given a test of her knowledge of assessment principles. The author of a textbook on assessment in nursing education might be expected to achieve a high score on such a test. However, if the test were written in Mandarin Chinese, the author’s score probably would be very low, even if she were a remarkably good guesser, because she cannot read Mandarin Chinese. If the same test were administered the following week, and every week for a month, her scores would likely be consistently low, assuming that she had not learned Mandarin Chinese in the intervals between tests. Therefore, these test scores would be considered reliable because there would be a high correlation among scores obtained on the same test over a period of several administrations. But a valid score-based interpretation of the author’s knowledge of assessment principles could not be drawn because the test was not appropriate for its intended use.
Figure 2.1 uses a target-shooting analogy to further illustrate these relationships. When they design and administer assessments, teachers attempt to consistently (reliably) measure the true value of what students know and can do (hit the bull’s-eye); if they succeed, they can make valid inferences from assessment results. Target 1 illustrates the reliability of scores that are closely grouped on the bull’s-eye, the true score, allowing the teacher to make valid inferences about them. Target 2 displays assessment scores that are widely scattered at a distance from the true score; these scores are not reliable, contributing to a lack of validity evidence. Target 3 shows assessment scores that are reliable because they are closely grouped together, but they are still distant from the true score. The teacher would not be able to make valid interpretations of such scores (Miller et al., 2013).