3

PLANNING FOR TESTING



It was Wednesday, and Paul Johnson was caught by surprise when he looked at his office calendar and realized that a test for the course he was teaching was only 1 week away, even though he was the person who had scheduled it! Thankful that he was not teaching this course for the first time, he searched his files for the test he had used last year. When he found it, his brief review showed that some of the content was outdated and that the test did not include items on the new content he had added this year. Because of a university policy that requires a minimum of 3 business days for the copy center to reproduce a test, Paul realized that he would have to finish the necessary revisions of the test and submit it for copying no later than Friday. He would be teaching in the clinical area on Thursday and teaching a class on Friday morning, and he was preparing to go out of town to attend a conference on Saturday.


He stayed up late on Wednesday night to revise the test, planning to proofread it on Thursday after he finished his clinical teaching responsibilities. But because of a family emergency, he was not able to proofread the test that night. Trusting that he had not made any serious clerical errors, he sent the test to the copy center before his class on Friday. When he returned to the office after his conference on Tuesday, he discovered that the copy center had been flooded during a severe storm over the weekend, and none of the photocopy jobs could be completed, including his test. Paul printed another copy of the test but could not take it anywhere else to be copied that day because of a scheduled committee meeting. To complicate matters, the department secretary had called in sick that day, and Paul could not change his childcare arrangements to allow him to stay late at the office to finish copying the test on the department machine. He came in very early on Wednesday morning to use the department photocopier, and finally finished the job just before the test was scheduled to begin.


With 5 minutes to spare, Paul rushed into the classroom and distributed the still-warm test booklets. As he was congratulating himself for meeting his deadline, the first student raised a hand with a question: “On item three, is there a typo?” Then another student said, “I don’t think that the correct answer for item six is there.” A third student complained, “Item nine is missing; the numbers jump from 8 to 10,” and a fourth student stated, “There are two d’s for item 10.” Paul knew that it was going to be a long morning. But the worst was yet to come. As they were turning in their tests, students complained, “This test didn’t cover the material that I thought it would cover,” and “We spent a lot of class time analyzing case studies, but we were tested on memorization of facts.” Needless to say, Paul did not look forward to the posttest discussion the following week.


Too often, teachers give little thought to the preparation of their tests until the last minute and then rush to get the job done. A test that is produced in this manner often contains items that are poorly chosen, ambiguous, and either too easy or too difficult, as well as grammatical, spelling, and other clerical errors. The solution lies in adequate planning for test construction before the item-writing phase begins, followed by careful critique of the completed test by other teachers. Exhibit 3.1 lists the steps of the test-construction process. This chapter describes the steps involved in planning for test construction; subsequent chapters will focus on the techniques of writing test items of various formats, assembling and administering the test, and analyzing the test results.


 






EXHIBIT 3.1 CHECKLIST FOR TEST CONSTRUCTION



    Define the purpose of the test.


    Describe the population to be tested.


    Determine the optimum length of the test.


    Specify the desired difficulty and discrimination levels of the test items.


    Determine the scoring procedure or procedures to be used.


    Select item formats to be used.


    Construct a test blueprint or table of specifications.


    Write the test items.


    Have the test items critiqued.


    Determine the arrangement of items on the test.


    Write specific directions for each item format.


    Write general directions for the test and prepare a cover sheet.


    Print or type the test.


    Proofread the test.


    Reproduce the test.


    Prepare a scoring key.


    Prepare students for taking the test.






 

Purpose and Population


All decisions involved in planning a test are based on a teacher’s knowledge of the purpose of the test and the relevant characteristics of the population of learners to be tested. The purpose of the test involves why it is to be given, what it is supposed to measure, and how the test scores will be used. For example, if a test is to be used to measure the extent to which students have met learning objectives to determine course grades, its primary purpose is summative. If the teacher expects the course grades to reflect real differences in the amount of knowledge among the students, the test must be sufficiently difficult to produce an acceptable range of scores. On the other hand, if a test is to be used primarily to provide feedback to staff nurses about their knowledge following a continuing education program, the purpose of the test is formative. If the results will not be used to make important personnel decisions, a large range of scores is not necessary, and the test items can be of moderate or low difficulty. In this chapter, the outcomes of learning are referred to as objectives, but as discussed in Chapter 1, Assessment and the Educational Process, many nurse educators refer to these as outcomes.


A teacher’s knowledge of the population that will be tested is useful in selecting the item formats to be used, determining the length of the test and the testing time required, and selecting the appropriate scoring procedures. The term population is not used here in its research sense, but rather to indicate the general group of learners who will be tested. The students’ reading levels, English-language literacy, visual acuity, health, and previous testing experience are examples of factors that might influence these decisions. For example, if the population to be tested is a group of five patients who have completed preoperative instruction for coronary artery bypass graft surgery, the teacher would probably not administer a test of 100 multiple-choice and matching items with a machine-scored answer sheet. However, this type of test might be most appropriate as a final course examination for a class of 75 senior nursing students.


Test Length


The length of the test is an important factor that is related to its purpose, the abilities of the students, the item formats to be used, the amount of testing time available, and the desired reliability of the test scores. As discussed in Chapter 2, Qualities of Effective Assessment Procedures: Reliability, Validity, and Usability, the reliability of test scores generally improves as the length of the assessment increases, so the teacher should attempt to include as many items as possible to adequately sample the content. However, if the purpose of the test is to measure knowledge of a small content domain with a limited number of objectives, fewer items will be needed to achieve an adequate sampling of the content.
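Although this chapter does not present the formula, the length–reliability relationship described above is commonly quantified with the Spearman–Brown prophecy formula. The brief sketch below (with purely illustrative numbers) estimates how score reliability changes when a test is lengthened by a given factor.

```python
def spearman_brown(current_reliability: float, length_factor: float) -> float:
    """Estimated reliability of a test whose length changes by length_factor
    (Spearman-Brown prophecy formula)."""
    return (length_factor * current_reliability) / (
        1 + (length_factor - 1) * current_reliability
    )

# Illustrative only: doubling a 25-item test with reliability 0.60
# to 50 comparable items raises the estimated reliability.
print(round(spearman_brown(0.60, 2.0), 2))  # 0.75
```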


It should be noted that assessment length refers to the number of test items or tasks, not to the amount of time it would take the student to complete the test. Items that require the student to analyze a complex data set, draw conclusions, and supply or choose a response take more test administration time; therefore, fewer items of those types can be included on a test to be completed in a fixed time period. When the number of complex assessment tasks to be included on a test is limited by test administration time, it is better to test more frequently than to create longer tests that test less important learning goals (Brookhart & Nitko, 2019; Miller, Linn, & Gronlund, 2013).


Because test length probably is limited by the scheduled length of a testing period, it is wise to construct the test so that the majority of the students working at their normal pace will be able to attempt to answer all items. This type of test is called a power test. A speeded test is one that does not provide sufficient time for all students to respond to all items. Although most standardized tests are speeded, this type of test generally is not appropriate for teacher-made tests in which accuracy rather than speed of response is important (Brookhart & Nitko, 2019; Miller et al., 2013).


Difficulty and Discrimination Level


The desired difficulty of a test and its ability to differentiate among various levels of performance are related considerations. Both factors are affected by the purpose of the test and the way in which the scores will be interpreted and used. The difficulty of individual test items affects the average test score; the mean score of a group of students is equal to the sum of the difficulty levels of the test items. The difficulty level of each test item depends on the complexity of the task, the ability of the students who answer it, and the quality of the teaching. It also may be related to the perceived complexity of the item; if students perceive the task as too difficult, they may skip it, resulting in a lower percentage of students who answer the item correctly (Brookhart & Nitko, 2019). See Chapter 12, Test and Item Analysis: Interpreting Test Results, for a more detailed discussion of item difficulty and discrimination.
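To make that relationship concrete, the short sketch below uses hypothetical scores (not data from the text), treating item difficulty as the proportion of students who answer the item correctly; summing those proportions reproduces the mean number-correct score.

```python
# Hypothetical scoring matrix (1 = correct, 0 = incorrect)
# for a 4-item quiz taken by 5 students.
responses = [
    [1, 1, 0, 1],  # student 1
    [1, 0, 0, 1],  # student 2
    [1, 1, 1, 1],  # student 3
    [0, 1, 0, 1],  # student 4
    [1, 1, 0, 0],  # student 5
]

num_students = len(responses)

# Item difficulty: proportion of students answering each item correctly
difficulties = [
    sum(student[item] for student in responses) / num_students
    for item in range(len(responses[0]))
]

mean_score = sum(sum(student) for student in responses) / num_students

print(difficulties)       # [0.8, 0.8, 0.2, 0.8]
print(sum(difficulties))  # 2.6 -- equals the mean number-correct score
print(mean_score)         # 2.6
```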


If test results are to be used to determine the relative achievement of students (i.e., norm-referenced interpretation), the majority of items on the test should be moderately difficult. The recommended difficulty level for selection-type test items depends on the number of choices allowed. The percentage of students who answer each item correctly should be about midway between 100% and the chance of guessing correctly (e.g., 50% for true–false items, 25% correct for four-alternative multiple-choice items). For example, a moderately difficult true–false item should be answered correctly by 75% to 85% of students (Waltz, Strickland, & Lenz, 2010). When the majority of items on a test are too easy or too difficult, they will not discriminate well between students with varying levels of knowledge or ability.
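Applied directly, that rule of thumb gives the following target difficulty values (a minimal sketch; the option counts are the examples mentioned above):

```python
def midway_difficulty(num_options: int) -> float:
    """Target proportion correct: halfway between chance-level guessing and 100%."""
    chance = 1 / num_options
    return (1.0 + chance) / 2

print(midway_difficulty(2))  # 0.75  -> true-false: about 75% should answer correctly
print(midway_difficulty(4))  # 0.625 -> four-option multiple choice: about 62.5%
```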


However, if the teacher wants to make criterion-referenced judgments, more commonly used in nursing education and practice settings, the overall concern is whether a student’s performance meets a set standard rather than the actual score itself. If the purpose of the assessment is to screen out the least capable students (e.g., those failing a course), it should be relatively easy for most test-takers. However, comparing performance to a set standard does not limit assessment to testing of lower-level knowledge and ability; considerations of assessment validity should guide the teacher to construct tests that adequately sample the knowledge or performance domain.


When criterion-referenced test results are reported as percentage scores, their variability (range of scores) may be similar to norm-referenced test results, but the interpretation of the range of scores would be narrower. For example, on a final exam in a nursing course, the potential score range may be 0% to 100%, but the passing score is set at 80%. Even if there is wide variability of scores on the exam, the primary concern is whether the test correctly classifies each student as performing above or below the standard (e.g., 80%). In this case, the teacher should examine the difficulty level of test items and compare them between groups (students who met the standard and students who did not). If item difficulty levels indicate a relatively easy or relatively difficult exam, criterion-referenced decisions will still be appropriate if the measure consistently classifies students according to the performance standard (Miller et al., 2013; Waltz et al., 2010).
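As an illustration of the comparison suggested above, the sketch below uses entirely hypothetical scores and an assumed 80% passing standard to compute each item’s difficulty separately for students who met the standard and for those who did not.

```python
# Hypothetical results: each record is (percentage score, item responses),
# where 1 = correct and 0 = incorrect. The passing standard is assumed to be 80%.
results = [
    (92, [1, 1, 1]),
    (85, [1, 1, 0]),
    (81, [1, 0, 1]),
    (74, [1, 0, 0]),
    (66, [0, 0, 1]),
]

PASSING = 80

def item_difficulty(group, item):
    """Proportion of the group answering the item correctly."""
    return sum(responses[item] for _, responses in group) / len(group)

met = [r for r in results if r[0] >= PASSING]
not_met = [r for r in results if r[0] < PASSING]

for item in range(3):
    print(f"Item {item + 1}: met standard {item_difficulty(met, item):.2f}, "
          f"did not meet {item_difficulty(not_met, item):.2f}")
```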


It is important to keep in mind that the difficulty level of test items can only be estimated in advance, depending on the teacher’s experience in testing this content and knowledge of the abilities of the students to be tested. When the test has been administered and scored, the actual difficulty index for each item can be compared with the expected difficulty, and items can be revised if the actual difficulty level is much lower or much higher than anticipated (Waltz et al., 2010). Procedures for determining how the test items actually perform are discussed in Chapter 12, Test and Item Analysis: Interpreting Test Results.


Item Formats


Some students may be particularly adept at answering essay items; others may prefer multiple-choice items. However, tests should be designed to provide information about students’ knowledge or abilities, not about their skill in taking certain types of tests. A test with a variety of item formats provides students with multiple ways to demonstrate their competence (Brookhart & Nitko, 2019). All item formats have their advantages and limitations, which are discussed in later chapters.


Selection Criteria for Item Formats


Teachers should select item formats for their tests based on a variety of factors, such as the learning outcomes to be evaluated, the specific skill to be measured, and the ability level of the students. Some objectives are better measured with certain item formats. For example, if the instructional objective specifies that the student will be able to “discuss the comparative advantages and disadvantages of breast and bottle-feeding,” a multiple-choice item would be inappropriate because it would not allow the teacher to evaluate the student’s ability to organize and express ideas on this topic. An essay item would be a better choice for this purpose. Essay items provide opportunities for students to formulate their own responses, drawing on prior learning, and to express their ideas in writing; these often are desired outcomes of nursing education programs.


The teacher’s time constraints for constructing the test may affect the choice of item format. In general, essay items take less time to write than multiple-choice items, but they are more difficult and time-consuming to score. A teacher who has little time to prepare a test and therefore chooses an essay format, assuming that this choice is also appropriate for the objectives to be tested, must plan for considerable time after the test is given to score it.


In nursing education programs, faculty members often develop multiple-choice items as the predominant, if not exclusive, item format because for a number of years, licensure and certification examinations contained only multiple-choice items. Although this type of test item provides essential practice for students in preparation for taking such high-stakes examinations, exclusive use of the format contradicts the principle of selecting the most appropriate type of test item for the outcome and content to be evaluated. In addition, it limits variety in testing and creativity in evaluating student learning. Although practice with multiple-choice items is critical, other types of test items and evaluation strategies also are appropriate for measuring student learning in nursing. In fact, although many of the NCLEX® (National Council Licensure Examination) examination items are four-option multiple-choice items, the item pools now contain other formats such as completion and multiple response (National Council of State Boards of Nursing, 2019). Nurse educators should not limit their selection of item formats based on the myth that learners must be tested exclusively with the item format most frequently used on a licensure or certification test.


On the other hand, each change of item format on a test requires a change of task for students. Therefore, the number of different item formats to include on a test also depends on the length of the test and the level of the learner. It is generally recommended that teachers use no more than three item formats on a test. Shorter assessments, such as a 10-item quiz, may be limited to a single item format.


Objectively and Subjectively Scored Items


Another powerful and persistent myth is that some item formats evaluate students more objectively than do other formats. Although it is common to describe true–false, matching, and multiple-choice items as “objective,” objectivity refers to the way items are scored, not to the type of item or their content (Miller et al., 2013). Objectivity means that once the scoring key is prepared, it is possible for multiple teachers on the same occasion or the same teacher on multiple occasions to arrive at the same score. Subjectively scored items, like essay items (and short-answer items, to a lesser extent), require the judgment of the scorer to determine the degree of correctness and therefore are subject to more variability in scoring.
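A small example makes the point about objectivity concrete (the key and responses below are hypothetical): once the scoring key is fixed, any scorer, or any program, applying it arrives at the same score.

```python
# Hypothetical scoring key and one student's answers for a 5-item
# multiple-choice test. With the key fixed, every scorer applying it
# produces the same result -- this is what objective scoring means.
key = ["b", "d", "a", "c", "b"]
student_answers = ["b", "d", "c", "c", "b"]

score = sum(1 for answer, correct in zip(student_answers, key) if answer == correct)
print(f"{score} / {len(key)}")  # 4 / 5
```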


Selected-Response and Constructed-Response Items


Another way of classifying test items is to identify them by the type of response required of the test-taker (Miller et al., 2013; Waltz et al., 2010). Selected-response (or “choice”) items require the test-taker to select the correct or best answer from among options provided by the teacher. In this category, the item formats are true–false, matching exercises, and multiple-choice. Constructed-response (or “supply”) formats require the learner to supply an answer, and may be classified further as limited response (or short response) and extended response. These are the short-answer and essay formats. Exhibit 3.2 depicts this schema for classifying test-item formats and the variations of each type.


Scoring Procedures


Decisions about what scoring procedure or procedures to use are somewhat dependent on the choice of item formats. Student responses to short-answer, numerical calculation, and essay items, for instance, usually must be hand-scored, whether they are recorded directly on the test itself, on a separate answer sheet, or in a booklet. Answers to objective test items such as multiple choice, true–false, and matching also may be recorded on the test itself or on a separate answer sheet. Scannable answer sheets greatly increase the speed of objective scoring procedures and have the additional advantage of allowing computer-generated item analysis reports to be produced. The teacher should decide if the time and resources available for scoring a test suggest that hand scoring or electronic scoring would be preferable. In any case, this decision alone should not influence the choice of test-item format.


 






EXHIBIT 3.2 CLASSIFICATION OF TEST ITEMS BY TYPE OF RESPONSE



SELECTED-RESPONSE ITEM FORMATS (“CHOICE” ITEMS)

    True–false

    Matching exercises

    Multiple choice

    Multiple response


CONSTRUCTED-RESPONSE ITEM FORMATS (“SUPPLY” ITEMS)

    Short answer

    Completion or fill in the blank

    Restricted-response essay

    Extended-response essay



