Chapter 10

Clarifying Measurement and Data Collection in Quantitative Research

Key Terms

Accuracy
Accuracy of a screening test
Administrative data
Alternate forms reliability
Construct validity
Content validity
Data collection
Direct measures
Equivalence
Error in physiological measures
Evidence of validity from contrasting groups
Evidence of validity from convergence
Evidence of validity from divergence
False negative
False positive
Gold standard
Highly sensitive test
Highly specific test
Homogeneity
Indirect measures, or indicators
Internal consistency
Interrater reliability
Interval-level measurement
Interview
Levels of measurement
Likelihood ratios
Likert scale
Measurement
Measurement error
Negative likelihood ratio
Nominal-level measurement
Observational measurement
Ordinal-level measurement
Physiological measures
Positive likelihood ratio
Precision
Primary data
Questionnaire
Random measurement error
Rating scales
Ratio-level measurement
Readability level
Reliability
Reliability testing
Scale
Secondary data
Sensitivity
Specificity
Stability
Structured interview
Structured observational measurement
Systematic measurement error
Test-retest reliability
True measure or score
True negative
True positive
Unstructured interview
Unstructured observations
Validity
Visual analog scale


Measurement is an essential part of the quantitative research process. Using quality measurement methods in a study improves the accuracy or validity of the study outcomes or findings. Measurement is the process of assigning numbers or values to individuals’ health status, objects, events, or situations using a set of rules (Kaplan, 1963). For example, we measure a patient’s blood pressure (BP) using a measurement method such as a stethoscope, cuff, and sphygmomanometer, and a number or value is then assigned to that patient’s BP, such as 120/80 mm Hg. In research, variables are measured with the best measurement method available to produce trustworthy data that can be used in statistical analyses. Trustworthy data are essential if a study is to produce useful findings to guide nursing practice (Brown, 2014; Fawcett & Garity, 2009).


In critically appraising studies, you need to judge the trustworthiness of the measurement methods used. To produce trustworthy measurements, rules have been established to ensure that numbers, values, or categories will be assigned consistently from one subject (or event) to another and, eventually, if the measurement method or strategy is found to be meaningful, from one study to another. The rules of measurement established for research are similar to those used in nursing practice. For example, measuring a BP requires that the patient rest for 5 minutes and then sit with legs uncrossed and the arm relaxed on a table at heart level; a cuff of the correct size is placed on the upper arm, which is free of restrictive clothing, and the stethoscope is placed over the brachial artery at the elbow. Following these rules ensures that the patient’s BP is measured accurately and precisely and that any change in the BP reading can be attributed to a change in BP rather than to an inadvertent error in measurement technique.


Understanding the logic of measurement is important for critically appraising the adequacy of measurement methods in nursing studies. This chapter includes a discussion of the key concepts of measurement theory—directness of measurement, levels of measurement, measurement error, reliability, and validity. The accuracy and precision of physiological measures and sensitivity and specificity of diagnostic and screening tests are also addressed. Some of the most common measurement methods or strategies used in nursing research are briefly described. The chapter concludes with guidelines for critically appraising the data collection processes used in studies.



Concepts of Measurement Theory


Measurement theory guides the development and use of measurement methods or tools in research. Measurement theory was developed many years ago by mathematicians, statisticians, and other scholars and includes rules that guide how things are measured (Kaplan, 1963). These rules allow individuals to be consistent in how they perform measurements; thus a measurement method used by one person will consistently produce similar results when used by another person. This section discusses some of the basic concepts and rules of measurement theory, including directness of measurement, levels of measurement, measurement error, reliability, and validity.



Directness of Measurement


To measure, the researcher must first identify the object, characteristic, element, event, or situation to be measured. In some cases, identifying the object to measure and determining how to measure it are quite simple, such as when the researcher measures a person’s weight and height. These are referred to as direct measures. Direct measures involve determining the value of concrete factors such as weight, waist circumference, temperature, heart rate, BP, and respiration. Technology is available to measure many bodily functions, biological indicators, and chemical characteristics. The focus of measurement in these instances is on the accuracy and precision of the measurement method and process. For a patient’s BP reading to be accurate, it must be taken with a quality stethoscope and sphygmomanometer and measured precisely and consistently, as discussed in the introduction. In research, three BP measurements are usually taken and averaged to determine the most accurate and precise BP reading. Nurse researchers are also experienced in gathering direct measures of demographic variables such as age, gender, ethnic origin, and diagnosis.


However, in many cases in nursing, the thing to be measured is not a concrete object but an abstract idea, characteristic, or concept such as pain, stress, caring, coping, depression, anxiety, or adherence. Researchers cannot directly measure an abstract idea, but they can capture some of its elements in their measurements, which are referred to as indirect measures or indicators of the concepts. Rarely, if ever, can a single measurement strategy measure all aspects of an abstract concept. Therefore multiple measurement methods or indicators are needed, and even then they cannot be expected to measure all elements of an abstract concept. For example, multiple measurement methods might be used to describe pain in a study, which decreases the measurement error and increases the understanding of pain. The measurement methods of pain might include the FACES Pain Scale; observation (rubbing and/or guarding the area that hurts, facial grimacing, and crying); and physiological measures, such as pulse and blood pressure. Figure 10-1 illustrates multiple measures of the concept of pain and shows how using more measurement methods increases understanding of the concept. The bold, black-rimmed largest circle represents the concept of pain, and the pale-colored smaller circles represent the measurement methods. The circle for the physiological measures is larger, indicating that these measures (pulse, blood pressure, and respirations) add more to the objective measurement of pain. Even with three different types of measurement methods in use, the entire concept of pain is not completely measured, as indicated by the white areas within the black-rimmed large circle.




Levels of Measurement


Various measurement methods produce data that are at different levels of measurement. The traditional levels of measurement were developed by Stevens (1946), who organized the rules for assigning numbers to objects so that a hierarchy in measurement was established. The levels of measurement, from low to high, are nominal, ordinal, interval, and ratio.



Nominal-Level Measurement


Nominal-level measurement is the lowest of the four measurement categories. It is used when data can be organized into categories of a defined property but the categories cannot be rank-ordered. For example, you may decide to categorize potential study subjects by diagnosis. However, the category “kidney stone,” for example, cannot be rated higher than the category “gastric ulcer”; similarly, across categories, “ovarian cyst” is no closer to “kidney stone” than to “gastric ulcer.” The categories differ in quality but not quantity. Therefore, it is not possible to say that subject A possesses more of the property being categorized than subject B. (Rule: The categories must not be orderable.) Categories must be established in such a way that each datum will fit into only one of the categories. (Rule: The categories must be exclusive.) All the data must fit into the established categories. (Rule: The categories must be exhaustive.) Data such as gender, race and ethnicity, marital status, and diagnoses are examples of nominal data. The rules for the four levels of measurement are summarized in Figure 10-2.




Ordinal-Level Measurement


With ordinal-level measurement, data are assigned to categories that can be ranked. (Rule: The categories can be ranked [see Figure 10-2].) To rank data, one category is judged to be (or is ranked) higher or lower, or better or worse, than another category. Rules govern how the data are ranked. As with nominal data, the categories must be exclusive (each datum fits into only one category) and exhaustive (all data fit into at least one category). With ordinal data, the quantity also can be identified (Stevens, 1946). For example, if you are measuring intensity of pain, you may identify different levels of pain. You probably will develop categories that rank these different levels of pain, such as excruciating, severe, moderate, mild, and no pain. However, in using categories of ordinal measurement, you cannot know with certainty that the intervals between the ranked categories are equal. A greater difference may exist between mild and moderate pain, for example, than between excruciating and severe pain. Therefore ordinal data are considered to have unequal intervals.


Many scales used in nursing research are at the ordinal level of measurement. For example, it is possible to rank degrees of coping, levels of mobility, ability to provide self-care, or levels of dyspnea on an ordinal scale. For dyspnea with activities of daily living (ADLs), the scale could be ranked as follows:

0 = no dyspnea
1 = dyspnea with strenuous activity
2 = dyspnea with moderate activity
3 = dyspnea with minimal activity
4 = dyspnea at rest

The measurement is ordinal because it is not possible to claim that equal distances exist between the rankings. A greater difference may exist between the ranks of 1 and 2 than between the ranks of 2 and 3.
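
For readers who manage study data in software, the following minimal sketch (Python with the pandas library; the category labels are illustrative, not drawn from a published instrument) shows how nominal data are stored as unordered categories while ordinal data carry a defined rank order:

    import pandas as pd

    # Nominal: categories of a defined property that cannot be rank-ordered.
    diagnoses = pd.Categorical(
        ["kidney stone", "gastric ulcer", "ovarian cyst"], ordered=False
    )

    # Ordinal: ranked categories; the ranks are ordered, but the intervals
    # between them are not assumed to be equal.
    dyspnea = pd.Categorical(
        ["moderate", "none", "severe"],
        categories=["none", "mild", "moderate", "severe"],
        ordered=True,
    )

    print(dyspnea.min(), dyspnea.max())  # ranking is defined: none, severe
    # diagnoses.min() would raise a TypeError: nominal categories have no order.

Because the dyspnea categories are declared ordered, they can be ranked and compared, but nothing in the encoding claims equal spacing between ranks; that assumption is reserved for interval- and ratio-level data.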



Interval-Level Measurement


Interval-level measurement uses interval scales, which have equal numerical distances between intervals. These scales follow the rules of mutually exclusive, exhaustive, and ranked categories and are assumed to represent a continuum of values. (Rule: The categories must have equal intervals between them [see Figure 10-2].) Therefore the magnitude of the attribute can be more precisely defined. However, it is not possible to provide the absolute amount of the attribute, because the interval scale lacks a zero point. Temperature is the most commonly used example of an interval scale. The difference between the temperatures of 70° F and 80° F is 10° F and is the same as the difference between the temperatures of 30° F and 40° F. Changes in temperature can be measured precisely. However, a temperature of 0° F does not indicate the absence of temperature.



Ratio-Level Measurement


Ratio-level measurement is the highest form of measurement and meets all the rules of other forms of measurement—mutually exclusive categories, exhaustive categories, ordered ranks, equally spaced intervals, and a continuum of values. Interval- and ratio-level data can be added, subtracted, multiplied, and divided because of the equal intervals and continuum of values of these data. Thus interval and ratio data can be analyzed with statistical techniques of greater precision and strength to determine significant relationships and differences (Grove, 2007). In addition, ratio-level measures have absolute zero points. (Rule: The data must have absolute zero [see Figure 10-2].) Weight, length, and volume are commonly used as examples of ratio scales. All three have absolute zeros, at which a value of zero indicates the absence of the property being measured; zero weight means the absence of weight. Because of the absolute zero point, such statements as “Subject A weighs 25 more pounds than subject B” or “Medication container A holds two times as much as container B” can be justified (Stevens, 1946).


In critically appraising a study, you need to determine the level of measurement achieved for each measurement method. Researchers try to achieve the highest level of measurement possible for a variable because more rigorous statistical analyses can be conducted on interval- and ratio-level data to describe variables, determine relationships among variables, and examine differences among groups.



Measurement Error


The ideal, perfect measure is referred to as the true measure or score. However, some error is always present in any measurement strategy. Measurement error is the difference between the true measure and what is actually measured (Grove, Burns, & Gray, 2013). The amount of error in a measure varies from considerable error in one measurement to very little in another. Measurement error exists with both direct and indirect measures. With direct measures, both the object and measurement method are visible. Direct measures, which generally are expected to be highly accurate, are still subject to error. For example, a weight scale may be inaccurate by 0.5 pound, precisely calibrated BP equipment might lose precision with use, or a tape measure may not be held at exactly the same tension in measuring the waist of each patient. A subject in a study may be 65 years old but may write illegibly on the demographic form; as a result, the age may be entered inaccurately into the study database.


With indirect measures, the element being measured cannot be seen directly. For example, you cannot see pain. You may observe behaviors or hear words that you think represent pain, but pain is a sensation that is not always clearly recognized or expressed by the person experiencing it. The measurement of pain is usually conducted with a scale but can also include observation and physiological measures as shown in Figure 10-1. Efforts to measure concepts such as pain usually result in measuring only part of the concept. Sometimes measures may identify some aspects of the concept but may include other elements that are not part of the concept. In Figure 10-1, the measurement methods of scale, observation, and physiological measures include factors other than pain, as indicated by the parts of the circles that are outside the black-rimmed circle of the concept pain. For example, measurement methods for pain might be measuring aspects of anxiety and fear in addition to pain. However, using multiple methods to measure a concept or variable usually decreases the measurement error and increases the understanding of the concept being measured.


Two types of error are of concern in measurement, random error and systematic error. The difference between random and systematic error is in the direction of the error. In random measurement error, the difference between the measured value and the true value is without pattern or direction (random). In one measurement, the actual value obtained may be lower than the true value, whereas in the next measurement, the actual value obtained may be higher than the true value. A number of chance situations or factors can occur during the measurement process that can result in random error (Waltz, Strickland, & Lenz, 2010). For example, the person taking the measurements may not use the same procedure every time, a subject completing a paper and pencil scale may accidentally mark the wrong column, or the person entering the data into a computer may punch the wrong key. The purpose of measuring is to estimate the true value, usually by combining a number of values and calculating an average. An average value, such as the mean, is a closer estimate of the true measurement. As the number of random errors increases, the precision of the estimate decreases.


Measurement error that is not random is referred to as systematic error. In systematic measurement error, the variation in measurement values from the calculated average is primarily in the same direction. For example, most of the variation may be higher or lower than the average that was calculated. Systematic error occurs because something else is being measured in addition to the concept. For example, a paper and pencil rating scale designed to measure hope may actually also be measuring perceived support. When measuring subjects’ weights, a scale that shows weights that are 2 pounds over the true weights will give measures with systematic error. All the measured weights will be high, and as a result the mean will be higher than if an accurate weight scale were used. Some systematic error occurs in almost any measure. Because of the importance of this type of error in a study, researchers spend considerable time and effort refining their instruments to minimize systematic measurement error (Waltz et al., 2010).
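
The distinction can be illustrated with a small simulation (a sketch in Python with NumPy; the true weight, error size, and 2-pound bias below are invented for demonstration). Averaging repeated measurements washes out random error but leaves systematic error untouched:

    import numpy as np

    rng = np.random.default_rng(seed=1)
    true_weight = 150.0  # hypothetical true score, in pounds

    # Random error: deviations have no direction, so the mean of repeated
    # measurements converges toward the true score.
    random_readings = true_weight + rng.normal(loc=0.0, scale=0.5, size=30)

    # Systematic error: a scale that reads 2 pounds high biases every
    # measurement in the same direction; averaging cannot remove the bias.
    biased_readings = true_weight + 2.0 + rng.normal(loc=0.0, scale=0.5, size=30)

    print(f"mean with random error only: {random_readings.mean():.2f}")  # near 150
    print(f"mean with systematic error:  {biased_readings.mean():.2f}")  # near 152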


In critically appraising a published study, you will not be able to judge the extent of measurement error directly. However, you may find clues about the amount of measurement error in the published report. For example, if the researchers have described the method of measurement in great detail and provided evidence of accuracy and precision of the measurement, then the probability of error typically is reduced. The measurement errors for BP readings can be minimized by checking the BP cuff and sphygmomanometer for accuracy and recalibrating them periodically during data collection, obtaining three BP readings and averaging them to determine one BP reading for each subject, and having a trained nurse using a protocol to take the BP readings. If a checklist of pain behaviors is developed for observation, less error occurs than if the observations for pain are unstructured. Measurement will also be more precise if researchers use a well-developed, reliable, and valid scale, such as the FACES Pain Scale, instead of developing a new pain scale for their study. In published studies, look for the steps that researchers have taken to decrease measurement error and increase the quality of their study findings.



Reliability


Reliability is concerned with the consistency of a measurement method. For example, if you are using a paper and pencil scale to measure depression, it should indicate similar depression scores each time a subject completes it within a short period of time. A scale that does not produce similar scores for a subject with repeat testing is considered unreliable and results in increased measurement error (Kerlinger & Lee, 2000; Waltz et al., 2010). For example, the Center for Epidemiologic Studies Depression Scale (CES-D) was developed to diagnose depression in mental health patients (Radloff, 1977). The CES-D has proven to be a quality measure of depression in research over the last 40 years. Figure 10-3 illustrates this 20-item Likert scale. If the items on this scale consistently measure what it was developed to measure, depression, then this scale is considered to be both reliable and valid. The different types of reliability and validity testing are discussed in the next sections (outlined in Table 10-1).



Table 10-1

Determining the Quality of Measurement Methods

Reliability
- Test-retest reliability: Repeated measures with a scale or instrument to determine the consistency or stability of the instrument in measuring a concept.
- Alternate forms reliability: Comparison of two paper and pencil instruments to determine their equivalence in measuring a concept.
- Interrater reliability: Comparison of two observers or judges in a study to determine their equivalence in making observations or judging events.
- Homogeneity or internal consistency reliability: Reliability testing used primarily with multi-item scales in which each item on the scale is correlated with all other items to determine the consistency of the scale in measuring a concept.

Validity
- Content validity: Examines the extent to which a measurement method includes all the major elements relevant to the concept being measured.
- Evidence of validity from contrasting groups: Instrument or scale given to two groups expected to have opposite or contrasting scores; one group scores high on the scale and the other scores low.
- Evidence of validity from convergence: Two scales measuring the same concept are administered to a group at the same time, and the subjects’ scores on the scales should be positively correlated. For example, subjects completing two scales to measure depression should have positively correlated scores.
- Evidence of validity from divergence: Two scales that measure opposite concepts, such as hope and hopelessness, are administered to subjects at the same time and should result in negatively correlated scores on the scales.

Readability
- Readability level: Conducted to determine the participants’ ability to read and comprehend the items on an instrument. Researchers need to report the level of education that subjects need to read the instrument. Readability must be appropriate to promote reliability and validity of an instrument.

Precision
- Precision of physiological measure: Degree of consistency or reproducibility of the measurements made with physiological instruments or equipment; comparable to reliability for paper and pencil scales.

Accuracy
- Accuracy of physiological measure: Addresses the extent to which the physiological instrument or equipment measures what it is supposed to measure in a study; comparable to validity for paper and pencil scales.



Reliability Testing


Reliability testing is a measure of the amount of random error in the measurement technique. It takes into account such characteristics as dependability, precision, stability, consistency, and reproducibility (Grove et al., 2013; Waltz et al., 2010). Because all measurement techniques contain some random error, reliability exists in degrees and usually is expressed as a correlation coefficient (r). Cronbach’s alpha coefficient is the most commonly used measure of reliability for scales with multiple items (see the following discussion of homogeneity for more details). Estimates of reliability are specific to the sample being tested. Thus high reliability values reported for an established instrument do not guarantee that reliability will be satisfactory in another sample or with a different population. Researchers need to perform reliability testing on each instrument used in a study to ensure that it is reliable for that study (Bialocerkowski, Klupp, & Bragge, 2010; DeVon et al., 2007).


Reliability testing focuses on the following three aspects of reliability—stability, equivalence, and homogeneity (see Table 10-1). Stability is concerned with the consistency of repeated measures of the same attribute with the use of the same scale or instrument. It is usually referred to as test-retest reliability. This measure of reliability is generally used with physical measures, technological measures, and paper and pencil scales. Use of the technique requires an assumption that the factor to be measured remains the same at the two testing times and that any change in the value or score is a consequence of random error. For example, physiological measures such as BP equipment can be tested and then immediately retested, or the equipment can be used for a time and then retested to determine the necessary frequency of recalibration. Researchers need to include test-retest reliability results in their published studies to document the reliability of their measurement methods. For example, the CES-D (see Figure 10-3) has been used frequently in nursing studies over the years and has demonstrated test-retest reliability ranging from r = 0.51 to 0.67 in 2- to 8-week intervals. This is very solid test-retest reliability for this scale, indicating that it is consistently measuring depression with repeat testing and recognizing that subjects’ levels of depression vary somewhat over time (Locke & Putnam, 2002; Sharp & Lipsky, 2002).
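
As a simple illustration of how stability is quantified, a test-retest coefficient is the correlation between the same subjects’ scores at two administrations (a sketch in Python; the scores below are invented, not actual CES-D data):

    import numpy as np

    time1 = np.array([18, 22, 9, 30, 14, 25, 11, 20])   # scores at first testing
    time2 = np.array([16, 24, 11, 28, 15, 23, 10, 22])  # same subjects, retested

    r = np.corrcoef(time1, time2)[0, 1]
    print(f"test-retest r = {r:.2f}")  # values closer to 1.0 indicate more stability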


Reliability testing can also include equivalence, which involves the comparison of two versions of the same paper and pencil instrument or of two observers measuring the same event. Comparison of two observers or two judges in a study is referred to as interrater reliability. Studies that include collecting observational data or making judgments by two or more data gatherers require the reporting of interrater reliability. There is no absolute value below which interrater reliability is unacceptable, but any value below 0.80 should generate serious concern about the reliability of the data, the data gatherer, or both. Ideally, the interrater reliability value is 0.90 (90%) or higher, which means 90% reliability and 10% random error.
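
One common way to express interrater reliability is the proportion of observations on which two raters agree, as in this sketch (Python; the ratings are invented, with 1 indicating that a pain behavior was observed). Chance-corrected statistics such as Cohen’s kappa are often reported as well:

    rater_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # observer A's codes for 10 events
    rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # observer B's codes, same events

    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    proportion = agreements / len(rater_a)
    print(f"interrater agreement = {proportion:.2f}")  # 0.90 meets the benchmark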


Comparison of two paper and pencil instruments is referred to as alternate forms reliability, or parallel forms reliability. Alternate forms of instruments are of more concern in the development of normative knowledge testing, such as the Scholastic Aptitude Test (SAT), which is used as a college entrance requirement. The SAT has been used for decades, and there are many forms of this test, with a variety of items included on each. These alternate forms of the SAT were developed to measure students’ knowledge consistently and to protect the integrity of the test.


Homogeneity is a type of reliability testing used primarily with paper and pencil instruments or scales to address the correlation of each question to the other questions within the scale. Questions on a scale are also called items. The principle is that each item should be consistently measuring a concept such as depression and so should be highly correlated with the other items. Homogeneity testing examines the extent to which all the items in the instrument consistently measure the construct and is a test of internal consistency. The statistical procedure used for this process is Cronbach’s alpha coefficient for interval- and ratio-level data. On some scales, the person responding selects between two options, such as yes and no. The resulting data are dichotomous, and the Kuder-Richardson formula (K-R 20) is used to estimate internal consistency. A Cronbach alpha coefficient of 1.00 indicates perfect reliability, and a coefficient of 0.00 indicates no reliability (Waltz et al., 2010). A reliability of 0.80 is usually considered a strong coefficient for a scale that has documented reliability and has been used in several studies, such as the CES-D (Grove et al., 2013; Radloff, 1977). The CES-D has strong internal consistency reliability, with Cronbach’s alphas ranging from 0.84 to 0.90 in field studies (Locke & Putnam, 2002; Sharp & Lipsky, 2002). For relatively new scales, a reliability of 0.70 is considered acceptable because the scale is being refined and used with a variety of samples. The stronger correlation coefficients, which are closer to 1.0, indicate less random error and a more reliable scale. A research report needs to include the results from stability, equivalence, and/or homogeneity reliability testing done on a measurement method from previous research and in the present study (Grove et al., 2013). A measurement method must be reliable if it is to be considered a valid measure for a study concept.
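
Cronbach’s alpha has a compact formula: with k items, alpha = (k / (k − 1)) × (1 − sum of the item variances / variance of the total scores). The following sketch computes it from an invented subjects-by-items matrix (Python; the responses are illustrative 0-to-3 Likert ratings, not real scale data):

    import numpy as np

    # Rows are subjects, columns are the k items of a scale.
    scores = np.array([
        [2, 3, 2, 3, 2],
        [0, 1, 0, 1, 1],
        [3, 3, 2, 3, 3],
        [1, 1, 1, 2, 1],
        [2, 2, 3, 2, 2],
        [0, 0, 1, 0, 0],
    ])

    k = scores.shape[1]
    sum_item_vars = scores.var(axis=0, ddof=1).sum()  # each item's variance, summed
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of subjects' totals
    alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")
    # For dichotomous (yes/no) items, the same computation yields the K-R 20 value.

Because every item in this invented matrix rises and falls together across subjects, alpha is high; items that do not track the others would pull the coefficient down.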



Validity


The validity of an instrument is a determination of how well the instrument reflects the abstract concept being examined. Validity, like reliability, is not an all or nothing phenomenon; it is measured on a continuum. No instrument is completely valid, so researchers determine the degree of validity of an instrument rather than whether validity exists (DeVon et al., 2007; Waltz et al., 2010). Validity will vary from one sample to another and one situation to another; therefore validity testing evaluates the use of an instrument for a specific group or purpose, rather than the instrument itself. An instrument may be valid in one situation but not another. For example, the CES-D was developed to measure the depression of patients in mental health settings. Will the same scale be valid as a measure of the depression of cancer patients? Researchers determine this by pilot-testing the scale to examine the validity of the instrument in a new population. In addition, the original CES-D (see Figure 10-3) was developed for adults, but the scale has been refined and tested with young children (4 to 6 years of age), school-age children, adolescents, and older adults. Thus different versions of this scale can be used with those of all ages, ranging from 4 years old to geriatric age (Sharp & Lipsky, 2002).


In this text, validity is considered a single broad method of measurement evaluation, referred to as construct validity, and includes content and predictive validity (Rew, Stuppy, & Becker, 1988). Content validity examines the extent to which the measurement method or scale includes all the major elements or items relevant to the construct being measured. The evidence for content validity of a scale includes the following: (1) how well the items of the scale reflect the description of the concept in the literature; (2) the content experts’ evaluation of the relevance of items on the scale that might be reported as an index (Grove et al., 2013); and (3) the potential subjects’ responses to the scale items.


Paper and pencil and electronic instruments or scales must be at a level that potential study subjects can read and understand. Readability level focuses on the study participants’ ability to read and comprehend the content of an instrument or scale. Readability is essential if an instrument is to be considered valid and reliable for a sample (see Table 10-1). Assessing the level of readability of an instrument is relatively simple and takes about 10 to 15 minutes. More than 30 readability formulas are available. These formulas use counts of language elements to provide an index of the probable degree of difficulty of comprehending the scale (Grove et al., 2013). Readability formulas are now a standard part of word-processing software.
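
As one example of how these formulas work, the Flesch-Kincaid grade level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below (Python) uses a crude vowel-group heuristic to count syllables, which is sufficient only for illustration; word processors and dedicated tools use more refined counting rules, and the sample sentence is invented:

    import re

    def count_syllables(word: str) -> int:
        # Crude heuristic: each run of vowels approximates one syllable.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text: str) -> float:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

    item = "I felt sad and could not cheer up, even with help from my family."
    print(f"approximate grade level: {flesch_kincaid_grade(item):.1f}")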


Three common types of validity presented in published studies include evidence of validity from (1) contrasting groups, (2) convergence, and (3) divergence. An instrument’s evidence of validity from contrasting groups can be tested by identifying groups that are expected (or known) to have contrasting scores on an instrument. For example, researchers select samples from a group of individuals with a diagnosis of depression and a group that does not have this diagnosis. You would expect these two groups of individuals to have contrasting scores on the CES-D. The group with the diagnosis of depression would be expected to have higher scores than those without the depression diagnosis, which would add to the construct validity of this scale.


Evidence of validity from convergence is determined when a relatively new instrument is compared with an existing instrument(s) that measures the same construct. The instruments, the new and existing ones, are administered to a sample at the same time, and the results are evaluated with correlational analyses. If the measures are strongly positively correlated, the validity of each instrument is strengthened. For example, the CES-D has shown positive correlations ranging from 0.40 to 0.80 with the Hamilton Rating Scale for Depression, which supports the convergent validity of both scales (Locke & Putnam, 2002; Sharp & Lipsky, 2002).


Sometimes instruments can be located that measure a concept opposite to the concept measured by the newly developed instrument. For example, if the newly developed instrument is a measure of hope, you could search for an instrument that measures hopelessness or despair. Having study participants complete both of these scales is a way to examine evidence of validity from divergence. Correlational procedures are performed with the measures of the two concepts. If the divergent measure (hopelessness scale) is negatively correlated (such as −0.4 to −0.8) with the other instrument (hope scale), validity for each of the instruments is strengthened (Waltz et al., 2010).
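
Both forms of evidence come down to the sign and strength of a correlation between total scores collected at the same time, as in this sketch (Python; all scores are invented):

    import numpy as np

    depression_a = np.array([22, 8, 30, 15, 26, 11])   # totals on one depression scale
    depression_b = np.array([20, 10, 27, 17, 24, 12])  # totals on a second depression scale
    hope = np.array([10, 28, 6, 20, 9, 25])            # totals on a scale for the opposite concept

    r_convergent = np.corrcoef(depression_a, depression_b)[0, 1]
    r_divergent = np.corrcoef(depression_a, hope)[0, 1]
    print(f"convergent r = {r_convergent:.2f}  (expected strongly positive)")
    print(f"divergent r  = {r_divergent:.2f}  (expected negative)")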


The evidence of an instrument’s validity from previous research and the current study needs to be included in the published report. In critically appraising a study, you need to judge the validity of the measurement methods that were used. However, you cannot consider validity apart from reliability (see Table 10-1). If a measurement method does not have acceptable reliability or is not consistently measuring a concept, then it is not valid.



Accuracy, Precision, and Error of Physiological Measures


Physiological measures are measurement methods used to quantify the level of functioning of living beings (Ryan-Wenger, 2010). The precision, accuracy, and error of physiological and biochemical measures tend not to be reported or are minimally covered in published studies. These routine physiological measures are assumed to be accurate and precise, an assumption that is not always correct. Some of the most common physiological measures used in nursing studies include BP, heart rate, weight, body mass index, and laboratory values. Sometimes researchers obtain these measures from the patient’s record, with no consideration given to their accuracy. For example, how many times have you heard a nurse ask a patient his or her height or weight, rather than measuring or weighing the patient? Thus researchers using physiological measures need to provide evidence of the measures’ accuracy, precision, and potential for error (see Table 10-1; Gift & Soeken, 1988; Ryan-Wenger, 2010).



Accuracy


Accuracy is comparable to validity in that it addresses the extent to which the instrument measures what it is supposed to measure in a study (Ryan-Wenger, 2010). For example, oxygen saturation measurements with pulse oximetry are considered comparable with measures of oxygen saturation with arterial blood gases. Because pulse oximetry is an accurate measure of oxygen saturation, it has been used in studies because it is easier, less expensive, less painful, and less invasive for research participants. Researchers need to document that previous research has been conducted to determine the accuracy of pulse oximetry for the measurement of individuals’ oxygen saturation levels in their study.



Precision


Precision is the degree of consistency or reproducibility of measurements made with physiological instruments. Precision is comparable to reliability. The precision of most physiological equipment depends on following the manufacturer’s instructions for care and routine testing of the equipment. Test-retest reliability is appropriate for physiological variables that have minimal fluctuations, such as cholesterol (lipid) levels, bone mineral density, or weight of adults (Ryan-Wenger, 2010). Test-retest reliability can be inappropriate if the variables’ values frequently fluctuate with various activities, such as with pulse, respirations, and BP. However, test-retest is a good measure of precision if the measurements are taken in rapid succession. For example, the national BP guidelines encourage taking three BP readings 1 to 2 minutes apart and then averaging them to obtain the most precise and accurate measure of BP (http://www.nhlbi.nih.gov/guidelines/hypertension).
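
Following that averaging protocol in code is trivial but makes the point about precision: the value recorded for each subject is the mean of three readings taken minutes apart (a sketch; the readings are invented):

    systolic = [118, 122, 120]   # three readings, 1 to 2 minutes apart (mm Hg)
    diastolic = [78, 82, 80]

    mean_sbp = sum(systolic) / len(systolic)
    mean_dbp = sum(diastolic) / len(diastolic)
    print(f"recorded BP: {mean_sbp:.0f}/{mean_dbp:.0f} mm Hg")  # 120/80 mm Hg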



Error


Sources of error in physiological measures can be grouped into the following five categories: environment, user, subject, equipment, and interpretation. The environment affects the equipment and subject. Environmental factors might include temperature, barometric pressure, and static electricity. User errors are caused by the person using the equipment and may be associated with variations by the same user, different users, or changes in supplies or procedures used to operate the equipment. Subject errors occur when the subject alters the equipment or the equipment alters the subject. In some cases, the equipment may not be used to its full capacity. Equipment error may be related to calibration or the stability of the equipment. Signals transmitted from the equipment are also a source of error and can result in misinterpretation (Gift & Soeken, 1988). Researchers need to report the protocols followed or steps taken to prevent errors in their physiological and biochemical measures in their published studies (Ryan-Wenger, 2010; Stone & Frazier, 2010).




Research Example


Directness, Level of Measurement, Reliability, and Validity of Scales, Accuracy, Precision, and Error of Physiological Measures



Research Excerpt


Whittenmore, Melkus, Wagner, Dziura, Northrup, and Grey (2009) studied the effects of a lifestyle change program on the outcomes for patients with type 2 diabetes. The lifestyle program was delivered by nurse practitioners (NPs) in primary care settings. The following excerpt describes some of the measurement methods used in this study.




Outcome Measures




“Data were collected at the individual (participant) and organizational (NP and site) levels at scheduled time points throughout the study. … All data were collected by trained research assistants blinded to group assignment, with the exception of … lipids, which were collected by experienced laboratory personnel at each site and sent to one laboratory for analysis.” (Whittenmore et al., 2009, p. 5)


“Efficacy data were collected on clinical outcomes (weight loss, waist circumference, and lipid profiles); behavioral outcomes (nutrition and exercise); and psychological outcomes (depressive symptoms). … All data were collected at baseline, 3 months, and 6 months, with the exception of the laboratory data, which were collected at baseline and 6 months. … Efficacy data collection measures and times were based on the DPP [diabetes prevention program] study and modified for the short duration of this pilot study.


Weight loss was the primary outcome and was calculated as a percentage of weight loss from baseline to 6 months. … Waist circumference and lipid profiles were secondary clinical outcomes. Waist circumference was measured by positioning a tape measure snugly midway between the upper hip bone and the uppermost border of the iliac crest. In very overweight participants, the tape was placed at the level of the umbilicus (Klein et al., 2007). Lipid profiles (LDL [low-density lipoproteins], HDL [high-density lipoproteins], total cholesterol, and total triglycerides) were determined using fasting venous blood.


Diet and exercise health-promoting behaviors were measured with the exercise and nutrition subscales of the Health-Promoting Lifestyle Profile II (eight and nine items, respectively), which has items constructed on a 4-point Likert scale and measures patterns of diet and exercise behavior (Walker, Sechrist, & Pender, 1987). This instrument has been used in diverse samples, and demonstrates adequate internal consistency (r = .70 to .90 for subscales; Jefferson, Melkus, & Spollett, 2000). The alpha coefficients for the exercise and nutrition subscales in this study were .86 and .76, respectively.…


Psychosocial data were collected on depressive symptoms, as measured by the Center for Epidemiologic Studies Depression Scale (CES-D), a widely used scale (Radloff, 1977). The CES-D consists of 20 items that address depressed mood, guilt or worthlessness, helplessness or hopelessness, psychomotor retardation, loss of appetite, and sleep disturbance [see Figure 10-3]. Each item is rated on a scale of 0 to 3 in terms of frequency during the past week. The total score may range from 0 to 60, with a score of 16 or more indicating impairment. High internal consistency, acceptable test-retest reliability, and good construct validity have been demonstrated (Posner et al., 2001). The alpha coefficient was .93 for the CES-D in this sample.”


(Whittenmore et al., 2009, p. 6)
