The expectation that nursing practice be based on research evidence has made it important for students and clinical nurses to acquire skills in reading and evaluating the results from statistical analyses (Brown, 2014; Craig & Smyth, 2012). Nurses probably have more anxiety about data analysis and statistical results than they do about any other aspect of the research process. We hope that this chapter will dispel some of that anxiety and facilitate your critical appraisal of research reports. The statistical information in this chapter is provided from the perspective of reading, understanding, and critically appraising the results sections in quantitative studies rather than on selecting statistical procedures for data analysis or performing statistical analyses.

To appraise the results from quantitative or outcomes studies critically, you need to be able to (1) identify the statistical procedures used, (2) judge whether these procedures were appropriate for the purpose and the hypotheses, questions, or objectives of the study and level of measurement of the variables, (3) determine whether the researchers’ interpretations of the results are appropriate, and (4) evaluate the clinical importance of the study’s findings. This chapter was developed to provide you with a background for critically appraising the results and discussion sections of quantitative studies.

The elements of the statistical analysis process are discussed at the beginning of this chapter. Relevant theories and concepts of statistical analyses are described to provide a background for understanding the results included in research reports. Some of the common statistical procedures used to describe variables, examine relationships among variables, predict outcomes, and test causal hypotheses are introduced. Strategies are identified for determining the appropriateness of the statistical analysis techniques included in the results sections of published studies. Guidelines are provided for critically appraising the statistical results of studies. The chapter concludes with guidelines for critically appraising the following study outcomes—findings, limitations, conclusions, generalizations, implications for nursing practice, and suggestions for further study. Examples from current studies are provided throughout this chapter to promote your understanding of the content.

Understanding the Elements of the Statistical Analysis Process

Statistical techniques are analysis procedures used to examine, reduce, and give meaning to the numerical data gathered in a study. In this textbook, statistics are divided into two major categories, descriptive and inferential. Descriptive statistics are summary statistics that allow the researcher to organize data in ways that give meaning and facilitate insight. Descriptive statistics are calculated to describe the sample and key study variables. Inferential statistics are designed to address objectives, questions, and hypotheses in studies to allow inference from the study sample to the target population. Inferential analyses are conducted to identify relationships, examine predictions, and determine group differences in studies.

In critically appraising a study, it may be helpful to understand the process that researchers use to perform data analyses. The statistical analysis process consists of several stages: (1) management of missing data; (2) description of the sample; (3) examination of the reliability of measurement methods; (4) conduct of exploratory analyses of study data; and (5) conduct of inferential analyses guided by the study hypotheses, questions, or objectives. Although not all of these stages are equally reflected in the final published report of the study, they all contribute to the insights that can be gained from analysis of the study data.

Management of Missing Data

Except in very small studies, researchers almost always use computers for data analyses. The first step of the process is entering the data into the computer using a systematic plan designed to reduce errors. Missing data points are identified during data entry. If enough data are missing for certain variables, researchers may have to determine whether the data are sufficient to perform analyses using those variables. In some cases, subjects must be excluded from an analysis because data considered essential to that analysis are missing. In examining the results of a published study, you might note that the number of subjects included in the final analyses is less than the original sample; this could be a result of attrition and/or subjects with missing data being excluded from the analyses. It is important for researchers to discuss missing data and its management in the study.

Description of the Sample

Researchers obtain as complete a picture of the sample as possible for their research report. Variables relevant to the sample are called demographic variables and might include age, gender, ethnicity, educational level, and number of chronic illnesses (see Chapter 5). Demographic variables measured at the nominal and ordinal levels, such as gender, ethnicity, and educational level, are analyzed with frequencies and percentages. Estimates of central tendency (e.g., the mean) and dispersion (e.g., the standard deviation) are calculated for variables such as age and number of chronic illnesses that are measured at the ratio level. Analysis of these demographic variables produces the sample characteristics for the study participants or subjects. When a study includes more than one group (e.g., treatment group and control or comparison group), researchers often compare the groups in relation to the demographic variables. For example, it might be important to know whether the groups’ distributions of age and chronic illnesses were similar. When demographic variables are similar for the treatment (intervention) and comparison groups, the study is stronger because the outcomes are more likely to be caused by the intervention rather than by group differences at the start of the study.
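The matching of descriptive technique to level of measurement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the education and age values are invented for the example (they do not come from any study), and the function name frequency_table is hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical data, invented for illustration only (not from any study).
education = ["Elementary", "Middle school", "Middle school", "High school",
             "College or more", "Middle school", "Elementary", "High school"]
ages = [61, 67, 72, 58, 66, 69, 74, 63]

def frequency_table(values):
    """Return {category: (count, percent)} for a nominal or ordinal variable."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}

# Nominal/ordinal variable: frequencies and percentages.
education_table = frequency_table(education)

# Ratio-level variable: central tendency and dispersion.
age_mean = mean(ages)
age_sd = stdev(ages)  # sample standard deviation (n - 1 denominator)

for category, (n, pct) in education_table.items():
    print(f"{category}: {n} ({pct}%)")
print(f"Age: mean = {age_mean:.1f}, SD = {age_sd:.1f}")
```

When you read a sample description in a published study, you can ask the same question the sketch encodes: were categories summarized with counts and percentages, and were ratio-level variables summarized with a mean and standard deviation?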

Critical Appraisal Guidelines

Description of the Sample

When critically appraising a study, you need to examine the sample characteristics and judge the representativeness of the sample using the following questions.

1. What variables were used to describe the sample?

2. What statistical techniques were used to descriptively analyze the demographic variables, and were these techniques appropriate based on the level of measurement of these variables? Figure 10-2 covers the rules for the nominal, ordinal, interval, and ratio levels of measurement.

3. Was the sample representative of the study target population? For example, was this study’s sample similar to the samples of other studies in this area that were cited in the literature review or the discussion section of the study?

4. If the sample is divided into groups for data analyses, was the similarity or homogeneity of the groups discussed? (See Chapter 9.)

Research Example

Description of the Sample

Research Study

Kim, Chung, Park, and Kang (2012) conducted a quasi-experimental study to examine the effectiveness of an aquarobic exercise program on the self-efficacy, pain, body weight, blood lipid levels, and depression of patients with osteoarthritis. The study included 70 subjects, with 35 patients randomly assigned to the experimental group and 35 to the control group. We recommend that you obtain this article and review the study. The results from this study are presented as examples several times in this chapter to facilitate your understanding of statistical techniques.

The demographic variables used to describe the sample in the study by Kim and colleagues (2012) included age, educational level, marital status, religion, occupation, income, and health status. Descriptive statistics of frequency and percentage (%) were used to analyze the demographic data, and the experimental and control groups were compared for similarities. The results from these analyses are presented in Table 11-1.

Table 11-1

Homogeneity Test of General Characteristics Between Experimental and Control Groups

Characteristics	Experimental Group (n = 35), n (%)	Control Group (n = 35), n (%)	p
Age (yr)
55-59	0 (0.0)	2 (5.7)	0.545*
60-64	11 (31.4)	9 (25.7)
65-69	15 (42.9)	17 (48.6)
≥70	9 (25.7)	7 (20.0)
Educational Level
None	4 (11.4)	1 (2.9)	0.373*
Elementary	5 (14.3)	11 (31.4)
Middle school	15 (42.9)	14 (40.0)
High school	5 (14.3)	5 (14.3)
College or more	6 (17.1)	4 (11.4)
Marital Status
Married	22 (62.9)	22 (62.9)	1.000*
Bereavement	11 (31.4)	12 (34.3)
Other	2 (5.7)	1 (2.9)
Religion
Christian	8 (22.9)	7 (20.0)	0.907*
Catholic	12 (34.3)	15 (42.9)
Buddhist	9 (25.7)	9 (25.7)
None	5 (14.3)	4 (11.4)
Occupation
Yes	4 (11.4)	3 (8.6)	1.000*
None	31 (88.6)	32 (91.4)
Income (per/mo)
<100	18 (51.4)	18 (51.4)	0.496*
100-200	5 (14.3)	9 (25.7)
201-300	7 (20.0)	6 (17.1)
>300	5 (14.3)	2 (5.7)
Health Status
Good	5 (14.7)	5 (14.3)	0.649
Fair	16 (47.1)	16 (45.7)
Bad	13 (38.2)	14 (40.0)

*Fisher’s exact test.

From Kim, I., Chung, S., Park, Y., & Kang, H. (2012). The effectiveness of an aquarobic exercise program for patients with osteoarthritis. Applied Nursing Research, 25(3), p. 186 (Table 2 from the article).

Critical Appraisal

Kim and associates (2012) developed a table that clearly presented the results of their analysis of demographic variables, and they discussed this table in the results section of their research report. The demographic variables of age, educational level, marital status, occupation, and income are commonly used in many studies to describe the samples. The descriptive analysis techniques of frequency and percentage were appropriate for the demographic variables measured at the nominal or ordinal level. Age, educational level, income, and health status were measured at the ordinal level, and marital status, religion, and occupation were measured at the nominal level.

The study included two groups (experimental and control), and differences between these two groups were examined for each of the demographic variables using the chi-square or Fisher’s exact tests. These analytical procedures are appropriate for examining group differences for nominal-level variables and are discussed in more depth later in this chapter. The p values (probabilities in this study) were all greater than 0.05, indicating no significant differences between the experimental and control groups for the demographic variables. Therefore the groups could be considered demographically similar in this study, so any significant differences noted are more likely to be caused by the study intervention than by differences in the groups at the start of the study.
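To make the group comparison more concrete, the following Python sketch computes a two-sided Fisher’s exact test for a 2 × 2 table, using the occupation counts reported in Table 11-1. It is a minimal illustration, not the procedure the researchers ran: the function name fisher_exact_two_sided is hypothetical, and the sketch handles only 2 × 2 tables, so it does not apply to the larger tables for variables such as age or religion.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(prob(k) for k in range(k_min, k_max + 1)
                  if prob(k) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

# Occupation row of Table 11-1: rows are groups, columns are Yes / None.
# Experimental: 4 yes, 31 none; Control: 3 yes, 32 none.
p_occupation = fisher_exact_two_sided(4, 31, 3, 32)
print(f"Occupation: p = {p_occupation:.3f}")
```

For the occupation counts, the sketch returns p = 1.000, which agrees with the value reported in Table 11-1 and is well above the 0.05 cutoff.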

Implications for Practice

Kim and co-workers (2012) implemented a structured aquarobic exercise program that included two educational sessions, followed by selected aerobic exercises. The aquarobic exercises are detailed in a table in the article and involved an instructor leading the patients with osteoarthritis through various aerobic exercises in the water for 1 hour, three times a week, for a total of 36 sessions over 12 weeks. The researchers found that the “Aquarobic Exercise Program was effective in enhancing self-efficacy, decreasing pain, and improving depression levels, body weight, and blood lipid levels in patients with osteoarthritis” (Kim et al., 2012, p. 181). They recommended the use of this program in managing patients with osteoarthritis but also recognized the need for additional research to determine the long-term benefits of this program for these patients. The Quality and Safety Education for Nurses Institute (QSEN, 2013) provides competencies for prelicensure nurses. The QSEN implication of this research report is the evidence-based intervention of an aquarobic exercise program, which improved the health outcomes for these patients with osteoarthritis. Nurses and students are encouraged to use research findings in promoting an evidence-based practice (EBP) for nursing.

Reliability of Measurement Methods

Researchers need to report the reliability of the measurement methods used in their study. The reliability of observational or physiological measures is usually determined during the data collection phase and needs to be noted in the research report. If a scale was used to collect data, the Cronbach alpha procedure needs to be applied to the scale items to determine the reliability of the scale for this study (Waltz, Strickland, & Lenz, 2010). If the Cronbach alpha coefficient is unacceptably low (< 0.70), the researcher must decide whether to analyze the data collected with the instrument. A value of 0.70 is considered acceptable, especially for newly developed scales. A Cronbach alpha coefficient value of 0.80 to 0.89 from previous research indicates that a scale is sufficiently reliable to use in a study (see Chapter 10). The t-test or Pearson’s correlation statistics may be used to determine test-retest reliability (Grove, Burns, & Gray, 2013). In critically appraising a study, you need to examine the reliability of the measurement methods and the statistical procedures used to determine these values. Sometimes researchers examine the validity of the measurement methods used in their studies, and this content also needs to be included in the research report (see Chapter 10).

An example is presented from the Kim and colleagues’ (2012) study (see earlier). They measured self-efficacy with a 14-item Likert-type scale that had been previously developed for patients with arthritis. The higher scores indicated greater levels of self-efficacy. The calculated Cronbach alpha for this scale in this study was 0.90, indicating 90% reliability or consistency in the measurement of self-efficacy and 10% error. Kim and associates (2012) measured depression with the Zung Self-Rating Depression Scale, which “consisted of 20 items (10 positive and 10 negative) on a 4-point Likert-type scale. Negative responses were converted into scores, and higher scores indicated greater levels of depression. Cronbach’s alpha in this study was 0.75” (75% reliability and 25% error; Kim et al., 2012, p. 185). This study included a concise description of the scales used to collect psychosocial data and documented the scales’ reliability using Cronbach alpha values. The reliability of the self-efficacy scale was strong at 0.90, but the Zung Self-Rating Depression Scale reliability (0.75) was a little low for an established scale.
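If you are curious about how a Cronbach alpha coefficient is obtained from scale items, the following Python sketch shows the standard calculation. The responses are invented for illustration (5 hypothetical respondents answering 4 Likert-type items) and are not data from Kim and colleagues (2012); the function name cronbach_alpha is also hypothetical.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    items = list(zip(*item_scores))                  # one tuple of scores per item
    item_variances = [variance(item) for item in items]
    total_scores = [sum(person) for person in item_scores]
    total_variance = variance(total_scores)
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 5 respondents x 4 Likert-type items (invented data).
responses = [
    [4, 4, 3, 4],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
print(f"Cronbach alpha = {alpha:.2f}")
```

The coefficient rises when the items vary together across respondents, which is why it is read as an index of the internal consistency of the scale in a particular sample.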

Understanding Theories and Concepts of the Statistical Analysis Process

One reason that nurses tend to avoid statistics is that many were taught only the mathematical procedures of calculating statistical equations, with little or no explanation of the logic behind those procedures or the meaning of the results. Computation is a mechanical process usually performed by a computer, and information about the calculation procedure is not necessary to begin understanding statistical results. Here we present an approach to data analysis that will enhance your understanding of the statistical analysis process. You can then use this understanding to appraise data analysis techniques critically in the results section of research reports.

This section presents a brief explanation of some of the theories and concepts important in understanding the statistical analysis process. Probability theory and decision theory are discussed, and the concepts of hypothesis testing, level of significance, inference, generalization, the normal curve, tailedness, type I and type II errors, power, and degrees of freedom are described. More extensive discussion of these topics can be found in other sources, and we recommend our own textbooks (Grove, 2007; Grove et al., 2013) and a quality statistical text by Plichta and Kelvin (2013).

Probability Theory

Probability theory is used to explain the extent of a relationship, the probability that an event will occur in a given situation, or the probability that an event can be accurately predicted. The researcher might want to know the probability that a particular outcome will result from a nursing intervention. For example, the researcher may want to know how likely it is that urinary catheterization during hospitalization will lead to a urinary tract infection (UTI) after discharge from the hospital. The researcher also may want to know the probability that subjects in the experimental group are members of the same larger population from which the comparison or control group subjects were taken. Probability is expressed as a lowercase letter p, with values expressed as percentages or as decimal values ranging from 0 to 1. For example, if the probability is 0.23, then it is expressed as p = 0.23. This means that there is a 23% probability that a particular outcome (e.g., a UTI) will occur. Probability values also can be stated as less than a specific value, such as 0.05, expressed as p < 0.05. (The symbol < means less than.) A study may indicate that the probability that the experimental group subjects were members of the same larger population as the comparison group subjects was less than or equal to 5% (p ≤ 0.05). In other words, it is not very likely that the comparison group and the experimental group are from the same population. Put more precisely, if the two groups really were from the same population, a difference as large as the one observed would be expected in no more than 5% of studies. The inference is that the experimental group is different from the comparison group because of the effect of the intervention in the study. Probability values often are stated with the results of inferential statistical analyses. In critically appraising studies, it is useful to recognize these symbols and understand what they mean.

Decision Theory, Hypothesis Testing, and Level of Significance

Decision theory assumes that all of the groups in a study (e.g., experimental and comparison groups) used to test a particular hypothesis are components of the same population relative to the variables under study. This expectation (or assumption) traditionally is expressed as a null hypothesis, which states that there is no difference between (or among) the groups in a study, in terms of the variables included in the hypothesis (see Chapter 5 for more details of types of hypotheses). It is up to the researcher to provide evidence for a genuine difference between the groups. For example, the researcher may hypothesize that the frequency of UTIs that occurred after discharge from the hospital in patients who were catheterized during hospitalization is no different from the frequency of such infections in those who were not catheterized. To test the assumption of no difference, a cutoff point is selected before data collection. The cutoff point, referred to as alpha (α), or the level of statistical significance, is the probability level at which the results of statistical analysis are judged to indicate a statistically significant difference between the groups. The level of significance selected for most nursing studies is 0.05. If the p value found in the statistical analysis is less than or equal to 0.05, the experimental and comparison groups are considered to be significantly different (members of different populations).

Decision theory requires that the cutoff point selected for a study be absolute. Absolute means that even if the value obtained is only a fraction above the cutoff point, the samples are considered to be from the same population, and no meaning can be attributed to the differences. When using decision theory, it is inappropriate to state that the findings approached significance at p = 0.051 if the alpha level was set at 0.05. Using decision theory rules, this finding indicates that the groups tested are not significantly different, and the null hypothesis is not rejected. On the other hand, once the level of significance has been set at 0.05 by the researcher, if the analysis yields a p value of 0.001, this result is not considered more significant than the 0.05 originally proposed (Slakter, Wu, & Suzaki-Slakter, 1991). The level of significance is dichotomous, which means that the difference is significant or not significant; there are no “degrees” of significance. However, some people, not realizing that their reasoning has shifted from decision theory to probability theory, indicate in their research report that the 0.001 result makes the findings more significant than if they had obtained only a 0.05 level of significance.
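The absolute nature of the cutoff can be expressed as a simple rule. The short Python sketch below is only an illustration; the function name decision is hypothetical, and the three p values are the ones used as examples in this section.

```python
def decision(p_value, alpha=0.05):
    """Apply the absolute cutoff: significant only if p <= alpha."""
    return "significant" if p_value <= alpha else "not significant"

# p = 0.001 and p = 0.05 lead to the same decision; p = 0.051 does not qualify.
for p in (0.001, 0.05, 0.051):
    print(f"p = {p}: {decision(p)}")
```

Notice that the rule returns only one of two answers. Under decision theory, a p value of 0.001 does not earn a stronger label than a p value of 0.05.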

From the perspective of probability theory, there is considerable difference in the risk of occurrence of a type I error (saying something is significant when it is not) when the probability is 0.001 rather than 0.05. If p = 0.001, the probability that the two groups are components of the same population is 1 in 1000; if p = 0.05, the probability that the groups belong to the same population is 5 in 100. In other words, if p = 0.05, then in 5 times out of 100, groups with statistical values such as those found in these statistical analyses actually are members of the same population, and the conclusion that the groups are different is erroneous.

In computer analysis, the probability value obtained from each data analysis (e.g., p = 0.03 or p = 0.07) frequently is provided on the printout and is often reported by the researcher in the published study, along with the level of significance set before data analysis was conducted. In summary, the probability (p) value reveals the risk of a type I error in a particular study. The alpha (α) value, set prior to the study, usually at α = 0.05, reveals whether the probability value for a particular analysis in a study met the cutoff point for a significant difference between groups or a significant relationship between variables.

Inference and Generalization

An inference is a conclusion or judgment based on evidence. Statistical inferences are made cautiously and with great care. The decision theory rules used to interpret the results of statistical procedures increase the probability that inferences are accurate. A generalization is the application of information that has been acquired from a specific instance to a general situation. Generalizing requires making an inference; both require the use of inductive reasoning. An inference is made from a specific case and extended to a general truth, from a part to the whole, from the concrete to the abstract, and from the known to the unknown. In research, an inference is made from the study findings obtained from a specific sample and applied to a more general target population, using the results from statistical analyses. For example, a researcher may conclude in a research report that a significant difference was found in the number of UTIs between two samples, one in which the subjects had been catheterized during hospitalization and another in which the subjects had not. The researcher also may conclude that this difference can be expected in all patients who have received care in hospitals. The findings are generalized from the sample in the study to all previously hospitalized patients. Statisticians and researchers can never prove something using inference; they can never be certain that their inferences and generalizations are correct. The researcher’s generalization of the incidence of UTIs may not have been carefully thought out; the findings may have been generalized to a population that was overly broad. It is possible that in the more general population, there is no difference in the incidence of UTIs based on whether the patient was catheterized or not. Generalizing study findings is part of the discussion section of a research report (see later).

Normal Curve

A normal curve is a theoretical frequency distribution of all possible values in a population; however, no real distribution exactly fits the normal curve (Figure 11-1). The idea of the normal curve was developed by an 18-year-old mathematician, Johann Gauss, in 1795. He found that data from variables (e.g., the mean of each sample) measured repeatedly in many samples from the same population can be combined into one large sample. From this large sample, a more accurate representation can be developed of the pattern of the curve in that population than is possible with only one sample. Surprisingly, in most cases, the curve is similar, regardless of the specific variables examined or the population studied.

Levels of significance and probability are based on the logic of the normal curve. The normal curve presented in Figure 11-1 shows the distribution of values for a single population. Note that 95.5% of the values are within 2 standard deviations (SDs) of the mean, ranging from − 2 to + 2 SDs. (Standard deviation is described later in this chapter; see “Using Statistics to Describe.”) Thus there is approximately a 95% probability that a given measured value (e.g., the mean of a group) would fall within approximately 2 SDs of the mean of the population, and there is a 5% probability that the value would fall in the tails of the normal curve (the extreme ends of the normal curve, below − 2 (− 1.96 exactly) SDs [2.5%] or above + 2 (+ 1.96 exactly) SDs [2.5%]). If the groups being compared are from the same population (not significantly different), you would expect the values (e.g., the means) of each group to fall within the 95% range of values on the normal curve. If the groups are from (significantly) different populations, you would expect one of the group values to be outside the 95% range of values. An inferential statistical analysis performed to determine differences between groups, using a level of significance (α) set at 0.05, would test that expectation. If the statistical test demonstrates a significant difference (the value of one group does not fall within the 95% range of values), the groups are considered to belong to different populations. However, in 5% of statistical tests, the value of one of the groups can be expected to fall outside the 95% range of values but still belong to the same population (a type I error).
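The percentages associated with the normal curve can be checked with a few lines of Python. This sketch uses the NormalDist class from Python’s standard statistics module; the helper name proportion_within is hypothetical and introduced only for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(z):
    """Proportion of a normal distribution lying within +/- z SDs of the mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

within_2_sd = proportion_within(2.0)
within_1_96_sd = proportion_within(1.96)
cutoff_for_95 = standard_normal.inv_cdf(0.975)  # leaves 2.5% in the upper tail

print(f"Within 2 SDs:    {within_2_sd:.4f}")
print(f"Within 1.96 SDs: {within_1_96_sd:.4f}")
print(f"Cutoff for 95%:  {cutoff_for_95:.2f}")
```

The sketch gives about 0.9545 for 2 SDs and about 0.9500 for 1.96 SDs, which is why 1.96 rather than 2 is the exact boundary for the middle 95% of values, with 2.5% in each tail.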

Tailedness

Nondirectional hypotheses usually assume that an extreme score (obtained because the group with the extreme score did not belong to the same population) can occur in either tail of the normal curve (Figure 11-2). The analysis of a nondirectional hypothesis is called a two-tailed test of significance. In a one-tailed test of significance, the hypothesis is directional, and extreme statistical values that occur in a single tail of the curve are of interest (see Chapter 5 for a discussion of directional and nondirectional hypotheses). The hypothesis states that the extreme score is higher or lower than that for 95% of the population, indicating that the sample with the extreme score is not a member of the same population. In this case, 5% of statistical values that are considered significant will be in one tail, rather than two. Extreme statistical values occurring in the other tail of the curve are not considered significantly different. In Figure 11-3, which shows a one-tailed test of significance, the portion of the curve in which statistical values will be considered significant is the right tail. Developing a one-tailed hypothesis requires that the researcher have sufficient knowledge of the variables to predict whether the difference will be in the tail above the mean or in the tail below the mean. One-tailed statistical tests are uniformly more powerful than two-tailed tests, decreasing the possibility of a type II error (saying something is not significant when it is).

Fig 11-2 Two-tailed test of significance.
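The practical difference between one-tailed and two-tailed tests can be seen by computing both p values for the same test statistic. The Python sketch below assumes a standard normal (z) statistic and uses a hypothetical value of z = 1.80 chosen only for illustration; the function names are likewise hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()

def p_two_tailed(z):
    """Probability of a value at least this extreme in either tail."""
    return 2 * (1 - standard_normal.cdf(abs(z)))

def p_one_tailed_upper(z):
    """Probability of a value at least this large in the upper (right) tail only."""
    return 1 - standard_normal.cdf(z)

z = 1.80  # hypothetical test statistic, in the predicted (upper) direction
p_two = p_two_tailed(z)
p_one = p_one_tailed_upper(z)

print(f"Two-tailed p = {p_two:.4f}")
print(f"One-tailed p = {p_one:.4f}")
```

With α = 0.05, this hypothetical statistic would be significant in a one-tailed test but not in a two-tailed test, because the two-tailed test splits the 5% between both tails. The same value in the lower tail would not be significant in this one-tailed test at all.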

Type I and Type II Errors

According to decision theory, two types of error can occur when a researcher is deciding what the result of a statistical test means, type I and type II (Table 11-2). A type I error occurs when the null hypothesis is rejected when it is true (e.g., when the results indicate that there is a significant difference, when in reality there is not). The risk of a type I error is indicated by the level of significance. There is a greater risk of a type I error with a 0.05 level of significance (5 chances for error in 100) than with a 0.01 level of significance (1 chance for error in 100).
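A small simulation can show what the risk of a type I error means in practice. In the Python sketch below, both groups are always drawn from the same population, so every significant result is a type I error. The simulation settings (30 subjects per group, 4000 simulated studies, a population SD of 1 treated as known, and a fixed random seed) are assumptions chosen for illustration, and the function name type_i_error_rate is hypothetical.

```python
import random
from statistics import NormalDist

def type_i_error_rate(alpha, n_per_group=30, n_studies=4000, seed=1):
    """Share of simulated studies that reject a true null hypothesis."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed cutoff
    standard_error = (2 / n_per_group) ** 0.5          # SD of 1 in both groups
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the SAME population, so the null is true.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        difference = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        if abs(difference / standard_error) >= critical_z:
            rejections += 1
    return rejections / n_studies

rate_05 = type_i_error_rate(alpha=0.05)
rate_01 = type_i_error_rate(alpha=0.01)
print(f"alpha = 0.05: {rate_05:.3f} of studies falsely significant")
print(f"alpha = 0.01: {rate_01:.3f} of studies falsely significant")
```

The proportions should fall close to 0.05 and 0.01, respectively, illustrating that a stricter level of significance lowers the risk of a type I error.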

Table 11-2

Type I and Type II Errors

	In Reality, The Null Hypothesis* Is:
Data Analysis Indicates	True	False
Results significant, null hypothesis rejected	Type I error (α)	Correct decision (power)
Results not significant, null hypothesis not rejected	Correct decision	Type II error (β)

*The null hypothesis is stating that no difference or relationship exists.

A type II error occurs when the null hypothesis is regarded as true but is in fact false. For example, statistical analyses may indicate no significant differences between groups, but in reality the groups are different (see Table 11-2). There is a greater risk of a type II error when the level of significance is 0.01 than when it is 0.05. However, type II errors are often caused by flaws in the research methods. In nursing research, many studies are conducted with small samples and with instruments that do not accurately and precisely measure the variables under study (Grove et al., 2013; Waltz et al., 2010). In many nursing situations, multiple variables interact to cause differences within populations. When only a few of the interacting variables are examined, small differences between groups may be overlooked. This leads to nonsignificant study results, which can cause researchers to conclude falsely that there are no differences between the samples when there actually are. Thus the risk of a type II error is often high in nursing studies.
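The trade-off between the level of significance and the risk of a type II error can be illustrated with an approximate calculation. The Python sketch below uses a normal approximation for a two-tailed comparison of two group means. The scenario (a standardized effect size of 0.5 with 35 subjects per group) is hypothetical, as is the function name approximate_power.

```python
from statistics import NormalDist

def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed, two-group comparison of means."""
    z = NormalDist()
    critical_z = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic when the effect is real.
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands beyond either cutoff.
    return (1 - z.cdf(critical_z - noncentrality)) + z.cdf(-critical_z - noncentrality)

# Hypothetical scenario: standardized effect size 0.5, 35 subjects per group.
beta_05 = 1 - approximate_power(0.5, 35, alpha=0.05)
beta_01 = 1 - approximate_power(0.5, 35, alpha=0.01)
print(f"alpha = 0.05: risk of type II error (beta) = {beta_05:.2f}")
print(f"alpha = 0.01: risk of type II error (beta) = {beta_01:.2f}")
```

In this hypothetical scenario, the risk of a type II error is noticeably larger at α = 0.01 than at α = 0.05, and it is substantial at both levels because the sample is small relative to the effect being sought.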

Power: Controlling the Risk of a Type II Error

Power is the probability that a statistical test will detect a significant difference that exists (see Table 11-2). The risk of a type II error can be determined using power analysis. Cohen (1988) has identified four parameters of a power analysis: (1) the level of significance; (2) sample size; (3) power; and (4) effect size. If three of the four are known, the fourth can be calculated using power analysis formulas. The minimum acceptable power level is 0.80 (80%). The researcher determines the sample size and the level of significance (usually set at α = 0.05). (Chapter 9 provides a detailed discussion of power analysis.) Effect size is “the degree to which the phenomenon is present in the population, or the degree to which the null hypothesis is false” (Cohen, 1988, pp. 9-10). For example, if changes in anxiety level are measured in a group of patients before surgery, with the first measurement taken when the patients are still at home, and the second taken just before surgery, the effect size will be large if a great change in anxiety occurs in the group between the two points in time. If the effect of a preoperative teaching program on the level of anxiety is measured, the effect size will be the difference in the post-test level of anxiety in the experimental group compared with that in the comparison group. If only a small change in the level of anxiety is expected, the effect size will be small. In many nursing studies, only small effect sizes can be expected. In such a study, a sample of 200 or more is often needed to detect a significant difference (Cohen, 1988). Small effect sizes occur in nursing studies with small samples, weak study designs, and measurement methods that measure only large changes. The power level should be discussed in studies that fail to reject the null hypothesis (or have nonsignificant findings). If the power level is below 0.80, you need to question the validity of the nonsignificant findings.
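Because any one of the four parameters can be calculated from the other three, a power analysis is often used to estimate the sample size needed. The Python sketch below uses a common normal-approximation formula for comparing two group means; it is a rough illustration, not a substitute for the power analysis formulas and tables in Cohen (1988). The effect size values of 0.2, 0.5, and 0.8 are the conventional benchmarks for small, medium, and large standardized effects and are assumptions introduced here for the example; the function name n_per_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (two-tailed, two-group comparison)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # cutoff for the level of significance
    z_power = z.inv_cdf(power)           # value corresponding to the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label} effect (d = {d}): about {n_per_group(d)} per group")
```

Under these assumptions, the approximation suggests roughly 393 subjects per group for a small effect, 63 for a medium effect, and 25 for a large effect. This pattern shows why studies that expect small effect sizes need large samples to reach a power of 0.80.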