6
Data Science
Suzanne Bakken
OVERVIEW
Data Science and Big Data
The National Consortium for Data Science defined data science “as the systematic study of the organization and use of digital data in order to accelerate discovery, improve critical decision-making processes, and enable a data-driven economy” (Ahalt et al., 2014, p. iii). Data science draws upon theories and techniques from multiple fields, including mathematics, statistics, and information technology, and includes methods such as probability models, machine learning, statistical learning, computer programming, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, and high-performance computing. Data science methods are applied to what is commonly called “big data,” which are typically conceptualized in terms of volume, variety, and velocity (e.g., sensor or streaming data) and, more recently, veracity, which refers to the level of uncertainty associated with the collection of data sources (IBM, 2015). These multiple attributes of big data reflect an additional complexity beyond that of simple volume (e.g., Medicare claims data may be considered voluminous, but do not necessarily meet other aspects of the data science definition due to their well-structured nature).
The Big Data to Knowledge (BD2K) initiative at the National Institutes of Health (NIH) reinforces this in their definition of biomedical big data as “more than just very large data or a large number of data sources” and “comprising the diverse, complex, disorganized, massive, and multimodal data being generated by researchers, hospitals, and mobile devices around the world,” including “imaging, phenotypic, molecular, exposure, health, behavioral, and many other types of data” (NIH, 2016b).
In a panel at the Council for the Advancement of Nursing Science, Meeker (2014) offered an operational perspective by contrasting properties of big data and associated processing in distributed research networks versus traditional outcomes research in federated research networks:
• Analysis questions: patterns, predictions, and classification versus causal inference and hypothesis testing
• Data distribution: randomly distributed across process versus tied to a site
• Number of network nodes: 100s versus 10s
• Common analytic platforms: R-Volution/R-Hadoop, Apache Mahout, Spark Machine Learning Library, Spark GraphX Library versus SAS, R, Stata
Some authors also consider a fifth “V” in data science attributes—value (Marr, 2015). The attribute of value also highlights the significance of domain expertise in data science to ensure that appropriate questions are asked of relevant data sources and that results are interpreted accurately through the stages of data science inquiry: obtain, scrub, explore, model, and interpret (Mason & Wiggins, 2010).
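The five stages named above (obtain, scrub, explore, model, and interpret) can be illustrated with a minimal sketch. The records, field meanings, and function names below are hypothetical and chosen only for illustration; they do not come from Mason and Wiggins (2010) or from any study cited in this chapter, and the simple linear fit stands in for whatever model a real inquiry would require.

```python
from statistics import mean

def obtain():
    # Hypothetical raw records: (participant id, daily steps, fatigue rating 0-10)
    return [
        ("p1", "4200", "6"),
        ("p2", "9800", "2"),
        ("p3", "", "7"),      # step count missing
        ("p4", "7100", "4"),
        ("p5", "3000", "8"),
    ]

def scrub(rows):
    # Drop incomplete records and convert text fields to numbers
    return [(pid, int(steps), int(fatigue))
            for pid, steps, fatigue in rows if steps and fatigue]

def explore(rows):
    # Simple descriptive summary of the cleaned data
    return {
        "n": len(rows),
        "mean_steps": mean(r[1] for r in rows),
        "mean_fatigue": mean(r[2] for r in rows),
    }

def model(rows):
    # Least-squares line for fatigue as a function of steps
    xs = [r[1] for r in rows]
    ys = [r[2] for r in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def interpret(slope):
    # Domain expertise is still needed to judge whether the pattern is meaningful
    direction = "lower" if slope < 0 else "higher"
    return f"In this toy sample, more steps are associated with {direction} fatigue."

clean = scrub(obtain())
summary = explore(clean)
slope, intercept = model(clean)
message = interpret(slope)
print(summary)
print(message)
```

The sketch describes an association only. As the paragraph above stresses, deciding whether the question is appropriate and whether the result is interpreted accurately remains a matter of domain expertise.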
Drivers for Data Science in Nursing and Other Health and Biomedical Research
There are multiple drivers for data science in nursing and other health and biomedical research. First, electronic health-related data have increased in volume and variety. This is particularly relevant to several aspects of the National Institute of Nursing Research (NINR) Strategic Plan, including symptom science, self-management, and precision medicine (NINR, 2016a). For example, consumer-generated data from mobile apps, wearables, medical devices, and sensors, along with multiple sources of environmental and “omic” data, complement traditional electronic sources of research data such as surveys, clinical data warehouses, and transaction data (Bakken & Reame, 2016). In addition, the Patient-Centered Outcomes Research Institute (PCORI) development of PCORnet has increased the availability of data from a variety of data sources through its Clinical Research Data Networks and Patient Powered Research Networks (Fleurence et al., 2014). The inclusion of the strong patient voice in the latter broadens the available data types and ensures their relevance to patient centeredness. Moreover, through the U.S. government’s Health Data Initiative and the open data portal (NIH, 2016c), thousands of datasets are accessible for linking with big data streams. These include datasets from federal agencies such as the Centers for Medicare and Medicaid Services, Centers for Disease Control and Prevention, Department of Health and Human Services, Food and Drug Administration, Health Resources and Services Administration, and NIH as well as states (e.g., Hawaii, Michigan, New York) and cities (e.g., Boston, San Francisco).
Second, through the BD2K initiative (NIH, 2016b), the NIH has acknowledged the role of big data to advance scientific knowledge and launched a set of funding opportunities designed to develop new approaches, methods, software tools, and related resources and increase the competency of the biomedical workforce in data science in three core areas: computer science, biostatistics, and biomedical science. The BD2K Centers program has funded 11 centers of excellence for big data computing and two additional centers, the LINCS-BD2K Perturbation Data Coordination and Integration Center, and the Broad Institute LINCS Center for Transcriptomics, which are collaborative projects with the NIH Common Fund LINCS program (NIH, 2016b). The centers also provide training to advance big data science in the context of biomedical research as part of the overall BD2K investments in training (NIH, 2016a). Training goals include increasing the data science skills of all biomedical scientists as well as preparing specialists in data science. Consequently, the training-related awards use a variety of funding mechanisms to support course development, including massive open online courses (MOOCs) on data management for biomedical big data and training resource development and coordination, as well as supplements to existing T32 and T15 predoctoral training programs. To foster the development of new teams consisting of biomedical scientists and data scientists, BD2K is also supporting the Quantitative Approaches to Biomedical Big Data Program along with the National Science Foundation.
Third, initiatives related to precision medicine and biobanking are growing, including the recent launch of President Obama’s Precision Medicine Initiative (PMI; Collins & Varmus, 2015). The NIH defined precision medicine as an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person. Data science methods are therefore essential to implementing the vision of a PMI cohort of more than a million persons and other aspects of precision medicine, because PMI takes advantage of increased connectivity through social media and mobile devices as well as “omic” and traditional data sources such as electronic health record (EHR) data and surveys.
NURSING SCIENCE FOUNDATIONS FOR DATA SCIENCE
In “Nursing Needs Big Data and Big Data Needs Nursing,” Brennan and Bakken (2015) argue that nursing’s role in data science is motivated by the American Nurses Association’s Social Policy Statement, which specifically delineates the role of nurses in the generation and application of knowledge and technology to health care outcomes and planning for health policy and regulation that is responsive to consumer needs and provides for best resource use in the provision of health care for all (American Nurses Association, 2010). Nursing science has produced critical foundations for data science in nursing and other health and biomedical research in multiple areas. Four key aspects are (a) health care standards that facilitate semantic interoperability; (b) common data elements (CDEs); (c) theories, models, and frameworks; and (d) ethical, legal, and social implications (ELSI). Moreover, nurse scientists have applied analytic techniques relevant to data science to datasets that do not meet the definitions of biomedical big data, demonstrating competencies that can be applied to big data.
Health Care Standards That Facilitate Semantic Interoperability
Documentation of patient status, nursing interventions, and patient outcomes in EHRs extends the patient phenotypes beyond demographics, medical diagnoses, medical procedures, and laboratory values and the care processes and outcomes beyond those attributable to physician practice. This includes data such as assessments related to fall risk (Dykes et al., 2010), high risk for readmission (Bowles et al., 2015), and nursing-sensitive patient outcomes (Brown, Donaldson, Burnes Bolton, & Aydin, 2010). Although aggregation of well-structured phenotypic data in large volumes and application of traditional analytic methods do not meet the definition of data science, these data are a critical data source that complements less structured EHR data (e.g., progress notes) and high volume and velocity data such as those from continuous physiological monitoring or environmental sources.
Nurse scientists have led efforts to establish the necessary standards to support inclusion of nursing data in a variety of EHRs and establish comparability across sites and settings at the national (Harris et al., 2015; Henry, Holzemer, Reilly, & Campbell, 1994; Westra et al., 2015) and international levels (Hardiker, 2003; Hardiker & Coenen, 2007; Matney et al., 2012). As part of interdisciplinary teams, nurse scientists have also contributed to integration of genomic data and knowledge into EHRs (Jing, Kay, Marley, Hardiker, & Cimino, 2012). Complementing the data standardization efforts, nurse scientists have also focused efforts on other aspects of semantic interoperability, including information models (Danko et al., 2003) and document representation standards (Hyun et al., 2009; Hyun & Bakken, 2006; Matney, Dolin, Buhl, & Sheide, 2016). These structures enhance the application and performance of natural language processing, a data science method for extracting nursing-relevant data from narrative documents such as progress notes or reports (Hyun, Johnson, & Bakken, 2009; Johnson et al., 2008).
Although nurses have been working in interdisciplinary standards development organizations for several decades, there are still knowledge and implementation gaps. To address these gaps, the University of Minnesota School of Nursing has convened a series of conferences and related activities to promote a national action plan for sharable and comparable nursing data to support practice and research (Westra et al., 2015).
Common Data Elements
Complementary to the efforts in semantic interoperability, the NIH has significantly invested in the creation of CDEs, including standardized measures. Use of CDEs (variables conceptually defined, operationalized, and measured in identical ways across studies and clinical practice) improves data quality and opportunities for data sharing (Redeker et al., 2015). In addition, CDEs enable comparison of outcomes across multiple studies and also support analysis of important subsamples, which is typically not possible within a single study due to inadequate statistical power (National Library of Medicine, 2015).
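The pooling argument can be made concrete with a small sketch. The two studies, the field names `age_group` and `fatigue_tscore`, and all values below are hypothetical and do not represent an actual CDE specification. The point is only that records measured in identical ways can be concatenated directly, so that a subsample too small to analyze in either study alone becomes larger in the pooled data.

```python
from statistics import mean

# Hypothetical records from two studies that share the same data elements
study_a = [
    {"age_group": "18-64", "fatigue_tscore": 52.1},
    {"age_group": "18-64", "fatigue_tscore": 49.3},
    {"age_group": "65+", "fatigue_tscore": 60.2},
]
study_b = [
    {"age_group": "65+", "fatigue_tscore": 57.8},
    {"age_group": "18-64", "fatigue_tscore": 50.0},
    {"age_group": "65+", "fatigue_tscore": 62.0},
]

def subsample(records, age_group):
    # Select scores for one subgroup; works on any study using the shared fields
    return [r["fatigue_tscore"] for r in records if r["age_group"] == age_group]

# Identical definitions mean pooling is simple concatenation, with no recoding
pooled = study_a + study_b

older_a = subsample(study_a, "65+")
older_b = subsample(study_b, "65+")
older_pooled = subsample(pooled, "65+")
pooled_mean = mean(older_pooled)

print(len(older_a), len(older_b), len(older_pooled))
print(round(pooled_mean, 1))
```

Had the two studies measured fatigue with different instruments, a harmonization step would be needed before the concatenation, and that step is what CDEs are intended to avoid.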
An important source of CDEs, the Patient-Reported Outcomes Measurement Information System (PROMIS®), includes a collection of highly reliable, precise measures of patient-reported health status for physical, mental, and social well-being. The measures are supplemented by web-based tools for administration, including computer adaptive testing. A CDE effort specific to nursing science that occurred within the context of the NINR Centers of Excellence program built upon the PROMIS work to select a set of symptom CDEs (Redeker et al., 2015). The recommended CDEs for six of seven symptoms were PROMIS measures: pain (PROMIS Pain), fatigue (PROMIS Fatigue), sleep disturbance (PROMIS Sleep Disturbance plus additional duration question), affective mood disturbance (PROMIS Positive Affect and PROMIS Depression), affective well-being (Medical Outcomes Study Short Form-36 Psychological Well-being Scale), and cognitive disturbance (PROMIS Applied Cognition and General Concerns).
As with structured EHR data, a large volume of CDEs alone does not constitute data science. However, these data represent important variables that can be combined with other data to conduct analyses and discover insights for hypothesis generation. For example, in the area of symptom science, consumer-facing technologies such as mobile devices and sensors provide an important source of high volume and velocity data related to the symptom experience, including the environmental conditions under which the symptoms exacerbate (Bakken & Reame, 2016).
Theories, Models, and Frameworks
Brennan and Bakken note that “nursing’s long tradition of theory-driven science provides the frameworks that can guide explorations towards promising phenomena and leverage insights into new knowledge, thereby avoiding the distractions of opportunistic exploration” (Brennan & Bakken, 2015, p. 481). Nurse theorists have generated grand theories that address the metaparadigm concepts of person, environment, and health. For example, the Roy Adaptation Model has influenced nursing science and education for more than 40 years (Roy, 2011). In addition, middle range theories have been developed that can be used to frame data science-based inquiry to advance nursing science. For instance, Riegel developed a middle range theory of self-care that addresses the process of maintaining health with health promoting practices within the context of chronic illness management and includes major concepts of self-care maintenance, self-care monitoring, and self-care management (Riegel, Jaarsma, & Stromberg, 2012). Another study tested a middle range theory of adaptation to chronic pain in community-dwelling older adults, and found that contextual stimuli should be considered when developing plans of care for older adults experiencing chronic pain (Dunn, 2005). The increasing availability of electronic data streams will enable further development and testing of such middle range theories.
FIGURE 6.1 Social Ecological Model definitions and potential electronic data streams for symptom self-management.
CDE, common data element; EHR, electronic health record.
Moreover, nurse scientists have applied theories, models, or frameworks from other disciplines to advance nursing science. For example, the Precision in Symptom Self-Management (PriSSM) Center uses data science approaches to advance the science of symptom self-management for Latinos through a social ecological lens that takes into account variability in individual, interpersonal, organizational, and environmental factors across the life course. The Social Ecological Model (SEM) was chosen as the theoretical model for the center (Centers for Disease Control and Prevention, 2015). Definitions and potential electronic data streams are summarized in Figure 6.1 to illustrate the intersection of data science and theoretical approaches.
Ethical, Legal, and Social Implications
Bakken and Reame (2016) explicated the potential ethical perils related to data science, using the three principles for the ethical conduct of research involving human subjects as articulated in the historic Belmont Report: respect for persons (i.e., autonomy), beneficence, and justice (The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1979).
The primary mechanism for protection of autonomy, which relates to the principle of respect for persons, is informed consent that includes adequate information—understandable in lay terms—to make an informed decision about participation. Some big data sources have explicit opt-in or opt-out consent processes, and use of protected health information for research receives ethical and regulatory oversight from institutional review boards and the Health Insurance Portability and Accountability Act (HIPAA). However, this is not true of data streams that are generated from sources such as social network sites and mobile health devices, because terms of service are usually lengthy and although an individual may technically consent by clicking “I agree,” the consent does not meet typical research criteria. Moreover, commercial uses may not be highlighted or described in a health-literate manner. Kahn, Vayena, and Mastroianni (2014) recommend that “approaches to informed consent must be reconceived for research in the social-computing environment, taking advantage of the technologies available and developing creative solutions that will empower users who participate in research, yield better results, and foster greater trust” (p. 13678).
Methodological rigor is an ethical requirement, not just a scientific one, to ensure that scarce resources are used wisely and decisions are not made based on unsound findings. This requirement, which relates to the principle of beneficence (i.e., optimizing benefits while minimizing risks), remains critical in data science, although the methods vary from traditional research designs (Vayena, Salathe, Madoff, & Brownstein, 2015). Beyond poor methodological rigor, other important perils to beneficence relate to risks associated with loss of confidentiality and commodification of patient/consumer-generated data. Through prosumption, digital exhaust is produced as digital content (e.g., websites, mobile health applications, tweets) and is consumed. In some instances, these data are commodified and used for commercial purposes in which the data producer does not reap the benefit.
The principle of justice gives rise to moral requirements that there be fair procedures and equitable outcomes in the selection of research participants. In regard to big data, the investigator may be using existing data sources that may result in biases against underrepresented groups such as racial and ethnic minorities. For example, Latinos are less likely than Whites or Blacks to use an app for health tracking, and racial and ethnic minorities are less likely to participate in biobanks (Dang et al., 2014; Shaibi, Coletta, Vital, & Mandarino, 2013), thus hindering discoveries that may be of particular relevance to race or ethnicity.
Application of Data Science Analytics
The application of data science analytics to datasets that do not meet the definition of biomedical big data, typically due to lack of diversity and complexity in data types, demonstrates the promise and relevance of these analytics to advance nursing science. For example, using Outcome and Assessment Information Set (OASIS) data from a convenience sample of 270,634 home health care patient records, Dey et al. (2015) applied association analysis to identify sets of variables (patterns) associated with each other and also with mobility outcomes (i.e., improvement versus no improvement as compared to baseline). They found that the patterns of factors discovered through association mining typically comprised combinations of functional and cognitive status and the type and amount of help required at home and provided the foundation for tailored intervention development.
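The core idea of association analysis, finding combinations of characteristics that co-occur with an outcome more often than expected, can be sketched in a few lines. This is a generic illustration of support, confidence, and lift, not the procedure used by Dey et al. (2015). The records and item names are hypothetical, and the function names and thresholds are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical patient records, each a set of characteristics plus one outcome label
records = [
    {"needs_help_bathing", "cognitive_impairment", "no_improvement"},
    {"needs_help_bathing", "cognitive_impairment", "no_improvement"},
    {"needs_help_bathing", "improvement"},
    {"lives_alone", "improvement"},
    {"lives_alone", "cognitive_impairment", "no_improvement"},
    {"needs_help_bathing", "lives_alone", "improvement"},
]
OUTCOME_LABELS = {"improvement", "no_improvement"}

def support(itemset, data):
    # Proportion of records that contain every item in the itemset
    return sum(itemset <= row for row in data) / len(data)

def rules_for_outcome(data, outcome, outcome_labels, min_support=0.3, max_size=2):
    # Evaluate each pattern of up to max_size characteristics against the outcome
    predictors = sorted(set().union(*data) - outcome_labels)
    base_rate = support({outcome}, data)
    rules = []
    for size in range(1, max_size + 1):
        for pattern in combinations(predictors, size):
            s_pattern = support(set(pattern), data)
            s_both = support(set(pattern) | {outcome}, data)
            if s_pattern == 0 or s_both < min_support:
                continue  # pattern absent, or too rare together with the outcome
            confidence = s_both / s_pattern
            rules.append({
                "pattern": pattern,
                "support": s_both,
                "confidence": confidence,
                "lift": confidence / base_rate,
            })
    return sorted(rules, key=lambda r: r["lift"], reverse=True)

rules = rules_for_outcome(records, "no_improvement", OUTCOME_LABELS)
for rule in rules:
    print(rule["pattern"], round(rule["confidence"], 2), round(rule["lift"], 2))
```

A lift above 1 indicates that the outcome is more common among records matching the pattern than in the sample overall. With hundreds of thousands of records and many variables, exhaustive enumeration of this kind becomes impractical, and dedicated frequent-pattern mining algorithms are typically used instead.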
Clancy (Clancy, 2004; Clancy, Effken, & Pesut, 2008) pioneered the application of computational modeling and simulation to examine complex systems related to nursing and health care systems. For example, Clancy et al. (2007) used a variety of electronic data sources, observation of physician work patterns, and computational modeling and simulation to predict the impact of an EHR that integrated heart failure guidelines on practice patterns of physicians and nurses. The results indicated that such an implementation would decrease nursing workload, but increase physician workload in the community hospital setting. Predictive analytics are ubiquitous in the business world and are increasingly being used in the health care setting.
HOW HAVE RESEARCHERS USED DATA SCIENCE TO ADVANCE NURSING SCIENCE?
Although the drivers for data science have only increased in intensity in recent years and continue to do so, there are some examples of application of data science to big data to advance nursing science. The following nurse-scientist-led examples were selected to reflect a variety of settings, data sources, and analytic methods.
Embedded Home Sensor Technology