M. R. Kraft
VA Hospital-Hines, Illinois and Loyola University, Chicago, USA
K. C. Desouza
University of Illinois at Chicago, USA
I. Androwich
Loyola University, Chicago, USA
Copyright © 2003, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
This chapter defines and discusses healthcare data and various healthcare databases as resources for knowledge discovery that can support effectiveness research, quality improvement, and resource allocation. Privacy and confidentiality of health records are addressed along with the dimensions and complexity of information retrieval from healthcare databases and patient health records. The Veterans' Health Administration (VHA) data and databases are specifically addressed. Issues and methods of data preparation for a data mining exploration of a VHA Spinal Cord Injury (SCI) clinical database are presented from a nursing perspective. The potential of using healthcare databases for research is noted.
The healthcare industry faces contradictory pressures of lowering cost and increasing quality of service, both of which require efficient decision-making. These business challenges of healthcare delivery require greater operational efficiencies and the tools necessary to provide real-time access to information. Healthcare facilities have at their disposal vast amounts of data from administrative and clinical databases. Capabilities for data storage have created databases of immense size that can be tapped to generate knowledge. However, the challenge is to extract relevant information from this data and act upon it in a timely manner (Desouza, 2002). Efficient decision-making is a by-product of thorough analysis of available data on a given problem. This chapter discusses the kinds of healthcare databases currently available, including claims, administrative and practice databases. Their potential for use in database research, designed to determine effectiveness of care, improve the quality of care delivered, and improve resource allocation, is presented. The Veterans' Health Administration (VHA) data is specifically addressed, and the issues and methods related to data preparation for a data mining exploration of a VHA Spinal Cord Injury (SCI) clinical database are presented from a nursing perspective.
With the movement of healthcare reimbursement from fee-for-service toward capitation models, healthcare information systems need to be able to detect developing cost, quality and access problems. Payers demand better documentation of care and outcomes. A subtle yet critical issue facing the healthcare industry is the documentation of professional practice and the ability to provide such information across the continuum of care (Orsolits, Davis & Gross, 1988). With the current healthcare system's emphasis on a business model, the application of the four diagnostic information categories identified by Drucker (1999) is appropriate. He identifies foundation information as the basic standard measurements within the industry; productivity information as the measurement of knowledge-based and service work; competence information, which looks at the core competencies of the industry; and resource allocation information, which addresses how both capital and people are allocated throughout the enterprise. These four kinds of information are seen as telling about the current state of the business and directing business tactics, and can all be related directly to healthcare information.
Healthcare providers are confronted daily with constantly changing information needs to manage care. Hersh (1999) indicates that in two of every three patient encounters, the average clinician has unmet information needs, even though managed care and other healthcare system innovations mandate that providers be knowledgeable about the details of patient care (Lewis, 1997). As practice data are computerized, the ability to capture, store, retrieve, organize, and analyze the information of clinical practice can provide information for decision support, enhancement of documentation, and identification of care trends and costs, with the ultimate goal of improved patient care. Generating information and knowledge calls for organizing data into a useful form. For purposes of this paper and within the framework of systems theory, information is defined as organized and processed data that can be communicated, and/or received, and when received, is meaningful and useful to the recipient. The capture and management of healthcare data has developed into the field of medical informatics.
The title, Informatics, is derived from the French word, informatique, and is defined as computer and information science. Informatics consists of a set of technologies that facilitates the representation, management, and manipulation of data, information, and knowledge (Ball, Hannah, Newbold & Douglas, 1995). The field of information science concerned with analysis and dissemination of data through the application of information systems to various aspects of healthcare is known as medical informatics. Seelos (1993) defines the field as, "the science of information processing and the creating of information processing systems in medicine and healthcare delivery." The management component of informatics is the collection, aggregation, organization, movement and representation of information in an efficient, economical, and useful way. The processing component of informatics is the transformation of data or information, usually to a more complex state of organization.
Originally, the term medical informatics was thought adequate to cover all healthcare professions. With progress in the field of healthcare informatics, nursing began to recognize a discrete and unique body of knowledge related to nursing, information processing and information science separate from medicine. This has also occurred in other health disciplines, and it has been suggested that the phrase health informatics should replace medical informatics as the more appropriate umbrella term encompassing all disciplines in the health field (Hannah, Ball & Edwards, 1995).
Healthcare information is extraordinarily complex. "It often contains implicit attributes, internal intricacies, intentional ambiguities and inaccuracies" (Cimino, 1995, p. 780). Southon, Braithewaite & Lorenzi (1997) outline three dimensions in health information: management information, professional information and patient information. They identify overlap and commonalities, but feel that fundamental differences exist in the types of information required for each dimension, the way the information is used, and the way standards are maintained. The achievement of a comprehensive and integrated data structure that can serve the multiple needs of each of these three dimensions is a goal in most healthcare information system development.
The cost of information is usually not stated or, indeed, even known (Blois, 1987), but information represents a large percentage of the healthcare cost structure. The healthcare industry is approaching the same level of investment in information systems found in the banking and finance industry. About one-third of the cost of healthcare in the United States, some 300 billion dollars, represents the cost of capturing, storing, and processing such information as patients' records, physicians' notes, test results, and insurance claims (Evans & Wurster, 1997). The cost of information technology is definitely high, but manual information handling is also expensive (Appleby, 1997). It has been estimated that 25% of hospital cost is spent on information handling, primarily as a means of communication. It is now much easier to access more information at a substantially lower cost. This ability should facilitate more informed choices and better decisions, but with all the increases in "cheaper information," little progress has been made in deriving knowledge from this information. Unfortunately, healthcare faces what Fransman (1996) has described as the "information paradox." Currently, there is a glut of healthcare data, too much for one person to process. There is a cost to the collection of every piece of data, so data collection will require justification. The more precise the data, the more it will cost. Does the information requested increase our knowledge for the benefit of the patient? How much information will we need, want, and be able to afford?
There are indications that the most frequent problem with healthcare information is a lack of availability (Desouza, 2001). Other noted problems are too much information, information in the wrong place, and information that is incomplete, inaccurate, inconsistent, illegible, or difficult to understand.
Patient health records represent comprehensive documents of the continuity of healthcare and are a rich source of data for research. Such data are generally accessible, accurate and relatively inexpensive. The advantages of using healthcare records as a source of information are accurate and timely data, rich clinical detail and dates attached to data elements. Traditionally, the paper record is documented in a "diary" style (Gabrieli, 1990), and includes documentation that produces a defensive legal record. Documentation in health records is assumed to be legally and medically accurate and reliable. Historically, the nursing documentation in the patient's "chart" has been seen as a "transaction log rather than an evolving repository of practice based on nursing knowledge" (Bakken & Constantino, 2001, p. 52). As nursing diagnoses, interventions and patient outcomes are captured, the nursing record becomes a document that records actual nursing practice. This kind of nursing documentation offers the opportunity for study of actual nursing practice and its effects on patient outcomes.
The move to computerized patient record systems (CPRS) has radically changed the traditional healthcare record and how the data and information in the record are used. Sinclair (1990) argues that the computer's assistance, in the form of memory support and decision support, introduces changes in the definition of professional expertise. With computers to filter and/or analyze data, a lack of knowledge of specific facts becomes less important.
There are disadvantages to using healthcare records as a data source. Concerns related to such data are that data are collected as a by-product of some other processes; data are probably collected and entered by many people without any quality check; data may have different structure even within the same database; and missing data may be common (Lange & Jacox, 1993). Too often the data are so disconnected that there is no useful information. Other disadvantages are related to the non-research purpose of the record, the presence of selective information, the need for interpretation of certain information in the record, and the difficulty with data verification (Krowchuk, Moore & Richardson, 1995). Although concern has been expressed about the reliability and validity of health record data, most investigators operate on the premise that healthcare records provide fairly accurate information. Identified factors that influence reliability and validity of health record data include the clinical competence of the recorder, patient cooperation, the type of provider and setting of care, situational factors, and the type of data collected and recorded (Aaronson & Burman, 1994). These authors suggest that the reliability of health records can be assessed by the consistency with which a single recorder documents specific information, and by having the same data recorded by two different providers.
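The two-recorder check described above can be made concrete with a chance-corrected agreement statistic. The sketch below uses Cohen's kappa, which is our choice for illustration and is not named by Aaronson & Burman (1994); the function name, the sample observations, and the "pressure ulcer" item are all hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two recorders coding the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both recorders must code the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two recorders agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each recorder's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both recorders used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical data: presence of a pressure ulcer as charted by two providers.
nurse = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
physician = ["yes", "no", "yes", "yes", "no", "no", "no", "no"]
kappa = cohen_kappa(nurse, physician)
print(f"kappa = {kappa:.2f}")
```

Raw percent agreement would overstate reliability when one category dominates the record, which is why a chance-corrected measure is used in this sketch. The same function applied to one recorder's entries at two points in time gives a rough view of single-recorder consistency.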
Confidentiality is an emerging problem in computerized clinical data sets. Technology has encouraged the accumulation of an unlimited quantity of healthcare data, but has also created a resurgence of controversy in the issues of privacy and confidentiality (Rittman & Gorman, 1992; Romano, 1987). The development of electronic databases has raised the concern of a patient's right to privacy as compared to potential societal benefits of the use of the data for quality improvement and effectiveness research (Gostin, 1997). Styffe (1997) addressed the unique meaning of privacy, confidentiality, and security as related to patient data in clinical information systems. Individuals are concerned with privacy as their right to determine when, how, where, and to what extent their information is transmitted. Confidentiality is the concern of healthcare providers and organizations and, according to Styffe (1997), is the "trust placed that information shared will be respected and used only for the purpose disclosed." It is based on the relationship between the person disclosing and the person receiving information. Security is built into clinical information systems and addresses the levels of authorization necessary for access to data and information. Computer security involves the protection of data against accidental or intentional disclosure to unauthorized persons.
Data ownership is a critical issue. Questions about when patient consent is necessary to access years of patient information need to be addressed. Must one freely consent to being included in a database? Historically, retrospective data access has not required patient consent, but as databases increase, the appropriate use of years of patient information raises ethical concerns (Peck et al., 1997). Permission to use data beyond the original intent has rarely been obtained explicitly (Wuerker, 1997). McArt & McDougal (1985) stated that, "permission by subjects to use data collected on them is generally not required in a secondary data analysis unless additional data are sought or subjects may be considered at risk because of the sensitivity of the data collected" (p. 56). Fiesta (1996) suggests that from a legal standpoint, widespread usage will eventually establish a uniform standard of database usage. Some advocates suggest that research access to patient information should be limited to "epidemiological data," but only if the linkage to individual records is removed (Wechsler, 1996).
New rules to protect the privacy of all medical records under the Health Insurance Portability and Accountability Act (HIPAA) become effective in December 2002, and will require written consent from patients before any disclosure of medical information, even for routine actions like insurance billing (Childs, 2001). HIPAA compliance will be mandatory and healthcare facilities, providers, and payers are currently developing implementation plans. Prior to HIPAA, permission to use data beyond the original intent has rarely been obtained explicitly. Human subject committees will have to determine whether use of data represents a threat to confidentiality and, if risk is high, subjects may have to be re-contacted to get permission to use their personal health data for research and quality improvement initiatives.
All healthcare professionals must face the issue of a possible breach in information security. Such an event is serious enough that in addition to legal liability, the healthcare professional may suffer a loss of information necessary for practice (Sardinas & Muldoon, 1998). The Code of Ethics for Nurses (ANA, 1985) stresses that when patient anonymity cannot be guaranteed, client consent for use of medical records must be obtained.
It has been estimated that the amount of data in the world doubles every twenty months (Frawley, Piatetsky-Shapiro & Matheus, 1992). The information available on the Internet is thought to double in less than nine months (Turban & Aronson, 2001). Information gathering capacity has increased exponentially (Tan & Sheps, 1999) and, as a result, much transactional data are moved into databases for storage. Currently, databases are measured in gigabytes and terabytes. A terabyte has been described as the equivalent of two million books (Hedberg, 1995). As the amount of collected data grows, the need to efficiently analyze data also increases. Massive amounts of data remain largely unexplored and may even be on the verge of being discarded.
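To give a sense of scale, the doubling times quoted above can be converted into growth factors. This is a minimal arithmetic sketch that assumes steady exponential growth, which the cited estimates do not claim; the function names are ours.

```python
def growth_factor(elapsed_months, doubling_months):
    """Multiple by which a quantity grows, assuming a constant doubling time."""
    return 2 ** (elapsed_months / doubling_months)

def annual_growth_rate(doubling_months):
    """Fractional increase over twelve months for a given doubling time."""
    return growth_factor(12, doubling_months) - 1

# Twenty-month doubling (world data) and nine-month doubling (Internet).
# Nine months is an upper bound on the doubling time, so the Internet
# figure is a lower bound on the growth rate.
for label, months in (("world data", 20), ("Internet", 9)):
    print(f"{label}: {annual_growth_rate(months):.0%} per year, "
          f"x{growth_factor(60, months):.0f} in five years")
```

Under these assumptions, a twenty-month doubling time corresponds to roughly 50% growth per year and an eightfold increase over five years.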
A database can be defined as "any structured representation of data which describes a subset of the real world" (Newbold, 1993). A database is an organized repository of data: a collection of interrelated records; any set of files subject to manipulation by a common database management system (DBMS). A database must meet standards that ensure comprehensiveness through inclusion of core criteria and accuracy of data while avoiding redundancy. Data quality and integrity must be maintained and security must be provided. Databases must be planned to facilitate access to an integrated collection of data for multiple users and multiple applications and must also interface with future advances in technology. There are several kinds of databases that can be important in providing information for healthcare providers. They include claims databases used for billing, administrative databases that usually include some details of medical care, and practice databases built from patient records or data from specific clinical departments, such as radiology, laboratory, and pharmacy. Healthcare databases may also include disease specific information, such as cancer or SCI registries.
Claims databases are compiled by third party payers, quality assurance organizations and the government, and are used primarily for billing Medicare and other third-party payers. Generally the claims databases are quite large, but lack the richness of detail found in practice databases (Tierney & McDonald, 1991).
Administrative databases store financial and administrative data used in facility/system management and include patient demographics, as well as coding for Diagnosis Related Groups (DRGs), International Classification of Diseases (ICD-9) and Current Procedural Terminology (CPT), which reflect some areas of medical care (Wray, Ashton, Kuykendall & Hollingsworth, 1995). The Medicare database is an administrative database containing information on the utilization of covered medical services, diagnoses, episodes of illness and the Medicare-covered costs of healthcare for more than 35 million beneficiaries (McPhillips, 1991). This database has been linked to several national surveys on aging and provides a valuable source of information on health services and aging (Lillard & Farmer, 1997). Cowper, Hynes, Kubal & Murphy (1999) recommend that outcomes researchers use administrative databases as either the principal source of data or as a supplement to other primary data collection. Ray (1997) suggests that administrative databases are a potentially useful source of data for retrospective studies. According to Palmer (1997), the increasing availability of clinical details in large healthcare databases makes comprehensive process-based measures of quality more possible. Such databases are now being used to measure quality of care over time and across institutions, and Davidoff (1997) predicts that databases will be powerful tools for quality improvement in healthcare. Currently most regional database explorations start with administrative databases built from reimbursement transactions. The disadvantages of using administrative databases stem primarily from the fact that these systems were not created for research purposes and there was no research input into the design and types of information to be collected.
Historically, with few exceptions, developers of national data sets have not considered practice or operational systems as content sources for their data sets (McDonald, Overhage, Dexter, Takesue & Dwyer, 1997). Iezzoni (1997) believes that in the future, administrative and clinical data will have less distinctive boundaries and both data types will be utilized in regional efforts to assess quality in healthcare services.
Healthcare institutions compile practice databases with data generated from the delivery of care to patients. This data comes from a variety of institutional sources, such as the laboratory, pharmacy, radiology, and medical records. These practice databases may vary widely in content but are considered as representative of accurate and timely clinical data (Tierney & McDonald, 1991). Although most practice databases are established to augment clinical care through storage and transmission of data, they have significant potential for research. If stored properly and if accessible to the researcher, practice databases can provide input for data mining systems (Psomas, Schaufele & Madhaven, 2000). Practice databases have been seen as a valuable source in meeting the needs of healthcare systems for epidemiological information (Pringle & Hobbs, 1991). Because a database represents real world relationships, it may be used in a predictive way in health services research. Using primary care databases as research tools does identify the possibility of major problems with incomplete recording; data are of little use unless high quality data are collected.
Lazaridis (1997) presents the mantra of "reduce, reuse, and recycle" as appropriate for reducing the need for expensive and difficult clinical trials by reusing data already available in existing databases which can be recycled into products (information) not envisioned when data were initially collected. He adds the fourth R of responsibility, stressing the need to consider the legal and ethical implications of using data in ways not originally intended. Temple (1990) urges caution in using databases to assess effectiveness. He lists as an area of concern the fact that such evaluations are always retrospective and unblinded, with high potential for patient selection bias and analyst bias. Temple recommended that every "startling" finding in database effectiveness research should be subjected to a controlled trial. Rubin (1997) states that the presence of observational rather than experimental data is a complication of large database research. Ray (1997) believes the gold standard for evidence of efficacy, safety, and cost-effectiveness is a randomized controlled trial, but recognizes that retrospective studies have been the primary tool used to evaluate policy and program changes. Hlatky et al. (1984) propose the use of an observational database as complementary to the randomized controlled trial in assessing the efficacy of therapy. They present the primary purpose of an observational database as the collection and distillation of "accumulated clinical experience to make accurate predictions for individual patients" (p. 375).
Threats to validity inherent in large databases include sampling and measurement errors. Sampling errors are the result of the selection of cases, and measurement errors develop as the result of problems with operational definitions of concepts. Because databases exist over long periods of time, reliability threats are created by such things as clerical error, subtle changes in data collection techniques with improved diagnostic skills, and the instrumentation used to collect data. Appraisal of data includes consideration of accuracy, representativeness, authorship and authenticity (Reed, 1992). The validity of conclusions in research depends partly on the completeness and accuracy of the data. Incomplete data is meaningless. Both random and systematic errors can occur in data collection and management. Such errors may be identified with measures of frequency, central tendency, range, and dispersion. Knowing the data well can help in the identification and resolution of potential errors (Roberts, Anthony, Madigan & Chen, 1997). Rather than accepting data at face value, the researcher must consider all potential limitations. Ray (1997) identifies the major problems with retrospective data studies as poor data quality, lack of concurrent controls, an inability to ascertain essential study outcomes, and incomplete or missing data. Data problems are seen as the primary explanation for outliers. Another identified problem is related to the potential for coding errors (Cowper, Hynes, Kubal & Murphy, 1999).
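The descriptive checks named above (frequency, central tendency, range, and dispersion) are easy to run before any analysis. The following is a minimal sketch using only the Python standard library; the field values are invented, and the 1.5 x interquartile-range rule used to flag suspect values is a common convention that we assume here, not one prescribed by the cited authors.

```python
import statistics
from collections import Counter

def profile_categorical(values):
    """Frequency table, with missing entries counted as their own category."""
    return Counter("<missing>" if v is None else v for v in values)

def profile_numeric(values):
    """Central tendency, range, dispersion, and values worth a second look."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.stdev(present),
        "suspect": [v for v in present if v < low or v > high],
    }

# Hypothetical extract: patient ages and level-of-injury codes.
ages = [34, 41, 29, 52, None, 47, 38, 430, 45, 36]
levels = ["C5", "C5", "T4", None, "c5", "T4"]

report = profile_numeric(ages)
print(report["missing"], report["median"], report["suspect"])
print(profile_categorical(levels))
```

In this invented extract the age of 430 is a likely keying error, and the frequency table exposes "c5" and "C5" as two spellings of one code. Flagged values still need review by someone who knows the data, since an outlier may be a genuine observation.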
Databases have been used in research for observational studies to improve the conduct of research through protocol adherence, subgroup targeting, collaboration, data collection and methodology; and in process research studies and hypothesis research studies (Pryor et al., 1985). Databases are ideally suited for observational descriptions. These descriptions can be as simple as frequencies or as complex as statistical analyses of co-variates. Lange & Jacox (1993) identified interest in using clinical and administrative healthcare databases for health policy research because of national concern about the quality, cost, and outcomes of healthcare. In reviewing the possibilities of research using large databases, they list several policy-related issues that could be addressed by nurse researchers:
"A large database represents a resource in which the behavior of analytic techniques can be studied and compared with alternative strategies" (Pryor et al., 1985, p. 639). The Cardiovascular Disease Databank at Duke University Medical Center with information on over 9,000 patients is the observational databank discussed by these researchers. This databank is structured through four essential elements: complete prospective data collection; regular follow-up of patients; close collaboration among clinicians, researchers and other multidisciplinary team members; and careful use of multivariable statistical methods. The Duke Cardiovascular Databank has been used to determine factors that influence the prognosis of patients with coronary artery disease and to develop a prognostic model for prediction of outcomes. Another example of database research is the use of the American Rheumatism Association Medical Information system (ARAMIS) to study patients with rheumatoid arthritis resulting in an increased understanding of the clinical course of the disease, frequency of hospital admissions, causes of morbidity, risk factors for morbidity, and variations in treatment costs (Tierney & McDonald, 1991). Other database studies include the use of artificial intelligence (AI) for managed care modeling (Borok, 1997), hospital infection control (Brossette et al., 1998), the development of a prediction model for premature birth (Goodwin et al., 1997) and Medicare fraud (Milley, 2000).
When the Omnibus Budget Reconciliation Act (OBRA) (Public Law 101–239) created the Agency for Healthcare Policy and Research (AHCPR) in 1989, the bill mandated a report on the feasibility of linking research related data with data collected by both Federal and non-Federal agencies as the basis for medical effectiveness research (U.S. DHHS, 1991). Effectiveness research uses epidemiological methods to examine large databases. Such research requires databases with large numbers of cases and standardized reliable data elements. Potential data sources identified were Department of Defense (DOD) data, Veterans' Health Administration (VHA) data, and data collected by individual states in the Uniform Hospital Discharge Data Set (UHDDS). Areas of concern identified in moving to database linkages included data accessibility, data standards, vocabulary standards, data processing standards, security standards, and the maintenance of confidentiality. A recent partnership between the Agency for Healthcare Research and Quality (AHRQ), formerly the AHCPR, and public and private statewide data organizations has led to the Healthcare Cost and Utilization Project (HCUP), a "family of databases" that includes multi-state inpatient and outpatient discharge records with data elements covering demographic, clinical, diagnostic, and procedural information (Steiner, Elixhauser, & Schnaier, 2002). This project is seen as the largest collection of all-payer, administrative data that allows longitudinal population-based studies. The research potential of HCUP is enhanced with linkages to other databases such as the American Hospital Association (AHA) annual survey.
The Veterans' Health Administration (VHA), part of the Department of Veterans Affairs (DVA), represents the largest healthcare system in the United States. VHA utilizes an internally developed health information system (HIS), originally called the Decentralized Hospital Computer Program (DHCP), which has evolved into the current Veterans Health Information Systems and Technology Architecture (VistA). VHA data have been collected electronically since 1980 (Kolodner, 1997) and are stored in a centralized repository that now represents "fairly comprehensive, patient-level inpatient and outpatient data on healthcare utilization of all patients receiving care in the VHA" (Murphy, Cowper, Seppala, Stroupe, & Hynes, 2002, pp. 7–8). Data elements have been added and subtracted over this time period, and some elements are not available for every year, but years of data collection and the use of encrypted patient identifiers do allow for longitudinal studies. The Quality Enhancement Research Initiative (QUERI) within the VA health system has identified data needs, data weaknesses, and data availability as part of its quality improvement program, which takes a disease-management approach (Hynes, Cowper, Kerr, Kubal & Murphy, 2000). In 1998, the VHA founded the VA Information Resource Center (VIReC) to serve as a database and informatics resource and referral center to researchers and others who use VA information systems for research concerning the care needs of veterans served in the VHA. VIReC staff provide consultation about access to and utilization of both VA and non-VA databases for research, management, and quality improvement applications.
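The role that protected identifiers play in longitudinal work can be illustrated in a few lines. The sketch below does not reproduce the VHA's scheme, which the sources cited here do not describe; it assumes a keyed hash (HMAC-SHA256) as a stand-in, and the key, identifiers, and encounter records are all hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical key; in practice it would be held separately from the data.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(patient_id, key=SECRET_KEY):
    """Map an identifier to a stable pseudonym that hides the original value."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()

def link_encounters(encounters, key=SECRET_KEY):
    """Group (patient_id, year, event) rows by pseudonym, in time order."""
    history = defaultdict(list)
    for patient_id, year, event in encounters:
        history[pseudonymize(patient_id, key)].append((year, event))
    for events in history.values():
        events.sort()
    return dict(history)

# Invented encounters from different years and care settings.
encounters = [
    ("000-00-0001", 1999, "outpatient visit"),
    ("000-00-0002", 2000, "inpatient stay"),
    ("000-00-0001", 1997, "inpatient stay"),
]
linked = link_encounters(encounters)
for pseudonym, events in linked.items():
    print(pseudonym[:12], events)
```

Because the same input always yields the same pseudonym, one patient's records can be followed across years without the analyst seeing the original identifier. Protection in this sketch rests entirely on keeping the key secret.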
It should be noted that not all transactional data captured in VistA are stored in the central repository. Patient acuity from nursing files is stored, but other nursing data elements such as nursing diagnoses and interventions are not.
The collection of data into large databases has led to the creation of data warehouses for the storage of multiple databases and the recognition that potentially valuable data warehouse contents should be analyzed. Thus, the need for data mining is demonstrated. The data warehouse is a composite of hardware, middleware, databases, warehousing tools, and software (Marietti, 1997). Data warehousing is a process and the data warehouse is the location in which the process takes place. Typically, in a data warehouse, data input comes from a variety of sources in a variety of formats (DeJesus, 1999). Within the warehouse, data is cleansed of extraneous and erroneous material and then transformed into a common format for availability to the end user. There are three different functional areas in a data warehouse, and each is customized to meet specific business or facility needs. First is data acquisition or the handling of data from a variety of sources that must be identified, copied, formatted, cleansed, normalized, audited, and prepared for loading. Second is data storage and archiving. The third component of a warehouse is that of data access. Data access requires an assortment of products including intelligent agents, query capability, statistical analysis, data dictionaries, data discovery, on-line analytical processing (OLAP), and data visualization (Fletcher, 1997; Mattison, 1996). Marietti (1997) considers the heart of the data warehouse to be the analytical database. The benefits of data warehousing include immediate information delivery, the ability to do trend and outcome analysis, query and report capabilities, and the ability to integrate data from multiple sources (Shams & Farishta, 2001). DeJesus (1999) suggests the data warehouse as a useful tool for healthcare in disease management and the prediction of at-risk populations, if warehouse contents are analyzed and used in decision-making.
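The data acquisition function described above (formatting, cleansing, normalizing, and auditing records from several sources) can be sketched in miniature. Everything in this example is assumed for illustration: the two feeds, the field names, the date formats, and the rule that a record needs both a patient code and a valid admission date.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed formats used by the sources

def parse_date(text):
    """Return a date for any recognized format, or None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def acquire(records):
    """Normalize records to one format; set aside those that fail checks."""
    loaded, rejected, seen = [], [], set()
    for raw in records:
        patient = (raw.get("patient") or "").strip().upper()
        admitted = parse_date(raw.get("admitted") or "")
        if not patient or admitted is None:
            rejected.append(raw)  # kept for audit rather than silently dropped
            continue
        key = (patient, admitted)
        if key in seen:
            continue  # same admission already loaded from another source
        seen.add(key)
        loaded.append({"patient": patient, "admitted": admitted.isoformat()})
    return loaded, rejected

# Two hypothetical source systems describing overlapping admissions.
lab_feed = [{"patient": " p001 ", "admitted": "2001-03-15"}]
pharmacy_feed = [
    {"patient": "P001", "admitted": "03/15/2001"},
    {"patient": "P002", "admitted": "not recorded"},
]
loaded, rejected = acquire(lab_feed + pharmacy_feed)
print(loaded)
print(rejected)
```

Retaining rejected records in a separate list reflects the auditing step: erroneous material is removed from the load but remains available for review.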
Large databases have emerged as attractive but controversial sources of information (Johnson, 1999; Matchar, et al., 1997; Wuerker, 1997). As the body of data collected grows in size and complexity, there is a resulting consensus that significant untapped knowledge lies hidden in many large databases. Data mining and knowledge discovery in databases (KDD) relate to the process of extracting valid, previously unknown and potentially useful patterns and information from raw data in large databases (Biswas, Weinberg & Fisher, 1998; Frawley, Piatetsky-Shapiro & Matheus, 1992; Kiel, 2000; Lingras & Yao, 1998; Simoudis, 1998). KDD shares much with statistical and exploratory data analysis in terms of statistical procedures for modeling data and handling noise. The extraction of information or knowledge from large databases is closely related to exploratory data analysis. Often, data mining and KDD are treated as synonyms and refer to the whole process of moving from data to knowledge (Raghavan, Deogun & Sever, 1998). "A primary goal of knowledge discovery is the interpretation of discovered concepts in the context of domain knowledge" (Biswas, et al., 1998, p. 224). The explosion of data and the development of large databases have led to the creation of data warehouses and recognition of the need for data mining. Surviving the information explosion means not only knowing how to classify and access information, but also how to apply it. Neural networks have become one way to organize the increased information in a way that makes it relevant in the context in which decisions are made (Tan & Sheps, 1998).
Data mining is "an interdisciplinary field bringing techniques of machine learning, pattern recognition, statistics, databases, and data visualization to address the issues of information extraction from large databases" (Cabena, Hadjinian, Stadler, Verhees & Zanasi, 1998, p. ix). The analogy of "mining" suggests the sifting through of large amounts of low-grade ore (data) to find something valuable: information (Psomus, Schaufele & Madhaven, 2000). Data mining is a multi-step, iterative, inductive process (Cabena, et al., 1998; Gerber, 1998) for deriving useful knowledge from real-world databases through the application of pattern extraction techniques (Raghavan, et al., 1998, p. 402). It includes such tasks as problem analysis, data extraction, data preparation and cleaning, data reduction, rule development, and output analysis and review (Darling, 1998; Gilman, 1997; McDonald, Brossette & Moser, 1998). Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques (Gartner Group, 1999).
Data mining has emerged as one of the powerful techniques for extraction of useful information from databases (Kostoff & Geisler, 1999). Large amounts of data can be explored to uncover previously unknown patterns that may include surprising patterns of relationship that might not have been otherwise found. Initial applications of data mining were in business and industry and data mining is seen as an essential analytical skill in the business community (SPSS, 2000). However, a number of published studies now address the value of data mining within the healthcare industry. These studies have looked at such varied issues as infection patterns (Brossette, et al., 1998), Medicare fraud (Milley, 2000), the prediction of premature births (Goodwin, et al., 1997) and the prediction of hospitalization of long-term care patients (Abbot, Quirolgico, Marchand, Canfield & Adya, 1998).
Database mining has been called the "confluence of machine learning techniques and the performance emphasis of database technology" (Agrawal, Imielinski & Swami, 1998, p. 16). Because data mining involves retrospective analyses of data, experimental design is considered outside the scope of data mining. Data mining looks through an entire database to find trends, patterns, and relationships that may not have otherwise been noticed (Rudin, 1996). The process of implementing data mining generally begins with the selection and preparation of data to be mined. Data is then qualified using cluster and/or feature analysis to reduce the complexity of data management. Next is the selection and application of a data mining tool. After the data has been mined, analysis is done, and the final step of the data mining process is the application of knowledge discovered (Gerber, 1998). "The greatest obstacle in locating potentially useful patterns in data is the likelihood that the database wasn't constructed with discovery processes in mind" (Norton, 2000).
Steps for KDD are goal setting, selection of data, preprocessing (cleansing data of noise and errors, developing procedures to account for missing data, developing naming conventions), transformation (reducing data by finding features that can represent several elements of the data), mining, and interpretation/evaluation (data visualization). Data visualization is an invaluable counterpart to data mining. Visualization includes displays of trends, clusters and differences (Gray et al., 1996). Data mining algorithms allow for interpretation and prediction analysis based on information in databases (Schulman, 1998). Data mining has the ability to take information and go beyond stating what was into the realm of predicting what could be.
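The KDD steps listed above can be pictured as a chain of small functions, each feeding the next. The following is only a minimal sketch under stated assumptions: the record fields (`unit`, `diagnosis`), the toy records, and every function name are hypothetical, and the "mining" step is reduced to a simple frequency count for illustration.

```python
from collections import Counter

def select(records, unit):
    """Selection: keep only records from the unit of interest."""
    return [r for r in records if r.get("unit") == unit]

def preprocess(records):
    """Preprocessing: drop records with a missing diagnosis, trim and lowercase the rest."""
    cleaned = []
    for r in records:
        dx = (r.get("diagnosis") or "").strip()
        if dx:
            cleaned.append({**r, "diagnosis": dx.lower()})
    return cleaned

def transform(records):
    """Transformation: reduce each record to the single feature being mined."""
    return [r["diagnosis"] for r in records]

def mine(features):
    """Mining (toy stand-in): count how often each feature value occurs."""
    return Counter(features)

def interpret(counts, top=3):
    """Interpretation: report the most frequent values for expert review."""
    return counts.most_common(top)

# Hypothetical toy records, not study data
records = [
    {"unit": "SCI", "diagnosis": " Pain "},
    {"unit": "SCI", "diagnosis": "pain"},
    {"unit": "SCI", "diagnosis": None},
    {"unit": "SCI", "diagnosis": "Constipation"},
    {"unit": "ICU", "diagnosis": "pain"},
]

result = interpret(mine(transform(preprocess(select(records, "SCI")))))
```

In practice each stage would be far more involved, and the process is iterative rather than a single pass; the sketch only shows how the output of one step becomes the input of the next.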
John (1997) suggests that two types of patterns are discovered with data mining: predictive and informative. Predictive patterns represent an educated guess about the value of an unknown attribute, given the values of other known attributes. Pattern analysis can be defined as the examination of the configuration of relationships of the elements of phenomena. Informative patterns present interesting patterns that provide new insight to a domain expert. The value of informative patterns lies in whether actions are suggested to the domain expert and whether suggested actions are effective. Interestingness of a pattern is a measure used to determine whether to discard, or keep and explore a pattern further. "Data miners are often more interested in understandability than accuracy of predictability" (Glymour, Madigan, Pregibon & Smyth, 1996, p. 15). Mills (1991) suggests that patterns may reveal enabling predictions and thus generate hypotheses for further investigation.
Four factors leading to the accessibility of data for decision-making are: the incredible increase in computing power with expanded computational speed; the accumulation of large amounts of data; the advancement of methods to benefit from data modeling, without requiring a detailed knowledge of statistical concepts; and the visual nature of current generation data modeling software (Danziger, 1997). Computerization of data does not make up for bad data, but once data has been cleaned, the analysis of vast amounts of data may identify potentially important relationships that do not emerge from sparse data. The analyst must formulate a query to extract data from a database, extract the aggregated data, visualize the results in a graphical way, and analyze the results. The process of analysis requires domain knowledge.
Requests to perform pattern extraction tasks are queries. There are several classes of queries in data mining: hypothesis testing, generalization, classification, characterization, association, and clustering (Raghavan, et al., 1998). Classification, e.g., discrimination, identification, recognition, implies decision making or response selection of some kind based on a system of rules that partitions data into groups (Balakrishnan & Ratcliff, 1996, p. 615). Hypothesis testing queries do not explicitly discover patterns within data, but receive as input a stated hypothesis that is then evaluated against a selected database. The hypothesis is a conjecture about the existence of a specific pattern within the database and the goal is verification of the hypothesis being tested. Data mining without a preconceived hypothesis is discovery driven. Operations of discovery-driven data mining include creating prediction and classification models, analyzing links, segmenting databases, and detecting deviation (Simoudis, 1998). These operations are supported by a variety of techniques, including predictive modeling, supervised induction, association discovery, sequence discovery, conceptual clustering, and visualization.
Commonly used techniques in data mining are artificial neural networks, decision trees, genetic algorithms, the "nearest neighbor" method, and rule induction. Classification queries use decision variables or examples to partition data into subclasses. Characterization queries derive common features of a class regardless of the characteristics of other classes. An association query discovers associations among values grouped by selection phrase with a user specified minimum support requirement. Combinations of rules are found within a pre-set confidence factor that specific associations occur. Clustering queries partition data of a relational table with members of each cluster sharing a number of properties. Five common types of information yielded by data mining are: association, sequences, classifications, clusters, and forecasting. Associations happen when occurrences are linked to a single event. Sequences are events linked over time. Classification recognizes patterns that describe the group to which an item belongs. Clustering is related to classification but differs in that no groups have yet been defined and mining discovers different groupings within data. Data mining is almost always used in conjunction with traditional data analysis techniques. Themes of modern statistics of fundamental importance to data miners are clarity about goals, appropriate reliability assessment and accounting for sources of uncertainty. The convergence of statistics and data mining is developing a promising research area.
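To make the idea of an association query with a minimum support requirement and a confidence factor more concrete, the sketch below computes both measures for single-item rules. It is an illustration only: the admissions, the category names attached to them, and the function names are hypothetical, and the brute-force pairing stands in for the more efficient algorithms that real data mining tools use.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

def pair_rules(transactions, min_support, min_confidence):
    """List (antecedent, consequent, support, confidence) for single-item rules meeting both thresholds."""
    items = sorted({i for t in transactions for i in t})
    rules = []
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):
            s = support(transactions, {x, y})
            c = confidence(transactions, {x}, {y})
            if s >= min_support and c >= min_confidence:
                rules.append((x, y, s, c))
    return rules

# Hypothetical admissions, each recorded as a set of diagnostic categories
admissions = [
    {"Skin Care", "Elimination"},
    {"Skin Care", "Elimination", "Mobility"},
    {"Skin Care", "Mobility"},
    {"Elimination"},
]

rules = pair_rules(admissions, min_support=0.5, min_confidence=0.9)
```

With these toy admissions, only the rule "Mobility implies Skin Care" meets both thresholds: the pair appears in half of the admissions, and every admission with Mobility also has Skin Care.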
There is little evidence that nurse researchers seek aggregate patient data that might reveal trends and patterns among patients with similar situations or treatments. Such information can be useful in understanding patterns and in predicting patients' responses to conditions and interventions. Information systems can be designed to aggregate such data and present it in a variety of formats. Exploration of nursing data elements within a spinal cord injury (SCI) database was proposed as a mechanism to help in the identification of major phenomena basic to SCI nursing care. Utilization of information in SCI databases may be a means of bringing more focused and appropriate care to SCI individuals who, as consumers of significant costly care resources, are "outliers" in the healthcare system (Lincoln & Builder, 1999). Patients identified as outliers are those whose annual care costs far exceed normally expected healthcare costs. In our rapidly changing healthcare system, it is important to know aggregate costs of SCI to ensure that adequate funds are allotted for care of the SCI population. Although SCI occurs much less frequently than other types of injury and debilitating disease, the cost of SCI to individuals and to society is staggering. Berkowitz, O'Leary, Kruse & Harvey (1998) estimate that SCI costs the nation more than 9.7 billion dollars per year. Direct care costs within the first year of injury average $223,261, with an additional annual cost for SCI care of at least $26,000. Equipment, supplies, medications, and environmental modification costs increase both figures. Indirect costs related to loss of income and productivity are more difficult to compute, with consideration given to age at injury and earning potential, but indirect cost estimates can be projected as significant. The aggregate annual direct and indirect costs of new cases of SCI may be between 7.2 and 9.5 billion dollars (Berkowitz et al., 1998).
This database analysis uses data mining techniques to determine if there are patterns of patient needs, nursing diagnoses, nursing interventions, and patient outcomes that can contribute information that can improve the efficiency and effectiveness of the delivery of SCI nursing care. Analysis may demonstrate that information patterns related to the presence of specific nursing diagnoses and the choice of specific nursing interventions that promote desired outcomes can be used to allocate resources for SCI care delivery. The application of the data mining process to this SCI clinical database may determine that this research method can lead to a better understanding of how to use data to improve SCI nursing practice.
The setting for this study is a large tertiary care Veterans' Health Administration (VHA) hospital located on a 62-acre campus within the metropolitan Chicago area. The Veterans Administration (VA) is involved in the full continuum of SCI care and has the largest single network of SCI care in the nation (DVA, 2000). This hospital has two acute rehabilitation/continuing care inpatient SCI units with a total of 68 beds, a hospital-based SCI home care program, and a 30-bed residential SCI unit. The hospital uses the national VA hospital information system (HIS) known as the Veterans Health Information Systems and Technology Architecture (VistA). VistA, one of the most extensive hospital information systems in the world, is an internally developed, comprehensive integrated system that provides for both administrative and clinical support and documentation of care. Over a 20-year period, VistA has evolved to include over 70 applications, as well as numerous links to commercial products. VA software is written in MUMPS (Mass General Utility Multi-Programming System), an ANSI (American National Standards Institute) programming language now called "M" (Kolodner, 1997).
The modular design of the VA nursing software within VistA allows computerization of data for clinical, administrative, research, and educational purposes, as well as quality improvement (Vance, Gillian-Storm, Kraft, Lang & Mead, 1997; Vance, Kraft & Lang, 1998). The data collection system of the VA nursing software incorporates the elements of the nursing minimum data set (NMDS) as defined by Werley and others (Werley, Devine & Zorn, 1990). The NMDS standardizes the items of essential core nursing data for collection, storage, and retrieval. It includes 16 elements categorized into three broad groups: nursing care, client demographics, and service (Werley & Leske, 1991). These elements represent data used on a regular basis by nurses in any setting where nursing care is provided, and are considered necessary for the analysis of nursing practice and its impact on outcomes and cost effective care. The goal of the NMDS is to provide for comparability of nursing data across clinical populations, settings, and geographic areas. The specific nursing care elements in the NMDS are nursing diagnoses, nursing interventions, outcomes and intensity of nursing care. Most computerized nursing information systems (NISs) now utilize the NMDS as the framework for data capture.
The VA nursing database for patient health problems is built on the North American Nursing Diagnosis Association (NANDA) taxonomy, and the care planning process of diagnosis, intervention, and outcome reflects the nursing process. Nursing diagnoses provide a common language within the profession, which can enhance communication between nursing clinicians, improve continuity of care, help formulate expected outcomes, assist in addressing cost-effectiveness of care, and allow emphasis on clinical nursing research. Nursing diagnoses have been recognized as the nursing equivalent of Diagnosis Related Groups (DRGs). The use of nursing diagnosis increases the possibility of giving comprehensive care by identification, validation, and documentation of response to specific health concerns. Nursing diagnosis allows clinicians to describe nursing practice within a shared framework.
Permission to use the VHA SCI database for this study was obtained from the facility's institutional review board (IRB), which includes the Human Studies Subcommittee (HSS) of the Research and Development (R&D) Committee and the R&D Committee itself. Since there were no interventions and no direct contact with patients, the facility IRB gave an expedited review approval. IRB approval for the study was also obtained from the Institutional Review Board of Loyola University. Confidentiality for this study was maintained by using the internal VA patient coding to download data. This data was immediately recoded by the investigator to remove all possibility of patient identification.
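The chapter does not describe how the recoding was carried out. Purely as an illustration of the general idea, the sketch below replaces each internal patient code with a random study identifier while keeping all of a patient's records linked to one another. The field names, the sample rows, and the function `recode_ids` are hypothetical and are not the investigators' procedure.

```python
import secrets

def recode_ids(records, id_field="patient_code"):
    """Return copies of the records with each distinct identifier replaced by a random study ID.

    The same original identifier always maps to the same study ID within one call,
    and the mapping itself is discarded when the function returns.
    """
    mapping = {}
    used = set()
    recoded = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            study_id = "S" + secrets.token_hex(4)
            while study_id in used:  # regenerate on the rare chance of a collision
                study_id = "S" + secrets.token_hex(4)
            used.add(study_id)
            mapping[original] = study_id
        new_record = dict(record)  # leave the input record untouched
        new_record[id_field] = mapping[original]
        recoded.append(new_record)
    return recoded

# Hypothetical rows, not study data
rows = [
    {"patient_code": "A1", "dx": "pain"},
    {"patient_code": "B2", "dx": "constipation"},
    {"patient_code": "A1", "dx": "impaired mobility"},
]

recoded_rows = recode_ids(rows)
```

Replacing identifiers is only one part of protecting confidentiality; dates, rare diagnoses, and other fields can still allow identification and would need separate review.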
The 525 patients with 1,107 admissions to the study unit between July 1989 and June 2000 became the study sample. The list of admissions to the study unit was downloaded from an ORACLE mainframe database built through nightly data extracts from VistA. After identification of the study population, nursing diagnoses and interventions selected for these patient encounters were identified using an identification and ranking query that is part of the VistA nursing software. Since the nursing data elements of interest in this study are not included in the VA national data warehouse, this data was downloaded directly from the operational database to a PC. Data related to age, date of injury and level of injury was obtained directly from the mainframe SCI Registry database, which is another VistA software package.
Typically, data preparation is the lengthiest part of the data mining process. It is estimated that 80% of the time spent in a data mining project is spent in data preparation and cleaning (Desouza, 2001; Gerber, 1998). Erroneous data can be a significant problem in real world databases. Data may be redundant or insignificant to the problem. Data preparation includes data selection (identification and extraction of data), data preprocessing (sampling and quality testing), and data transformation (conversion into an analytical model) (Cabena, Hadjinian, Stadler, Verhees & Zanasi, 1998). Goodwin et al. (1997) identify the issues obstructing progress in data mining for improved health outcomes as "data quality, data redundancy, data inconsistency, repeated measures, temporal (time-contextual) measures, and data volume" (p. 291). Invariably, routinely collected data is full of errors and incompleteness. Much of the data collected from this computerized database was found to be non-standardized and at a nominal level of measurement. As a result, data were visually inspected, structured, and checked for accuracy, reliability, and redundancy. Data "noise" included redundant, insignificant, erroneous, and missing data. Differences in punctuation and case or changes in word sequence were recognized by the computer software as new terms, new labels, or new variables. This required the researcher to make a visual inspection of all diagnostic and interventional labels and create a structure of labels that represent label clusters with a common or shared meaning.
The visual review of all eleven years of data in this study took approximately 500 hours.
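The label variation described above (differences in punctuation, case, and word sequence) was resolved in this study by visual inspection. As a hedged illustration of how some of that variation could be flagged automatically, the sketch below builds a comparison key that ignores those three kinds of difference. The sample labels and the function names are hypothetical, and this is not the method the investigators used.

```python
import re
from collections import defaultdict

def normalize_label(label):
    """Build a key that ignores case, punctuation, and word order."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return " ".join(sorted(tokens))

def group_labels(labels):
    """Map each normalized key to the raw label variants that share it."""
    groups = defaultdict(list)
    for label in labels:
        groups[normalize_label(label)].append(label)
    return dict(groups)

# Hypothetical label variants, not taken from the study database
raw_labels = [
    "Skin integrity, impaired",
    "Impaired skin integrity",
    "IMPAIRED SKIN INTEGRITY.",
    "Pain, chronic",
]

label_groups = group_labels(raw_labels)
```

Ignoring word order can merge labels whose meanings differ, and it does nothing for synonyms or misspellings, so candidate groups produced this way would still need review by someone with domain knowledge.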
There were 4,750 different diagnostic labels in the cumulative eleven-year database that, after visual inspection, were determined to represent 161 unique nursing diagnoses. Through further inspection, these were clustered into 20 diagnostic categories. Two domain experts with significant SCI knowledge and experience reviewed the categories to reach a consensus on the labels for the diagnostic categories. The selected diagnostic categories for the cumulative data were: Skin Care; Elimination; Self Care Deficit; Infection Prevention/Control; Mobility; Respiratory Function; Psychosocial Adaptation; Pain Management; Knowledge Deficit; Nutrition; Fluid Volume Maintenance; Acute Problem Management; Safety/Prevention of Injury; Activity/Rest; Cognitive Functioning; Temperature Control; Sexual Health; Communication; and Miscellaneous. Any diagnostic label within the cumulative database that did not appear at least eleven times during the eleven-year study period was assigned to the category of "Miscellaneous." A map of the annual diagnostic rankings for each of the eleven years in the study was developed to determine if there were significant changes in nursing diagnosis over the study time frame (see Table 1). The data set is currently being examined using data mining methodology.
Table 1: Annual rankings of nursing diagnostic categories for each of the eleven study years

Diagnostic category | 89–90 | 90–91 | 91–92 | 92–93 | 93–94 | 94–95 | 95–96 | 96–97 | 97–98 | 98–99 | 99–00 |
---|---|---|---|---|---|---|---|---|---|---|---|
Skin Care | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Elimination | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
Self Care Deficit | 2 | 3 | 4 | 3 | 3 | 4 | 3 | 4 | 4 | 5 | 5 |
Infection Prevention | 5 | 4 | 3 | 4 | 4 | 3 | 4 | 3 | 3 | 3 | 3 |
Mobility | 7 | 5 | 10 | 6 | 5 | 5 | 5 | 5 | 7 | 6 | 6 |
Psychosocial Adapt. | 4 | 6 | 6 | 5 | 7 | 7 | 6 | 6 | 5 | 7 | 7 |
Respiratory Function | 6 | 7 | 5 | 7 | 6 | 6 | 8 | 7 | 6 | 4 | 4 |
Comm. Reintegration | N/A | 8 | 11 | 9 | N/A | 8 | 10 | 8 | 8 | 9 | 9 |
Pain Mgmt. | 8 | 9 | 9 | 8 | 8 | 9 | 9 | 10 | 12 | 8 | 8 |
Knowledge Deficit | 9 | 10 | 7 | 12 | 10 | 10 | 7 | 9 | 9 | 10 | 10 |
Fluid Volume Maint. | 11 | 11 | 12 | 13 | 14 | 14 | 14 | 13 | 13 | 14 | 16 |
Nutrition | 10 | 12 | 8 | 10 | 9 | 13 | 12 | 11 | 14 | 15 | 15 |
Miscellaneous | 13 | 13 | 13 | 15 | 17 | 16 | 17 | 16 | 17 | 11 | 13 |
Acute Medical Mgmt. | 12 | 14 | 15 | 14 | 11 | 11 | 13 | 14 | 15 | 16 | 14 |
Activity/Rest | 13 | 15 | 14 | 11 | 12 | 15 | 16 | 15 | 11 | 13 | 11 |
Prevention of Injury | N/A | 16 | 16 | 16 | 13 | 12 | 11 | 12 | 10 | 12 | 12 |
Temperature Control | N/A | 17 | 19 | 19 | N/A | N/A | N/A | N/A | N/A | N/A | 17 |
Cognitive Functioning | N/A | 18 | 18 | 17 | 16 | 17 | 15 | 17 | 16 | 17 | 18 |
Sexual Health | N/A | 19 | 17 | 18 | 15 | 19 | 19 | 18 | N/A | N/A | N/A |
Sensory/Perceptual Deficit | N/A | N/A | N/A | N/A | N/A | 18 | 18 | N/A | N/A | N/A | N/A |
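The two data-handling rules behind the annual rankings (labels appearing fewer than eleven times go to "Miscellaneous", and categories are then ranked) can be sketched as follows. This is a minimal sketch under stated assumptions: that rank reflects how often a category was selected, that ties are broken alphabetically (the published rankings contain tied ranks, so the study's own procedure evidently differed on this point), and that the label counts, the label-to-category mapping, and the function names are all hypothetical.

```python
from collections import Counter

def assign_categories(label_counts, label_to_category, min_count=11):
    """Sum label counts per category; labels seen fewer than min_count times go to 'Miscellaneous'."""
    totals = Counter()
    for label, count in label_counts.items():
        if count < min_count:
            category = "Miscellaneous"
        else:
            category = label_to_category.get(label, "Miscellaneous")
        totals[category] += count
    return totals

def rank_categories(totals):
    """Rank categories by descending count (1 = most frequent); ties broken alphabetically."""
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return {category: rank for rank, (category, _) in enumerate(ordered, start=1)}

# Hypothetical counts and mapping, not study data
label_counts = {
    "Impaired skin integrity": 120,
    "Constipation": 80,
    "Urinary retention": 30,
    "Hypothermia": 4,
}
label_to_category = {
    "Impaired skin integrity": "Skin Care",
    "Constipation": "Elimination",
    "Urinary retention": "Elimination",
    "Hypothermia": "Temperature Control",
}

category_totals = assign_categories(label_counts, label_to_category)
category_ranks = rank_categories(category_totals)
```

Running the same two functions on each study year's counts would give one column of ranks per year, which is the shape of the table above.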
The potential for original research with pre-collected data is tremendous. The better acquainted the researcher is with the database, the greater the potential for creative new research. Secondary analysis is described as "extremely versatile in that it can be applied to studies designed to understand the present and the past; to understand change; and to examine phenomena comparatively" (Kiecolt & Nathan, 1985, p. 47). Advantages of such data analysis include larger samples and the elimination of instrument development, sample selection, and data collection (Abel & Sherman, 1991). Aggregated data may provide insight and information that is useful in patient care delivery, program planning, and policy development. Lange & Jacox (1993) have identified interest in using clinical and administrative healthcare databases for health policy research because of national concern about the quality, cost, and outcomes of healthcare.
The use of data mining, by definition, excludes the possibility of testing preconceived hypotheses. Data miners do not so much pose a question as ask the system to discover data patterns that may be predictive. The process of data mining may result in the identification of hypotheses for future research. Of specific interest is a predictive model using artificial neural networks for hospital length of stay based on nursing diagnosis. Care must be taken in the evaluation and analysis of data sets since data set variables may not adequately reflect the secondary analyst's concepts of interest. The task of designing a study using available data can be challenging.
The identification of patient outcomes sensitive to nursing care is a priority for nursing research. Research related to SCI nursing diagnoses and SCI nursing interventions may demonstrate under which particular circumstances specific interventions promote the most effective outcomes for SCI patients. The need to capture outcomes has been recognized by providers, payers, and policy makers. Knowledge discovery in clinical databases is a step toward the identification of outcomes and the measurement of effectiveness. Outcomes may be classified as "generic" or pertinent to all healthcare consumers, or "condition-related" and pertinent to sub-populations of patients with specific diseases or conditions. In addition, time becomes a dimension of outcome measurement. Outcome related data might come from multiple sources, such as the patient, families and caregivers, healthcare professionals, and biomedical instrumentation (Zielstorff, 1995). Assessment of effectiveness of care, according to Ozbolt (1996), requires standardized data aggregated in databases for comparison across times, conditions, and institutions. To analyze healthcare data, it is critical that data are stored in a retrievable format according to standards that will allow for data sharing and data queries while patient privacy and confidentiality are protected. There must also be a way to link outcome data to all influencing factors such as co-morbidities, procedures, treatments, interventions, and patient demographics.
Ozbolt (1991) has suggested that the failure of nursing to agree upon and offer a valid, defined, and standardized data set for inclusion in healthcare databases has created the problem of the lack of nursing inclusion. The inclusion of nursing data is absolutely necessary for nursing research on effectiveness. Nursing data are not included in many healthcare databases including the UHDDS (Uniform Hospital Discharge Data Set). There are several reasons why nursing is invisible in most existing healthcare datasets. Nursing data has not been required for regulatory reporting and reimbursement. In addition, the lack of a widely accepted nursing structured terminology supports the "noncapture" of nursing data. The issues identified by the work from AHCPR regarding data accessibility, data standards, and vocabulary standards are all identified in the study of SCI nursing data. The application of data mining techniques to the databases found in healthcare has the potential to uncover undetected patterns of practice and outcomes, and may also generate practice hypotheses for further research.
Aaronson, L., & Burman, M. (1994). Use of health records in research: Reliability and validity issues. Research in Nursing & Health, 17, 67–73.
Abbot, P., Quirolgico, S., Marchand, R., Canfield, K., & Adya, M. (1998). Can the U.S. minimum data set be used to predict admissions to acute care? Medinfo 9. Pt2 (13), 18–21.
Abel, E., & Sherman, J. (1991). Use of national data sets to teach graduate students research skills. Western Journal of Nursing Research, 13 (6), 794–797.
Agrawal, R., Imielinski, T., & Swami, A. (1998). Database mining: A performance perspective. San Jose, CA: IBM Almaden Research Center.
American Nurses Association. (1985). Code of ethics for nurses. Kansas City, MO: American Nurses Association.
Appleby, C. (1997, March 5). Cyberspaced. Hospitals and Health Networks, pp. 30–32.
Bakken, S., & Costantio, M. (2001). Standardized terminologies and integrated information systems: Building blocks for transforming data into nursing knowledge. In J. M. Dochterman & H. Grace, (Eds.), Current issues in nursing (pp. 52–59). St. Louis: Mosby.
Balakrishnan, J., & Ratcliff, R. (1996). Testing models of decision making using confidence ratings in classification. Journal of Experimental Psychology, 22 (3), 615–633.
Ball, M., Hannah, K., Newbold, S., & Douglas, J. (1995). Nursing informatics: Where caring and technology meet (2nd ed.). New York: Springer.
Berkowitz, M., O'Leary, P., Kruse, D., & Harvey, C. (1998). Spinal cord injury: An analysis of medical and social costs. New York: Demos Medical Publishing, Inc.
Biswas, G., Weinberg, J., & Fisher, D. (1998). ITERATE: A conceptual clustering algorithm for data mining. IEEE Transactions on Systems, Man, and Cybernetics, 28 (2), 219–230.
Blois, M. (1987). What is it that computers compute? M.D. Computing, 4 (3), 31–33, 56.
Borok, L. (1997). Data mining: Sophisticated forms of managed care modeling through artificial intelligence. Journal of Health Care Finance, 23 (3), 20–36.
Brossette, S., Sprague, A., Hardin, M., Waites, K., Jones, W., & Moser, S. (1998). Association rules and data mining in hospital infection control and surveillance. Journal of the American Medical Informatics Association, 5 (4), 273–281.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering data mining: From concept to implementation. Upper Saddle River, NJ: Prentice Hall, Inc.
Childs, N. (2001, February). Health agency releases medical data privacy standard. Provider, p. 12.
Cimino, J. (1995). Vocabulary and healthcare information technology: State of the art. Journal of the American Society for Information Sciences, 48 (10), 777–782.
Cowper, D., Hynes, D., Kubal, J., & Murphy, P. (1999). Using administrative databases for outcomes research: Select examples from VA health services research and development. Journal of Medical Systems, 23 (3), 240–259.
Danziger, D. (1997). Data mining—It's not just for statisticians anymore. Retrieved from http://www.tgc.com/dsstar/9710.
Darling, C. (1998). Data mining for the masses. Retrieved from http://www.datamation.com/plugIn/workbench/dataming.
Davidoff, F. (1997). Databases in the next millennium. Annals of Internal Medicine, 127 (8), 770–774.
DeJesus, E. (1999, October). State of the art/Data mining. BYTE. Retrieved from http://byte.com/art/950/sec8/sec8.htm.
Desouza, K. (2001). Artificial intelligence for healthcare management. In Proceedings of the First International Conference on Management of Healthcare and Medical Technology. Enschede, The Netherlands: Institute for Healthcare Technology Management.
Desouza, K. (2002). Knowledge management with artificial intelligence. Westport, CT: Quorum Books.
Drucker, P. (1999). Management challenges for the 21st century. New York: HarperCollins Books.
Evans, P., & Wurster, T. (1997, September–October). Strategy and the new economics of information. Harvard Business Review, pp. 71–82.
Fiesta. (1996). Legal issues in the information age—Part 2. Nursing Management, 27 (9), 12–13.
Fletcher, D. (1997). No fool's gold: Guarantee riches from your data mine. Healthcare Informatics, pp. 115–118.
Fransman, M. (1996, July). Information regarding the information superhighway and interpretive ambiguity. IEEE Communications Magazine, 34 (7), 76–80.
Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1992). Knowledge discovery in databases: An overview. AI Magazine, pp. 213–228.
Gabrieli, E. (1990). Electronic healthcare records: A discourse. Journal of Clinical Computing, 18 (5&6), 130–143.
Gartner Group. (2000). Retrieved from http://www.gartner.com.
Gerber, C. (1998). Excavate your data. Retrieved from http://www.PlugIn/workvench/datamine/exacv.htm.
Gilman, M. (1997). Nuggets™ and data mining. White paper. Melville, NY: Data Mining Technologies, Inc.
Glymour, C., Madigan, D., Pregibon, D., & Smyth, P. (1997). Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1, 11–28.
Goodwin, L., Prather, J., Schlitz, K., Iannacchione, M., Hage, M., Hammond, W., & Grzymala-Busse, J. (1997). Data mining issues for improved birth outcomes. Biomedical Sciences Instrumentation, 34, 291–296.
Gostin, L. (1997). Health care information and the protection of personal privacy: Ethical and legal considerations. Annals of Internal Medicine, 127 (8), 683–690.
Graves, J., & Corcoran, S. (1989). The study of nursing informatics. Image, 21 (4), 227–231.
Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., & Pirahesh, H. (1997). Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1, 29–53.
Hannah, K., Ball, M., & Edwards, M. (1994). Introduction to nursing informatics. New York: Springer-Verlag.
Hedberg, S. (1995, October). State of the art/The data mining gold rush. BYTE. Retrieved from http://www.byte.com/art95/sec8/art2.html.
Hersh, W. (1999). A world of knowledge at your fingertips: The promise, reality, and future directions of on-line information retrieval. Academic Medicine, 74 (3), 240–243.
Hlatky, M., Lee, K., Harrell, F., Califf, R., Pryor, D., Mark, D., & Rosati, R. (1984). Tying clinical research to patient care by use of an observational database. Statistics in Medicine, 3, 375–384.
Hynes, D., Cowper, D., Kerr, M., Kubal, J., & Murphy, P. (2000). Database and informatics support for QUERI: Current systems and future needs. Medical Care, 38 (6), 114–128.
Iezzoni, L. (1997). Assessing quality using administrative data. Annals of Internal Medicine, 127 (8), 666–674.
John, G. (1997). Enhancements to the data mining process. Stanford University Doctoral Dissertation. Ann Arbor, MI: UMI Dissertation Services. UMI Number: 9723376.
Johnson, N. (1999). Evaluating the quality and applicability of database-derived outcomes studies. Formulary, 34, 603–606.
Kiecolt, K., & Nathan, L. (1985). Secondary analysis of survey data. Thousand Oaks, CA: Sage Publications.
Kiel, J. (2000). Data mining and modeling: Power tools for physician practices. MD Computing: Computers in Medical Practice, 17 (3), 33–34.
Kolodner, R. (Ed.). (1997). Computerizing large integrated health networks: The VA success. New York: Springer-Verlag.
Kostoff, R., & Geisler, E. (1999). Strategic management and implication of textual data mining in government organizations. Technology Analysis & Strategic Management, 11 (4), 493–525.
Krowchuk, H., Moore, M., & Richardson, L. (1995). Using health care records as sources of data for research. Journal of Nursing Measurement, 3 (1), 3–12.
Lange, L., & Jacox, A. (1993). Using large data bases in nursing and health policy research. Journal of Professional Nursing, 9 (4), 204–211.
Lazaridis, E. (1997). Database standardization, linkage and the protection of privacy. Annals of Internal Medicine, 127 (8), 696.
Lewis, E. (1997, Spring). Guest editorial. Nursing Administration Quarterly, viii–x.
Lillard, L., & Farmer, M. (1997). Linking Medicare and national survey data. Annals of Internal Medicine, 127 (8), 691–695.
Lincoln, T., & Builder, C. (1999). Global healthcare and the flux of technology. International Journal of Medical Informatics, 53, 213–224.
Lingras, P., & Yao, Y. (1998). Data mining using extensions of the rough set model. Journal of the American Society for Information Science, 49 (5), 415–422.
Marietti, C. (1997). The data warehouse. Healthcare Informatics, pp. 93–102.
Matchar, D., Samsa, G., Matthews, J., Ancukiewicz, M., Parmigiani, G., Hasselblad, V., Wold, P., D'Agostino, R., & Lipscomb, J. (1997). The stroke prevention policy model: Linking evidence and clinical decisions. Annals of Internal Medicine, 127 (8), 704–711.
Mattison, R. (1996). State of the art: Warehousing wherewithal. Retrieved from http://www.cio.com/archive/040196_soa_content.html.
McArt, E., & McDougal, L. (1985). Secondary data analysis: A new approach in nursing research. Image, 17 (2), 54–57.
McDonald, C., Brossette, S., & Moser, S. (1998). Pathology information systems: Data mining leads to knowledge discovery. Archives of Pathology & Laboratory Medicine, 122, 409–411.
McDonald, C., Overhage, J., Dexter, P., Takesue, B., & Dwyer, D. (1997). A framework for capturing clinical data sets from computerized sources. Annals of Internal Medicine, 127 (8), 675–682.
McPhillips, R. (1991). National and regional databases: The big picture. In Patient outcomes research: Examining the effectiveness of nursing practice. Proceedings of the State of the Science Conference. NCNR, DHHS, NIH Publication # 93–3411.
Milley, A. (2000). Health care and data mining. Health Management Technology, 21 (8), 44–45.
Mills, W. (1991). Why a classification system? In R. Carroll-Johnson, (Ed.), Classification of nursing diagnoses: Proceedings of the 9th conference (pp. 3–5). Philadelphia, PA: Lippincott.
Murphy, P., Cowper, D., Seppala, G., Stroupe, K., & Hynes, D. (2002). Veterans Health Administration inpatient and outpatient care data: An overview. Effective Clinical Practice. Retrieved from www.acponline.org/journals/ecp/May/June02/Murphy.
Newbold, D. (1993). Deciding data. Nursing Times, 89 (48), 64–65.
Norton, M. (2000). Knowledge discovery with a little perspective. Bulletin of the American Society for Information Science, pp. 21–23.
Orsolits, M., Davis, C., & Gross, M. (1988). Nursing informatics and the future: The twenty-first century. In M. Ball, K. Hannah, U. Jelger, & H. Peterson, (Eds.), Nursing informatics: Where caring and technology meet. New York: Springer-Verlag.
Ozbolt, J. (1991). Strategies for building nursing data bases for effectiveness research. In Patient outcomes research: Examining the effectiveness of nursing practice. Proceedings of the State of the Science Conference. DHHS, NIH Publication # 93–3411.
Ozbolt, J. (1996). From minimum data to maximum impact: Using clinical data to strengthen patient care. Advanced Practice Nursing Quarterly, 1 (4), 62–69.
Palmer, R. (1997). Process-based measures of quality: The need for detailed clinical data in large health care databases. Annals of Internal Medicine, 127 (8), 733–738.
Peck, M., Nelson, N., Buxton, R., Bushnell, J., Dahle, M., Rosebrock, B., & Ashton, C. (1997, Spring). LDS Hospital, a facility of Intermountain Health Care. Nursing Administration Quarterly, 29–49.
Pringle, M., & Hobbs, R. (1991). Large computer databases in general practice. British Medical Journal, 302, 741–742.
Pryor, D., Califf, R., Harrell, F., Hlatky, M., Lee, K., Mark, D., & Rosati, R. (1985). Clinical databases: Accomplishments and unrealized potential. Medical Care, 23 (5), 623–647.
Psomas, J., Schaufele, M., & Madhaven, G. (2000). Data mining overview and select vendor tools. Unpublished manuscript.
Raghavan, V., Deogun, J., & Sever, H. (1998). Knowledge discovery and data mining. Journal of the American Society for Information Science, 49 (5), 397–402.
Ray, W. (1997). Policy and program analysis using administrative databases. Annals of Internal Medicine, 127 (8), 712–718.
Reed, J. (1992). Secondary data in nursing research. Journal of Advanced Nursing, 17, 877–883.
Rittman, M., & Gorman, R. (1992). Computerized databases: Privacy issues in the development of the nursing minimum data set. Computers in Nursing, 10 (1), 14–17.
Roberts, B., Anthony, M., Madigan, E., & Chen, Y. (1997). Data management: Cleaning and checking. Nursing Research, 46 (6), 350–352.
Romano, C. (1987). Privacy, confidentiality, and security of computerized systems: The nursing responsibility. Computers In Nursing, 5 (3), 99–104.
Romano, C., & Brennen, P. (1991). Computerizing the documentation of patient care. In C. D'Argenio, (Ed.), Implementing Nursing Diagnosis-based Practice. St. Louis: Mosby.
Rubin, D. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127 (8), 757–763.
Rudin, K. (1996). What's new in data warehousing? DBMS Data Warehouse Supplement. Retrieved from http://www.dbmsmag.com/9708.htm.
Sardinas, J., & Muldoon, J. (1998). Securing the transmission and storage of medical information. Computers in Nursing, 16 (3), 162–168.
Schulman, S. (1998). Data mining: Life after report generators. Information Today, 15 (3), 52.
Seelos, H. (1993). The empirical object of medical information. Journal of Medical Systems, 7 (2), 87–96.
Shams, K., & Fareshta, M. (2001). Data warehousing: Toward knowledge management. Topics in Health Information Management, 21 (3), 24–32.
Simoudis, E. (1998). Data mining: A technology comes of age. Retrieved from http://www.software.ibm.com/sq/issues/vol24/datatch.htm.
Sinclair, V. (1990). Potential effects of decision support systems on the role of the nurse. Computers in Nursing, 8 (2), 60–65.
Southon, F., Braithwaite, J., & Lorenzi, N. (1997). Strategic constraints in health informatics: Are expectations realistic? International Journal of Health Planning and Management, 12, 3–13.
SPSS. (2000). Build leading-edge e-commerce and business intelligence curricula. Retrieved from www.spss.com/education.
Steiner, C., Elixhauser, A., & Schnaier, J. (2002, May/June). The healthcare cost and utilization project: An overview. Effective Clinical Practice. Retrieved from http://www.acponline.org/journals/ecp/mayjun02/steiner.htm.
Styffe, E. (1997, Spring). Privacy, confidentiality, and security in clinical information systems: Dilemmas and opportunities for the nurse executive. Nursing Administration Quarterly, 21–28.
Tan, J., & Sheps, S. (Eds.). (1999). Health decision support systems. Gaithersburg, MD: Aspen Publishers, Inc.
Temple, R. (1990). Problems in the use of large data sets to assess effectiveness. International Journal of Technology Assessment in Health Care, 8, 211–219.
The American Nurse. (1999). New nursing recognition criteria announced. The American Nurse, 9. Washington: American Nurses Association.
Tierney, W., & McDonald, C. (1991). Practice databases and their uses in clinical research. Statistics in Medicine, 10, 541–557.
Turban, E., & Aronson, J. (2001). Decision support systems and intelligent systems. Upper Saddle River, NJ: Prentice Hall.
U.S. Department of Health and Human Services. (1991). The feasibility of linking research-related data bases to federal and non-federal medical administrative data bases (AHCPR Publication No. 91-0003). Rockville, MD: U.S. Department of Health and Human Services.
Vance, B., Gilleran-Strom, J., Kraft, M., Lang, B., & Mead, M. (1997). Nursing use of systems. In R. Kolodner, (Ed.), Computerizing large integrated health networks. New York: Springer-Verlag.
Vance, B., Kraft, M. R., & Lang, B. (1998). Nursing software development and implementation: An integral aspect of the Veterans Health Administration information system infrastructure. In S. Moorhead & C. Delaney, (Eds.), Information systems innovations for nursing: New visions and ventures. Thousand Oaks, CA: Sage Publications.
Werley, H., & Leske, J. (1991). Standardized comparable, essential data available through the nursing minimum data set. In J. Turley & S. Newbold, (Eds.), Nursing Informatics 91: Pre-conference proceedings. Heidelberg-Berlin: Springer-Verlag.
Werley, H., Devine, E., & Zorn, C. (1990). Nursing minimum data set: Data collection manual. Milwaukee, WI: University of Wisconsin School of Nursing.
Weschler, J. (1996). Electronic transmission, sharing of health information raising patient privacy concerns. Formulary, 31, 990–991.
Wray, N., Ashton, C., Kuykendall, D., & Hollingsworth, J. (1995). Using administrative databases to evaluate the quality of medical care: A conceptual framework. Social Science & Medicine, 40 (12), 707–715.
Wuerker, A. (1997). Longitudinal research using computerized clinical databases: Caveats and constraints. Nursing Research, 46 (6), 353–358.
Zielstorff, R. (1995). Capturing and using clinical outcome data: Implications for information systems design. Journal of the American Medical Informatics Association, 2, 191–196.