
Building Clinical Data Warehouse for Traditional Chinese Medicine Knowledge Discovery

Xuezhong Zhou1, Baoyan Liu2, Yinghui Wang3, Runsun Zhang3, Ping Li3, Shibo Chen2, Yufeng Guo3, Zhuye Gao4, Hua Zhang4

Email: [email protected]; [email protected]

1 College of Computer Science and Information Technology, Beijing Jiaotong University, Beijing, 100044, China

2 China Academy of Chinese Medicine Sciences, Beijing, 100070, China

3 Guanganmen Hospital, China Academy of Chinese Medicine Sciences, Beijing, 100053, China

4 Beijing University of Chinese Medicine, Beijing, 100029, China

Abstract

Clinical data from daily clinical practice, which follows traditional Chinese medicine (TCM) theories and principles, is the core source of empirical knowledge for TCM research. This paper introduces a data warehouse system, built on a structured electronic medical record system and daily clinical data, for TCM clinical research and medical knowledge discovery. The system consists of several key components: a clinical data schema, an extraction-transformation-loading tool, online analytical processing (OLAP) based on Business Objects (a commercial business intelligence product), and integrated data mining functionalities. Currently, the data warehouse contains 20,000 inpatient records on diabetes, coronary heart disease and stroke, and more than 20,000 outpatient records. Moreover, we have developed several important research-oriented subject analyses using OLAP and conducted several TCM clinical data mining applications. These applications show that the clinical data warehouse platform is a promising bridge between TCM clinical practice and theoretical research, and hence will promote related TCM research.

Keywords: Clinical data warehouse, Traditional Chinese medicine, Extraction-transformation-loading, Online analytical processing, Data mining

1. Introduction

Traditional Chinese medicine (TCM) has a long history and distinguished clinical effects. Unlike modern biomedical science, TCM has no general tradition of laboratory experimentation. Instead, clinical practice, or clinical experiment, is the core basis of TCM. Hence, new Chinese medical formulas and theoretical knowledge come not from the laboratory but directly from daily clinical practice. Clinical practice with synthesized treatment based on syndrome differentiation (STSD) is the basis of TCM clinical evaluation and clinical study [1]. The huge store of clinical data is firsthand and effective evidence for TCM clinical research. Developing the clinical wet-dry mode is a significant and vital task of TCM research [2].

A data warehouse [3] is a technical solution for the storage, management and processing of immense volumes of data. The increased demand for financial analysis [4], disease control [5], clinical decision processes [6], adverse drug event analysis [7], laboratory test data analysis [8] and information feedback for hospital practice management [9] in healthcare organizations has given rise to the research and development of clinical data warehouses. Clinical data warehousing is a difficult systematic task with many particular complications, such as many-to-many relationships, entity-attribute-value (EAV) data structures and bitemporal data [10]. Integrating medical data stores is challenging, hence data warehouse architectures [11] have been studied to propose practicable solutions to data integration issues.

Compared with the clinical data of modern medicine, TCM clinical data has some significant and distinct information content, such as symptom/sign, TCM syndrome, formula and herb. These types of information are the core elements of TCM clinical data. Moreover, systematically described symptom/sign information is the foundation for TCM syndrome diagnosis. Therefore, medical records that contain symptoms/signs should be structured and stored in a relational database. To utilize and analyze daily TCM clinical data for TCM research, we have developed a clinical data warehouse system based on the structured TCM electronic medical record system (SEMR) [12], which stores the information of the medical record (e.g. chief complaint and histories) as structured data. Furthermore, since most TCM clinical data, such as symptom/sign, diagnosis and formula prescription, is represented by terminologies, we have made a systematic study of TCM clinical terminology and nomenclature [13] to facilitate data entry and standard representation.

2008 International Conference on BioMedical Engineering and Informatics

978-0-7695-3118-2/08 $25.00 © 2008 IEEE. DOI 10.1109/BMEI.2008.83

We have collected about 20,000 inpatient records on diabetes, coronary heart disease (CHD) and stroke from TCM hospitals (ten top-grade hospitals in Beijing, China) or TCM wards. Furthermore, there are more than 20,000 outpatient records, which document the outpatient clinical process of more than twenty famous TCM physicians in Beijing, China. By comprehensively analyzing the characteristics of TCM clinical data structure and the analysis subjects of TCM clinical research, we have designed the information model, physical data model and multidimensional data model for the clinical data warehouse. Meanwhile, we have developed an extraction-transformation-loading (ETL) tool, Medical Integrator (MI), to perform clinical data integration, cleaning and preprocessing. Furthermore, we have integrated two data mining systems, namely Weka and Oracle Data Miner, and a business intelligence tool (Business Objects) to implement a TCM clinical intelligence platform with data mining and online analytical processing (OLAP) capabilities.

2. The infrastructure of TCM clinical data warehouse

As a comprehensive platform for TCM clinical and theoretical research, the TCM clinical data warehouse system is built on Java and the J2EE platform. The technology infrastructure of the TCM clinical data warehouse is depicted in Fig. 1. The infrastructure integrates different operational data sources (e.g. SQL Server, Oracle, DB2) using a self-developed, purpose-built ETL tool. More data sources can be supported by extending the database interface configuration. Because the operational data sources are heterogeneous, we use a series of metadata tables to record the metadata (e.g. database type, hospital information, physician information, data content description and transformation information) of the different data sources.
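The role of the metadata tables can be illustrated with a minimal Python sketch. It is not the system's actual schema: the class `SourceMetadata`, the function `transform_row`, the source identifiers and all column names are hypothetical, chosen only to show how per-source metadata could drive the mapping of heterogeneous source rows onto one warehouse layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMetadata:
    """One row of a hypothetical metadata table describing an operational source."""
    source_id: str
    database_type: str        # e.g. "SQL Server", "Oracle", "DB2"
    hospital: str
    content_description: str
    column_mapping: dict      # transformation info: source column -> warehouse column


def transform_row(meta, row):
    """Rename the columns of one source row using the metadata mapping.

    Source columns that have no mapping are dropped.
    """
    return {meta.column_mapping[col]: value
            for col, value in row.items()
            if col in meta.column_mapping}


# Hypothetical registry with two sources that name the same fields differently.
registry = {
    "hosp_a": SourceMetadata("hosp_a", "SQL Server", "Hospital A", "inpatient records",
                             {"pat_no": "patient_id", "sym_text": "symptom"}),
    "hosp_b": SourceMetadata("hosp_b", "Oracle", "Hospital B", "outpatient records",
                             {"PID": "patient_id", "SYMPTOM_TXT": "symptom"}),
}

source_rows = [
    ("hosp_a", {"pat_no": "A001", "sym_text": "thirst", "bed": "12"}),
    ("hosp_b", {"PID": "B042", "SYMPTOM_TXT": "palpitation"}),
]

unified = [transform_row(registry[source], row) for source, row in source_rows]
```

With this design, supporting a further data source means adding one metadata entry rather than changing the transformation code.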

Data storage and management is supported by Oracle (currently, we use Oracle 10g as the database server), and the analysis and query service is mainly supported by the business intelligence system Business Objects (BO). BO provides design and analysis clients such as Crystal Reports, Web/Desktop Intelligence, Dashboard and Performance Manager to implement the OLAP functionalities. Meanwhile, we integrate the Oracle data mining option, with its client Oracle Data Miner, and the machine learning platform Weka to perform online data mining tasks. The infrastructure therefore provides a technological framework for the integration, preprocessing, management and online analysis of huge volumes of TCM clinical data.

Fig.1. The technology infrastructure of TCM clinical data warehouse.

As a platform aimed at TCM clinical research, the TCM clinical data warehouse system can also directly provide preprocessed data to statistical software (e.g. SPSS, SAS and STATISTICA) for statistical analysis and testing. Hence, from the application perspective, the TCM clinical data warehouse offers an integrated functional platform that supports raw clinical data integration and data cleaning, OLAP, data mining and statistical analysis.

3. Traditional Chinese medicine clinical data model design

Information model analysis and design is a vital step in developing the TCM clinical data warehouse. A medical information model such as the HL7 Reference Information Model (RIM) [14] is a very complicated system with various concepts and relationships. The objectives of the HL7 RIM are to support the medical operational process and, in particular, information exchange between different medical information systems. The semantic network of the Unified Medical Language System (UMLS) [15] is regarded as a distinguished medical ontology in modern biomedical science. Its semantic types and structures provide a global conceptual view of medical terminologies. The focus of UMLS is to bridge the gap between the different terminological systems used in the medical literature. Hence, the design of the core framework of its semantic network adheres to the principle of conceptual unification.


However, the information model of the TCM clinical data warehouse focuses on the information content that will be analyzed and used in TCM clinical and theoretical research. Hence, the emphasis of our work is the classification and definition of the information generated by TCM clinical processes. We consider the TCM clinical process as a dynamic system with two core entities, namely physician and patient, and three core information elements, namely symptom, disease/TCM syndrome and treatment. The symptom element is regarded as a relatively objective disease phenomenon, whereas disease/TCM syndrome is a type of human morbid status, which is the diagnosis made by a specific physician. Meanwhile, TCM treatment is a clinical event that aims to restore the patient's health. Therefore, by abstracting these five core elements (the two entities and the three information elements) and constructing the global conceptual framework of the TCM information model, we design an information model for the clinical data warehouse. We consider that the main content of TCM clinical research is the study of the relationships between different entities in one event and of the relationships between different events. Therefore, we can regard clinical information as various kinds of events (phenomena and activities); in every event, several conceptual entities and physical entities may participate at a specific time. Because TCM and modern medical concepts and methods are mixed in the current TCM clinical process in hospitals, the sub-classes of the entity class are also a mixture of TCM and modern medical classes. For example, we have defined two distinct disease classes in the model. One class represents the disease concept in TCM, while the other represents the modern medical concept. It should be noted that the entity classes are materialized as dictionary tables in the physical data model of the data warehouse. A more detailed description of the information model is given in [16].
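The event and entity view described above can be sketched as follows. This is only an illustration under our own assumptions, not the model defined in [16]: the class names `Entity` and `Event`, the entity class labels, the function `dictionary_table` and the sample visit are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Entity:
    """A conceptual or physical entity that participates in an event."""
    entity_class: str   # e.g. "physician", "patient", "symptom", "tcm_syndrome",
                        # "tcm_disease", "modern_disease", "herb" (assumed labels)
    name: str


@dataclass
class Event:
    """A clinical event: an observed phenomenon or an activity at a given time."""
    event_type: str     # assumed labels: "phenomenon", "diagnosis", "treatment"
    when: date
    participants: list = field(default_factory=list)


def dictionary_table(events, entity_class):
    """Collect the distinct names of one entity class, as a dictionary table would."""
    return sorted({e.name
                   for ev in events
                   for e in ev.participants
                   if e.entity_class == entity_class})


# One invented outpatient visit expressed as three events.
day = date(2007, 3, 1)
patient = Entity("patient", "P001")
physician = Entity("physician", "Physician A")
visit = [
    Event("phenomenon", day, [patient, Entity("symptom", "thirst"),
                              Entity("symptom", "fatigue")]),
    Event("diagnosis", day, [physician, patient,
                             Entity("tcm_syndrome", "Qi-Yin Deficiency"),
                             Entity("modern_disease", "type 2 diabetes mellitus")]),
    Event("treatment", day, [physician, patient, Entity("herb", "Huang Qi")]),
]

symptoms = dictionary_table(visit, "symptom")
```

Keeping the TCM and modern medical disease concepts as separate entity classes mirrors the two distinct disease classes mentioned in the text.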

Adhering to this information model, we have designed the physical data model to store and manage the TCM clinical data. Furthermore, to support multidimensional analysis such as OLAP, we have designed several core multidimensional data models as the structural basis of the data marts. We have developed several significant subject analysis applications for TCM clinical research, each with a corresponding relational multidimensional data model. The practical results show that the information model and the multidimensional data models support the clinical analysis applications very well.
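A relational multidimensional model of this kind can be illustrated with a small star schema. The sketch below uses Python's built-in `sqlite3` module rather than Oracle, and every table name, column name and data row is invented for illustration; the actual data mart schemas are not given in this paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_physician (physician_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_syndrome  (syndrome_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_herb      (herb_id      INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one row per herb prescribed for a diagnosed syndrome.
CREATE TABLE fact_prescription (
    physician_id INTEGER REFERENCES dim_physician(physician_id),
    syndrome_id  INTEGER REFERENCES dim_syndrome(syndrome_id),
    herb_id      INTEGER REFERENCES dim_herb(herb_id),
    dosage_g     REAL
);
""")
conn.executemany("INSERT INTO dim_physician VALUES (?, ?)",
                 [(1, "Physician A"), (2, "Physician B")])
conn.executemany("INSERT INTO dim_syndrome VALUES (?, ?)",
                 [(1, "Qi-Yin Deficiency"), (2, "Blood Stasis")])
conn.executemany("INSERT INTO dim_herb VALUES (?, ?)",
                 [(1, "Huang Qi"), (2, "Mai Dong"), (3, "Dan Shen")])
conn.executemany("INSERT INTO fact_prescription VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 30.0), (1, 1, 1, 20.0), (1, 1, 2, 15.0),
                  (1, 2, 3, 12.0), (2, 2, 3, 10.0)])


def herb_usage(connection, physician_name):
    """Roll up herb use by syndrome for one physician (a report parameter)."""
    query = """
        SELECT s.name, h.name, COUNT(*), AVG(f.dosage_g)
        FROM fact_prescription f
        JOIN dim_physician p ON p.physician_id = f.physician_id
        JOIN dim_syndrome  s ON s.syndrome_id  = f.syndrome_id
        JOIN dim_herb      h ON h.herb_id      = f.herb_id
        WHERE p.name = ?
        GROUP BY s.name, h.name
        ORDER BY s.name, h.name
    """
    return connection.execute(query, (physician_name,)).fetchall()


report = herb_usage(conn, "Physician A")
```

Each returned row holds the syndrome, the herb, the number of prescriptions and the mean dosage, which is the kind of aggregate an OLAP report over such a data mart would display.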

4. Medical Integrator

ETL is the core component of a successful data warehouse system. Because of the complex clinical data structure, the need for flexible data checking, the integration of multiple heterogeneous data sources and the extensive terminological standardization involved, even commercial ETL systems do not fit the task well. Hence we developed MI, a purpose-built ETL tool, using Java and the Eclipse Standard Widget Toolkit (SWT) to implement the required functions. Fig. 2 is a snapshot of the main form of MI. Its key functions include data connection configuration, data checking, source database consolidation, data transformation and loading, data cleaning, data standardization and a data analysis interface.

Besides the traditional ETL functions, MI focuses on particular functions such as data standardization and the data analysis interface. The data standardization process mainly concerns terminological data such as symptom, diagnosis and treatment (herb name, descriptive phrase of the therapeutic method, etc.). Because clinical data contains various terms and phrases with flexible expressions, as well as errors, data standardization is vital for effective analysis. We use a rule-based batch processing approach for these tasks. About eight rule tables are designed to store the different kinds of standardization rules. TCM clinical experts edit the rules and import them into the corresponding tables using Medical Integrator. To keep the original data available for different analysis applications, MI builds the necessary intermediate tables to store the processed data and provides a standardized data set for different potential analyses. Take the symptom standardization process as an example. Expressions of symptoms vary widely in clinical practice because of the personal preferences of different physicians. Erroneous expressions or spellings are also likely in such a large data store. We let domain experts edit four kinds of transformation rules to standardize the symptom data. These four kinds of rules govern noise data cleaning, unified term description, terminological granularity unification and synonym unification. The result of symptom standardization is a set of terminological phrases, each denoting a unified concept.
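The four kinds of symptom rules can be pictured as successive lookup stages applied in batch. The following Python sketch is an assumption about how such a pipeline might look, not the implementation in MI: the rule contents, the English example terms and the function `standardize` are invented, and our reading of each rule type (for instance, that granularity unification drops severity modifiers) is only one plausible interpretation.

```python
import re

# 1. Noise cleaning: regular-expression substitutions (assumed form of the rules).
noise_rules = [(r"[?!.,;]+", ""), (r"\s+", " ")]
# 2. Unified term description: variant wordings of the same term.
description_rules = {"mouth dry": "dry mouth"}
# 3. Granularity unification: overly specific terms mapped to the chosen level.
granularity_rules = {"slight dry mouth": "dry mouth", "severe dry mouth": "dry mouth"}
# 4. Synonym unification: different terms for the same concept.
synonym_rules = {"xerostomia": "dry mouth"}


def standardize(term):
    """Apply the four rule stages in order and return the unified term."""
    text = term.lower()
    for pattern, replacement in noise_rules:
        text = re.sub(pattern, replacement, text)
    text = text.strip()
    for table in (description_rules, granularity_rules, synonym_rules):
        text = table.get(text, text)   # terms without a rule pass through unchanged
    return text


raw_symptoms = ["  Dry mouth??", "mouth dry", "slight dry  mouth",
                "Xerostomia", "palpitation"]

# Keep the original expression next to the standardized one, in the spirit of
# the intermediate tables that preserve the original data.
intermediate_table = [(raw, standardize(raw)) for raw in raw_symptoms]
```

Because the rules live in plain tables rather than in code, domain experts can extend them without touching the processing logic, which is the main attraction of a rule-based batch approach.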

The EAV structure [10][17] is the preferred choice in clinical data models. However, most statistical and data mining systems require conventional flat data. Moreover, some analysis systems need encoded data. Hence, to seamlessly integrate the statistical and data mining systems, we have developed several key functions (e.g. automatic encoding, EAV-to-flat schema conversion and data export) for the data analysis interface. Using these functions of MI, we can prepare high-quality data sets for various data analysis tasks.
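The EAV-to-flat conversion and the automatic encoding mentioned above can be sketched in a few lines of Python. The sample rows and the helper names `eav_to_flat` and `encode` are hypothetical; the sketch assumes that each attribute-value pair becomes one binary column, which is one common flattening convention and not necessarily the one used in MI.

```python
# Each EAV row is (entity, attribute, value); here the entity is a visit id.
eav_rows = [
    ("v1", "symptom", "thirst"),
    ("v1", "symptom", "fatigue"),
    ("v1", "syndrome", "Qi-Yin Deficiency"),
    ("v2", "symptom", "fatigue"),
    ("v2", "syndrome", "Blood Stasis"),
]


def eav_to_flat(rows):
    """Pivot EAV rows into one record per entity with 0/1 indicator columns."""
    columns = sorted({f"{attr}={value}" for _, attr, value in rows})
    records = {}
    for entity, attr, value in rows:
        record = records.setdefault(entity, dict.fromkeys(columns, 0))
        record[f"{attr}={value}"] = 1
    return columns, records


def encode(values):
    """Assign a stable integer code (starting at 1) to each distinct value."""
    return {value: code for code, value in enumerate(sorted(set(values)), start=1)}


columns, records = eav_to_flat(eav_rows)
# Flat matrix with one row per visit, ready for export to a statistics package.
matrix = [[records[entity][col] for col in columns] for entity in sorted(records)]
syndrome_codes = encode(v for _, a, v in eav_rows if a == "syndrome")
```

The EAV form stays convenient for storage, since a new symptom needs no schema change, while the flat matrix is what most statistical and data mining tools expect as input.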

Fig. 2. The main interface of Medical Integrator with functional items.

5. Data analysis components

Based on the multidimensional data model and ETL preprocessing, the clinical data is prepared for the analysis and data mining tasks of clinical research. We use BO to provide OLAP analysis. Data mining systems such as Oracle Data Mining and Weka are also integrated into the clinical data warehouse system.

BO has multidimensional analysis report design tools such as Crystal Reports and Web/Desktop Intelligence. The BO platform is also a middleware server that supports the management, design and browsing of reports in a browser/server (B/S) framework. The semantic layer is a patented technology of the Business Objects company. It maps the data structure to domain knowledge categories. Compared with the complicated physical data structure of the data warehouse, the semantic layer (categories and attributes) is rather simple and medically meaningful.

Oracle Data Mining is an option of Oracle 10g Enterprise Edition. We have integrated its data mining client, Oracle Data Miner, into the TCM clinical data warehouse. Furthermore, we have integrated the well-known open-source machine learning platform Weka (version 3.4) [18], configured with JDBC to use the data in the data warehouse directly. Both integrated data mining systems can access the data in the clinical data warehouse online, which makes data mining tasks more convenient.

6. Clinical data analysis and knowledge discovery case studies

Clinical practice plays a vital role in TCM research and development. Inductive analysis of the empirical data from clinical practice is a key step in TCM clinical research. Moreover, the study of the relationships between primary conceptual elements such as disease, syndrome, symptom/sign, herb and formula is the central issue of TCM clinical research.

6.1. Online analytical processing and descriptive analysis

Based on the multidimensional data schema and the BO semantic layers, we have developed 10 OLAP subject analysis applications with more than 400 analysis reports. The subjects mainly focus on two types of clinical knowledge: the empirical diagnosis and treatment knowledge of famous TCM physicians, and the clinical features of major chronic diseases such as diabetes, stroke and CHD. The subjects cover the data profiles of physicians or diseases, clinical herb and formula use, and the relationships among clinical findings, TCM syndromes, diseases and complications, etc. The analysis reports can be accessed by authorized web users. Besides browsing the reports interactively, users can export the results in Excel or PDF format.

Fig. 3 is a screenshot of the global data profile (the graphic area) of a famous TCM physician. It presents the total number of patient instances, the number of consultations, the disease distribution, herb and formula use, the symptom distribution and therapeutic methods, etc. The global data profile provides baseline information on the clinical data related to a specific physician. Fig. 3 shows that this physician's clinical data mainly concerns diseases such as Xiong Bi (thoracic obstruction of Qi), gastric pain, Xin Ji (palpitation) and vertigo.

Fig. 3. The global profile analysis of outpatient clinical data of a famous TCM physician.

We can also learn a famous TCM physician's herb use for a TCM syndrome (Fig. 4) or a symptom (Fig. 5). Other empirical knowledge, such as the clinical use of classical formulas and regular herb dosages, is analyzed by the corresponding OLAP reports. All the developed reports have appropriate parameters, such as physician name and disease name, which users can select on demand to show the analysis results for different physicians or diseases. The exploratory analysis of the inpatient data focuses on the relationships among disease, TCM syndrome and clinical findings.

Fig. 4. Herb use for a specific TCM syndrome by a famous TCM physician.

Fig. 5. The relationships between herb and symptom show which herbs would be prescribed for a specific symptom.

6.2. Data mining

With the integrated data mining and preprocessing functions of the clinical data warehouse, we have successfully conducted several preliminary TCM clinical data analysis studies, such as acupuncture prescription knowledge discovery [19], the relationship between formula (herbs) and syndrome in type 2 diabetes mellitus (T2DM) with metabolic syndrome [20], herb treatment for T2DM [21], and cluster analysis of TCM syndrome types in patients with acute myocardial infarction [22].

The acupuncture prescription knowledge discovery study [19] focuses on the empirical clinical acupuncture prescriptions of Prof. Conghuo Tian of the acupuncture department of Guanganmen Hospital, Beijing, China. Using the association rule mining method in Weka, we obtained 18 acupuncture prescriptions from more than 1,200 medical records. Prof. Tian indicated that one of the eighteen is not a fixed prescription in his clinical practice. We therefore finally obtained 17 useful acupuncture prescriptions (with prescription name, acupuncture point composition, modifications, main efficacy, etc.), which reflect the empirical knowledge of Prof. Tian. More data mining case studies on the outpatient clinical data can be found in [16].
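The idea behind association rule mining on acupuncture records can be shown with a small self-contained Python sketch. It is a brute-force illustration of support and confidence, not Weka's implementation and not the procedure of [19]; the four records, the thresholds and the function names are invented.

```python
from itertools import combinations

# Each invented record is the set of acupuncture points used in one treatment.
records = [
    {"ST36", "SP6", "LI4"},
    {"ST36", "SP6"},
    {"ST36", "SP6", "PC6"},
    {"LI4", "PC6"},
]


def support(itemset, data):
    """Fraction of records that contain every item of the itemset."""
    return sum(itemset <= record for record in data) / len(data)


def frequent_itemsets(data, min_support, max_size=3):
    """Enumerate all itemsets up to max_size whose support reaches the threshold."""
    items = sorted(set().union(*data))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), data)
            if s >= min_support:
                found[frozenset(combo)] = s
    return found


def association_rules(found, min_confidence):
    """Derive rules lhs -> rhs with confidence = support(lhs and rhs) / support(lhs)."""
    rules = []
    for itemset, s in found.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored in `found`.
                confidence = s / found[frozenset(lhs)]
                if confidence >= min_confidence:
                    rhs = tuple(sorted(itemset - set(lhs)))
                    rules.append((lhs, rhs, confidence))
    return rules


found = frequent_itemsets(records, min_support=0.5)
rules = association_rules(found, min_confidence=0.8)
```

Point combinations that recur with high support are candidate fixed prescriptions, which an expert then confirms or rejects, as Prof. Tian did for the mined prescriptions.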

The data mining case studies on the inpatient clinical data focus on T2DM and CHD. T2DM is still a relatively new disease for TCM treatment, and the TCM syndrome classification of T2DM remains a research issue. We study the TCM syndrome classification of T2DM with metabolic syndrome by herb composition network analysis [20]. We find that, as the disease course extends, the main therapeutic methods for T2DM with metabolic syndrome are nourishing Yin and clearing heat, replenishing Qi and nourishing Yin, and replenishing Qi and nourishing blood, etc. This indicates that the TCM syndrome categories of T2DM with metabolic syndrome are Yin Deficiency Heat Excess (early stage), Qi-Yin Deficiency (middle stage) and Qi-Deficiency Blood Stasis (terminal stage). The result offers preliminary guidance for the clinical treatment of patients with T2DM and metabolic syndrome. We have also studied the herb prescription knowledge for T2DM with different complications [21], which likewise provides useful information for the TCM treatment of T2DM.
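A herb composition network can be built by linking herbs that appear in the same prescription. The Python sketch below shows only this basic construction on invented prescriptions; it does not reproduce the analysis of [20], and the variable names are our own.

```python
from collections import Counter
from itertools import combinations

# Invented prescriptions; each is the list of herbs in one formula.
prescriptions = [
    ["Huang Qi", "Mai Dong", "Sheng Di"],
    ["Huang Qi", "Mai Dong"],
    ["Huang Qi", "Dan Shen"],
]

# Edge weight = number of prescriptions in which two herbs co-occur.
edge_weight = Counter()
for herbs in prescriptions:
    for a, b in combinations(sorted(set(herbs)), 2):
        edge_weight[(a, b)] += 1

# Degree = number of distinct herbs a herb is combined with.
degree = Counter()
for a, b in edge_weight:
    degree[a] += 1
    degree[b] += 1

# The most connected herb is a candidate core herb of the formula family.
hub, hub_degree = degree.most_common(1)[0]
```

Highly connected herbs and heavily weighted edges point to core herb combinations, from which the underlying therapeutic methods and syndrome categories can then be interpreted by clinical experts.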

7. Conclusion and Future Work

In conclusion, clinical research built on real TCM clinical practice, which follows STSD, is an essential requirement of TCM research. This paper proposes a data warehouse solution for the organization, management, processing and analysis of clinical data. We have completed the whole framework and developed the core components, such as the clinical information model, the ETL tool, and the OLAP and data mining functions. Moreover, based on the collected structured EMR data, we have developed and performed several research-oriented subject analyses and data mining tasks. The data analysis case studies show that the clinical data warehouse provides a handy platform for TCM clinical knowledge discovery. The clinical data warehouse is therefore a promising infrastructure for TCM clinical and theoretical research. However, the project is still in progress. We will focus on the following three tasks in the future.

Privacy and security are the main problems in using and sharing clinical data. We will address the protection of information about both physicians and patients. This has been taken into account in the current ETL tool and data analysis applications.

Currently, the clinical data contains only TCM research-oriented information, while hospital management information is not yet covered. To meet the decision support requirements of hospital management, we will consider integrating data from the hospital information system and developing the corresponding subject analyses.

Compared with collecting free-text EMR data, collecting high-quality structured clinical data is still a laborious job. Therefore, the limited data volume has not yet allowed full use of the whole clinical data warehouse framework. We have been working hard on upgrading the SEMR system to facilitate data entry. Furthermore, as more TCM hospitals adopt the SEMR system as their regular EMR collection tool and more research projects are permitted to provide their data, the data volume will increase rapidly in the near future.

Acknowledgements

This work is partially supported by Scientific Breakthrough Program of Beijing Municipal Science & Technology Commission, China (H020920010130), China Postdoctoral Science Foundation (2005037106), China Key Technologies R & D Programme (2007BA110B06), China 973 project (2006CB504601) and the Science and Technology Foundation of Beijing Jiaotong University (2007RC072).

References

[1]. Liu B., Hu J., Xie Y., et al, Conception and Study in Establishment of Modern Individualized Diagnosis and Treatment System in TCM (in Chinese). World Science and Technology-Modernization of TCM. 2003; 5(1):1-5.

[2]. Liu B., Zhou X., Design and Practice of Wet-Dry Approach in Clinical Research of TCM (in Chinese), World Science and Technology-Modernization of TCM. 2007; 9(1):85-9.

[3]. Inmon W.H., Building the Data Warehouse (Third Edition), John Wiley & Sons, Inc.2002.

[4]. Silver M., Sakata T., Su H., et al, Case study: how to apply data mining techniques in a healthcare data warehouse. J Healthc Inf Manag. 2001; 15: 155-64.

[5]. Wisniewski M.F., Kieszkowski P., et al, Development of a Clinical Data Warehouse for Hospital Infection Control. JAMIA. 2003; 10(5):455-62.

[6]. Banek M., Tjoa A. M., Stolba N., Integrating Different Grain Levels in a Medical Data Warehouse Federation. In Proceedings of Data Warehousing and Knowledge Discovery, A. Min Tjoa, Juan Trujillo (Eds.), 2006, Krakow, Poland, LNCS, 4081, 185-94.

[7]. Einbinder J.S., Scully K., Using a Clinical Data Repository to Estimate the Frequency and Costs of Adverse Drug Events. JAMIA. 2002 Nov–Dec; 9(6 Suppl 1): s34-s38.

[8]. Allard R.D., The clinical laboratory data warehouse. An overlooked diamond mine, Am J Clin Pathol 2003, 817-9.

[9]. Grant A., Moshyk A., Diab H., et al, Integrating feedback from a clinical data warehouse into practice organisation. Int J Med Inform. 2006; 75:232-9.

[10]. Pedersen T.B., Jensen C.S., Research Issues in Clinical Data Warehousing. In Proceedings of SSDBM-98, Italy, July 1-3, 1998.

[11]. Sahama T.R., Croll P.R., A data warehouse architecture for clinical data warehousing. in Proceedings of the fifth Australasian symposium on ACSW frontiers, Australian Computer Society, Inc., Darlinghurst, Australia, 2007;68:227-32.

[12]. Li P., Liu B., Wen T., et al, Traditional Chinese medicine electronic medical record system and the reorganization of TCM theoretical knowledge (in Chinese). Chinese Journal of Information on TCM. 2005; 12(4):7, 39.

[13]. Guo Y., Liu B., Li P., et al, Ontology and Standardization of the TCM Terms (in Chinese). Chinese Archives of TCM. 2007; 25(7):1368-70.

[14]. HL7 Reference Information Model, http://www.hl7.org/library/data-model/RIM.

[15]. Lindberg D.A.B., Humphreys B.L., McCray A.T., The Unified Medical Language System. Meth Inform Med. 1993; 32:281-91.

[16]. Zhou X., The Research on TCM Clinical Data Warehousing and Clinical Data Mining Methods (in Chinese). Postdoctoral Report, China Academy of Chinese Medical Sciences, 2007.3.

[17]. Deshpande A.M., Brandt C., Nadkarni P.M., Metadata-driven Ad Hoc Query of Patient Data Meeting the Needs of Clinical Studies. JAMIA. 2002; 9(4):369-82.

[18]. Witten I.H. and Frank E., Data Mining: Practical machine learning tools and techniques (2nd Edition) Morgan Kaufmann, San Francisco, 2005.

[19]. Zhang H., Tian C., Liu B., et al, Study on the idea of clinical acupuncture point combination of TCM physician Tian (in Chinese). Journal of Clinical Acupuncture and Moxibustion. 2007; 23(2):36-8.

[20]. Ni Q., Liu B., Chen S., et al, Study of Relationship between Formula (herbs) and Syndrome about Type 2 Diabetes Mellitus Affiliated Metabolic Syndrome Based on the Scale-free Network (in Chinese). Chinese Journal of Information on TCM. 2006; 13(11):19-22.

[21]. Jian Z., Ni Q., Zhou X., et al, Study on treatment law of type 2 diabetes based on structural clinical information collect system (in Chinese). Journal of Shandong University of TCM. 2007; 31(3):195-7.

[22]. Zhuye Gao, Hao Xu, Dazuo Shi, et al, The Cluster Analysis on Syndrome Type of TCM in Patients with Acute Myocardial Infarction (in Chinese). Journal of Emergency in TCM. 2007;16(4): 432-4.
