mining three-dimensional anthropometric body surface scanning data for hypertension detection

264 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 11, NO. 3, MAY 2007

Mining Three-Dimensional Anthropometric Body Surface Scanning Data for Hypertension Detection

Chaochang Chiu, Kuang-Hung Hsu, Pei-Lun Hsu, Chi-I Hsu, Po-Chi Lee, Wen-Ko Chiou, Thu-Hua Liu, Yi-Chou Chuang, and Chorng-Jer Hwang

Abstract—Hypertension is a major disease and one of the top ten causes of death in Taiwan. The exploration of three-dimensional (3-D) anthropometry scanning data, along with other existing subject medical profiles, using data mining techniques has become an important research issue for medical decision support. This research attempts to construct a prediction model for hypertension using anthropometric body surface scanning data. It adopts classification trees to reveal the relationship between a subject’s 3-D scanning data and hypertension, using a hybrid of the association rule algorithm (ARA) and genetic algorithm (GA) approaches. The ARA is adopted to obtain useful clues with which the GA can carry out its search more efficiently. The proposed approach was tested and compared with a regular genetic algorithm in predicting a subject’s hypertension. The proposed approach demonstrates better computational efficiency and more accurate prediction results.

Index Terms—Anthropometric data, association rule, classification trees, genetic algorithms (GAs), hypertension.

I. INTRODUCTION

HYPERTENSION, a major disease, is one of the top ten causes of death in Taiwan. Hypertension is deemed a factor for Syndrome X, which has been investigated for years in epidemiologic studies [1]–[4]. Although earlier identification of this disease is gaining importance in clinical research, the investigation of factors for prevention and intervention is also a crucial issue in preventive medicine. Modifiable factors, such as life-style variables and body measurements, for reducing the risk of the disease are especially interesting to public health professionals.

Manuscript received May 25, 2004; revised June 10, 2005 and April 30, 2006. This work was supported in part by the National Science Council (NSC), Taiwan, R.O.C. under Grant NSC89-2745-P-182-002 and Grant NSC90-2745-P-155-003.

C. Chiu is with the Department of Information Management, Yuan Ze University, Chungli 320, Taiwan, R.O.C. (e-mail: [email protected]).

K.-H. Hsu, Y.-C. Chuang, and C.-J. Hwang are with the Department of Health Care Management, Chang Gung University, Taoyuan 333, Taiwan, R.O.C. (e-mail: [email protected]).

P.-L. Hsu is with the Department of Electronic Engineering, Ching Yun University, Chungli 320, Taiwan, R.O.C. (e-mail: [email protected]).

C.-I. Hsu is with the Department of Information Management, Kai Nan University, Taoyuan 338, Taiwan, R.O.C. (e-mail: [email protected]).

P.-C. Lee was with the Department of Information Management, Yuan Ze University, Chungli 320, Taiwan, R.O.C. He is now with the Department of Computer Science and Information Management, Chang Gung University, Taoyuan 333, Taiwan, R.O.C.

W.-K. Chiou is with the Department of Industrial Design, Chang Gung University, Taoyuan 333, Taiwan, R.O.C.

T.-H. Liu was with the Department of Industrial Design, Chang Gung University, Taoyuan 333, Taiwan, R.O.C. He is now with the Department of Industrial Design, Mingchi University of Technology, Taipei 243, Taiwan, R.O.C.

Digital Object Identifier 10.1109/TITB.2006.884362

Owing to the recent development of a new three-dimensional (3-D) scanning technology that has many advantages over the old system of anthropometric measurements [5], [6] using tape measures, anthropometers (a special measuring ruler) [7], and other similar instruments, the Department of Health Management of Chang Gung Medical Center in Taiwan is able to collect 3-D anthropometric body-surface scanning data easily and accurately [8]. Unlike other anthropometric databases that contain only geometric and demographic data on human subjects, each of the scanned body data sets in the Chang Gung Medical Center database is linked to the health record and the clinical record through the medical center computer.

Therefore, the exploration of these 3-D data along with other existing subject medical profiles using data mining techniques has become an important research issue for medical decision support [9], [10]. Traditional approaches, usually applied statistical techniques, determine the relationship between a target disease and the corresponding anthropometric data.

Tree induction is one of the most common methods of knowledge discovery. It is a method for discovering a tree-like pattern that can be used for classification or estimation. Some of the often-mentioned decision-tree induction methods, such as C4.5 [11], classification and regression trees (CART) [12], and Quest [13], are not evolutionary-based approaches. Basically, an ideal decision-tree induction technique has to carefully address aspects such as model comprehensibility and interestingness, attribute selection, learning efficiency, and effectiveness. The genetic algorithm (GA), one of the often used evolutionary computation techniques, is known for its flexible and expressive problem representation and its heuristic-based search capability for knowledge discovery. In the past, GAs were mostly employed to enhance the learning process of soft computing techniques such as neural nets or fuzzy expert systems, rather than to discover models or patterns. That is, GAs act as a method for performing a guided search for good models in the solution space. While GAs are an interesting approach for discovering hidden valuable knowledge, they have to cope with computational efficiency on large volumes of data.

Generally, decision-tree induction methods are used to automatically produce rule sets for predicting the expected outcomes as accurately as possible. However, the emphasis on revealing novel or interesting knowledge has become a recent research issue in data mining. These attempts may impose additional tree discovery constraints, and thereby produce additional computation overhead. For regular GA operations, constraint validation is carried out after a candidate chromosome is produced. That is, several iterations may be required to determine a valid chromosome (i.e., pattern). One way to reduce the computational load is to obtain associated information, such as attributes or their values, before the initial chromosomes are generated, thereby improving the efficiency and effectiveness of the evolution. Potentially, this can be done by applying association rule algorithms (ARAs) to find clues related to the classification results.

1089-7771/$25.00 © 2007 IEEE

This paper proposes a rule-based classification approach that integrates an ARA with a GA to discover classification trees. An ARA, which is also known as market basket analysis, is used for attribute selection; therefore, the related input variables can be determined before executing the GA’s evolution.

The Apriori algorithm is a popular association rule technique for discovering attribute relationships, which are then converted to formulate the initial GA population. The proposed method attempts to enhance the GA’s search performance by supplying more important clues leading to the final patterns. A prototype system based on the association-based GA (AGA) approach is adopted to predict hypertension. A similar application of the AGA approach and a detailed introduction to AGA can be found in [14]. The application data, which consist of 3-D anthropometry data and related profile information, were derived from Chang Gung Memorial Hospital, Taiwan.

In order to compare with the often used classification tree techniques, a simple GA (SGA), C5.0, and CART are applied to the same data sets. SGA is a pure GA program without information derived from the Apriori algorithm.

The remainder of this paper is organized as follows. In Section II, previous research works and related techniques are reviewed. The detailed AGA procedures are then introduced in Section III. Section IV presents the experiments and results with medical data sets, followed by discussion and conclusions.

II. LITERATURE REVIEW

A. Instruments and Procedures

The Chang Gung whole body 3-D laser scanner scans a cylindrical volume of approximately 1.9 m in height and 1.0 m in diameter. These dimensions accommodate the vast majority of human subjects. In order to capture the measuring volume of the whole body at the same time, we used six laser scanners to surround the whole body through 360°. Every two scanners were driven along a vertically moving slide rail, with three vertically moving shafts connected to a stepping motor. Therefore, we could obtain data from six different angles with the six scanners. These data were combined to reconstruct the parameters into the anthropometric measures specified in this study.

A platform structure supports the subject and provides an alignment for the towers. The system was built to withstand shipping and repeated use without alignment or adjustment. The standard scanning apparel for both men and women included light gray cotton biker shorts, and a gray sports bra for women. Latex caps were used to cover the hair on subjects’ heads. Each subject was measured in three different scanning postures. Automatic landmark recognition (ALR) technology was used to automatically extract anatomical landmarks from the 3-D body-scan data. The landmarks were then placed on the subject.

TABLE I
VARIABLES USED IN HYPERTENSION DETECTION MODEL

More than 30 measurements of new anthropometric factors that traditional measurements did not provide can be collected from this whole body 3-D laser scanner. The data used in this study were reviewed by Chang Gung Medical Center’s medical professionals, whose research area is the biological meaning of the human body. We discarded those measurements that had no anatomical or physiological meaning. Conversely, some measures that were not frequently used in traditional measurements were used in this analysis, including cross-sectional area measurements, surface area measurements, and volume measurements.

In order to detect hypertension, the physicians considered the data attributes other than the 3-D anthropometric information to be useful in supporting the judgment. In addition, not all patients had a complete set of non-3-D factors such as demographic data, biochemical tests, risk factors, and family history of disease. Therefore, we collected the subjects’ data mainly on the basis of those patients with complete non-3-D data. Consequently, seven 3-D anthropometric data attributes were adopted for data mining because of their availability for all subjects. The 3-D factors included in this research are body mass index (BMI), left arm volume (LAV), trunk surface area (TSA), weight (W), waist circumference (WC), waist-hip ratio (WHR), and waist width (WW). The other non-3-D factors and their corresponding descriptions are detailed in Table I. The body scanner scans the human body from head to feet using the laser in the cohorizontal plane around the whole body. The computer processes the 3-D data at a rate of 60 laser beams per second. For a body height of 180 cm, when the Gemini10085 scanner is set up for a 4-mm vertical scanning resolution, it takes 7.5 s to scan the whole body. If it is configured for a vertical scanning resolution of 2.5 mm, then it takes 12 s to scan the whole body. Subjects were also asked to provide demographic data such as age, ethnic group, sex, area of residence, education level, present occupation, and family income. Related hospital health records and clinical records were obtained for each subject, if available.
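The scan times quoted above follow from the body height, the vertical resolution, and the rate of 60 profiles per second. The short Python check below reproduces that arithmetic; the function name and its parameters are illustrative only and are not part of the scanner software.

```python
def scan_time_seconds(body_height_mm, resolution_mm, profiles_per_second=60.0):
    """Time to sweep the full body height, one laser profile per vertical step."""
    n_profiles = body_height_mm / resolution_mm
    return n_profiles / profiles_per_second

# 180 cm body height, as in the text
print(scan_time_seconds(1800, 4.0))  # 450 profiles at 60 per second -> 7.5
print(scan_time_seconds(1800, 2.5))  # 720 profiles at 60 per second -> 12.0
```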

The 3-D laser scanning system is based on optical triangulation of reflected photo profiles of an incident, cross-sectional plane of laser light that travels around the segment. The profile is collected by a digitizer camera and is used to characterize the 3-D spatial surface geometry. The markers developed for use with the ALR algorithm are flat, adhesive-backed disks with a concentric circular, high-contrast pattern, consisting of a 6-mm diameter black center circle and a 12-mm diameter outer white annulus.

Anthropometric measurements were performed to determine BMI and WHR. Data were coded by the computer. The results were correlated with data on blood pressure, blood glucose, lipid, and uric acid levels. The health index (HI) was determined using the following equation:

$$\mathrm{HI} = \frac{\text{body weight} \times 2 \times \text{waist profile area}}{\text{body height}^{2} \times (\text{breast profile area} + \text{hip profile area})}.$$

B. Decision Tree and Rule Induction

Current rule induction systems typically fall into two categories: “divide and conquer” [11] and “separate and conquer” [15]. The former recursively partitions the instance space until regions of roughly uniform class membership are obtained. The latter induces a rule that explains a part of its training instances at a time, and recursively conquers the remaining examples by learning more rules until no examples remain.

Rule induction methods may also be categorized into either tree-based or nontree-based methods [16]. Some of the often-mentioned decision tree induction methods include the C4.5 [11], CART [12], and GOTA [17] algorithms. Both decision trees and rules can be described as disjunctive normal form (DNF) models. Decision trees are generated from data in a top–down, general-to-specific direction [18]. Each path to a terminal node is represented as a rule consisting of a conjunction of tests on the path’s internal nodes. These ordered rules are mutually exclusive [19]. Quinlan introduced techniques to transform an induced decision tree into a set of production rules [11].

Among the data mining techniques, a decision tree is one of the most commonly used methods for knowledge discovery. A decision tree is easy to understand and has a simple top–down tree structure in which decisions are made at each node. The nodes at the bottom of the resulting tree provide the final outcome, either a discrete or a continuous value. When the outcome is discrete, a classification tree is developed [20], while a regression tree is developed when the outcome is numerical and continuous [21].

Classification is a critical type of prediction problem. Classification aims to examine the features of a newly presented object and assign it to one of a predefined set of classes [22]. Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The classification tree induction process often selects the attribute at a given node according to a criterion such as Quinlan’s information gain (ID3) or gain ratio (C4.5), or Quest’s statistics-based approach, to determine proper attributes and split points for a tree [11], [13], [23].
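As a minimal illustration of the two entropy-based criteria named above, the following Python sketch computes information gain and gain ratio for a single categorical attribute. The function names and the toy data are assumptions made for illustration; they are not taken from the ID3 or C4.5 implementations.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())

def split_scores(values, labels):
    """Return (information gain, gain ratio) for splitting on one categorical attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class label after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the attribute itself
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Hypothetical toy data: the attribute separates the two classes perfectly
sex = ["F", "F", "M", "M"]
label = ["abnormal", "abnormal", "normal", "normal"]
gain, ratio = split_scores(sex, label)
```

An attribute that leaves each branch with the same class mix as the parent node would score a gain of zero under the same function.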

C. GA for Rule Induction

GAs have been successfully applied to data mining for rule discovery in the literature. Some techniques use the Michigan approach (one-rule-per-individual encoding) proposed by Greene and Smith [24] and Noda et al. [25]. In the Michigan approach, a chromosome is usually a linear string of rule conditions, where each condition is often an attribute–value pair, representing a single rule. Although the individual encoding is simpler and syntactically shorter, the problem is that the fitness of a single rule is not necessarily the best indicator of the quality of the discovered rule set. In contrast, the Pittsburgh approach (several-rules-per-individual encoding) [26], [27] has the advantage of considering its rule set as a whole, taking rule interactions into account. However, this approach makes the chromosome encoding more complicated and syntactically longer, which usually requires more complex genetic operators. More recently, Carvalho and Freitas [28] applied a GA to discover small disjunct rules. Further, a hybrid decision tree/GA method was developed to discover classification rules [29].

In order to discover high-level prediction rules, Freitas [30] applied first-order relationships such as “Salary > Mortgage” by checking an attribute compatibility table (ACT) during the discovery process with GA-Nuggets. ACT was claimed to be particularly effective for its knowledge representation capability. A further extension that allows linear or nonlinear multivariate relationships in the classification tree splits is viable. In this paper, the proposed approach adopts only linear multivariate relationships to help reduce the search space during the GA’s evolution process.

D. ARAs for Attributes Selection

Many algorithms can be used to discover association rules from data in order to identify patterns of behavior. One of the best known and most widely used is the Apriori algorithm, which is discussed in [31] and [32]. For instance, an ARA is able to produce a rule such as the following:

When gender is female and age is greater than 55, then hypertension is present 20% of the time.

An ARA, given the minimum support and confidence levels, is able to quickly produce rules from a set of data through the discovery of the so-called frequent itemsets. A rule has two measures, called confidence and support. Support (or prevalence) measures how often all the items in the rule occur together in a record (transaction), as a percentage of the total number of records. Confidence measures how often the items in the rule consequent occur together with the antecedent in a record, as a percentage of the number of records containing all items in the rule antecedent.

Fig. 1. Conceptual framework of AGA.

An association rule, adapted from Agrawal et al. [31], is an implication of the form X → Y, where X and Y are sets of items called itemsets. The support (s) for an association rule X → Y is the percentage of transactions in the database that contain X ∪ Y. The confidence (c) for an association rule X → Y is the percentage of transactions in the database containing X that also contain Y.
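The two measures defined above can be computed directly by counting transactions. The Python sketch below does so for one rule X → Y; the item strings and the five toy records are hypothetical and serve only to make the definitions concrete.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    n_x = sum(1 for t in transactions if x <= t)    # transactions containing X
    n_xy = sum(1 for t in transactions if xy <= t)  # transactions containing X and Y
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# Hypothetical records, each a set of items
records = [
    {"sex=F", "age>55", "hypertension"},
    {"sex=F", "age>55"},
    {"sex=F", "age>55"},
    {"sex=M", "age>55", "hypertension"},
    {"sex=F", "age<=55"},
]
s, c = support_confidence(records, {"sex=F", "age>55"}, {"hypertension"})
# s = 1/5 (one record in five contains X and Y); c = 1/3 (one of three records containing X)
```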

Owing to the advantage of ARAs in deriving associations among data items efficiently, recent data mining research in classification tree construction has attempted to adopt this mechanism for knowledge preprocessing. For example, the Apriori algorithm was applied to produce association rules that can be converted into the initial population of genetic programming (GP) [33], [34]. Improved learning efficiency for that method was demonstrated when compared with a GP that did not use ARAs. However, the handling of multivariate classification problems and better learning accuracy were not addressed in those studies. Further, an evolutionary algorithm (EA) can be employed to confine the search space for rule induction that is subsequently explored by GP [35].

III. HYBRID OF ARAS AND AGAS

The proposed AGA approach consists of three modules. According to Fig. 1, these modules are:

1) Association Rule Mining;
2) Tree Initialization;
3) Classification Tree Mining.

The association rule mining module generates association rules using the Apriori algorithm. In this research, the items derived from categorical attributes are used to construct the association rules. The association rule here is an implication of the form X → Y, where X is the conjunction of conditions and Y is the class. The rule X → Y has to satisfy user-specified minimum support and minimum confidence levels.
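The association rule mining module can be sketched as a level-wise, Apriori-style search over categorical items that keeps only rules whose consequent is a class label. The Python code below is a simplified illustration, not the implementation used in this research: the function name, the record layout, and the toy data are assumptions, and the candidate pruning is lighter than in the full Apriori algorithm (every candidate is still counted directly, so the reported rules satisfy both thresholds).

```python
from itertools import combinations

def class_association_rules(records, labels, min_support, min_confidence):
    """Level-wise search for rules X -> class.

    records: list of dicts {attribute: value} over categorical attributes.
    Returns a list of (antecedent dict, class label, support, confidence).
    """
    n = len(records)
    rows = [frozenset(r.items()) for r in records]

    def count(itemset, label=None):
        return sum(1 for row, y in zip(rows, labels)
                   if itemset <= row and (label is None or y == label))

    rules = []
    # Level 1 candidates: every single (attribute, value) item in the data
    level = {frozenset([item]) for row in rows for item in row}
    while level:
        frequent = set()
        for itemset in level:
            n_x = count(itemset)
            if n_x == 0:
                continue
            for label in set(labels):
                n_xy = count(itemset, label)
                if n_xy / n >= min_support:
                    frequent.add(itemset)
                    if n_xy / n_x >= min_confidence:
                        rules.append((dict(itemset), label, n_xy / n, n_xy / n_x))
        # Join frequent k-itemsets into (k+1)-candidates; an attribute may appear only once
        level = {a | b for a, b in combinations(frequent, 2)
                 if len(a | b) == len(a) + 1
                 and len({attr for attr, _ in a | b}) == len(a | b)}
    return rules

# Hypothetical categorical records and class labels
records = [
    {"sex": "F", "smoker": "yes"},
    {"sex": "F", "smoker": "yes"},
    {"sex": "F", "smoker": "no"},
    {"sex": "M", "smoker": "no"},
]
labels = ["abnormal", "abnormal", "normal", "normal"]
rules = class_association_rules(records, labels, min_support=0.5, min_confidence=1.0)
```

With these thresholds the toy data yield three rules, for example smoker = yes → abnormal with support 0.5 and confidence 1.0.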

The tree initialization module constructs the potential candidates of classification trees for better predictive accuracy. Fig. 2 illustrates the classification tree with n rules. Each rule consists of two types of attributes: categorical and continuous. Categorical attributes provide part of the conjunction of conditions, and the rest is contributed by continuous attributes in the form of inequalities. The default class in this research is specifically defined as an unclassified class, although it is generally defined as the majority class. From a more conservative point of view in the disease classification domain, the default class in this research is treated as a classification error. The formal form of a rule is A and B → C, where A denotes the antecedent part of the association rule; B is the conjunction of inequality functions in which the continuous attributes, relational operators, and splitting values are determined by the GA; and C is the classification result directly obtained from the association rule. For example, suppose x1, x2, x3 are categorical attributes and x4, x5 are continuous attributes. Assume that an association rule “x1 = 5 and x3 = 3 → Ck” is selected; a rule can then be specified as follows.

Rule i: If x1 = 5 and x3 = 3 and x4 ≤ 10 and x5 > 1 Then Class Ck.

Fig. 2. Classification tree illustration.

The execution steps of the classification tree initialization module are stated as follows.

Step a: For each rule, the antecedent condition is obtained from an association rule that is randomly selected from the entire set of association rules generated in advance. That is, the “A” part of the rule is determined.

Step b: By applying the GA, the relational operator (≤ or >) and splitting point are determined for each continuous attribute. Thus, the “B” part of the rule is determined.


Fig. 3. Conceptual diagram of the proposed classification tree mining.

Step c: The classification result of a rule comes directly from the consequent of the derived association rule. Subsequently, each classification rule is generated by repeatedly applying Steps a and b n times, where n is a value automatically determined by the GA.
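Steps a to c can be sketched in Python as follows. This is an illustrative reading of the initialization module, not the authors’ code: the dictionary layout with keys "A", "B", and "C", the helper names, and the attribute ranges are assumptions. For simplicity, every continuous attribute receives an inequality and the initial number of rules is drawn at random, whereas in the text these choices are left to the GA.

```python
import random

def init_rule(association_rules, continuous_ranges, rng):
    """Build one rule of the form 'A and B -> C'."""
    antecedent, label = rng.choice(association_rules)    # Step a (A part) and Step c (C part)
    inequalities = {}
    for attr, (low, high) in continuous_ranges.items():  # Step b (B part), random starting point
        inequalities[attr] = (rng.choice(["<=", ">"]), rng.uniform(low, high))
    return {"A": dict(antecedent), "B": inequalities, "C": label}

def init_tree(association_rules, continuous_ranges, max_rules, rng):
    """A candidate classification tree is a list of n rules, with 1 <= n <= max_rules."""
    n = rng.randint(1, max_rules)
    return [init_rule(association_rules, continuous_ranges, rng) for _ in range(n)]

# Hypothetical association rules (antecedent, class) and continuous attribute ranges
rng = random.Random(42)
association_rules = [({"sex": "F", "smoker": "yes"}, "abnormal"), ({"smoker": "no"}, "normal")]
continuous_ranges = {"BMI": (15.0, 40.0), "WHR": (0.6, 1.2)}
tree = init_tree(association_rules, continuous_ranges, max_rules=5, rng=rng)
```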

The classification tree mining module applies the GA to search for a superior classification tree based on the potential candidates generated in the tree initialization module. Fig. 3 shows the proposed AGA approach for classification tree mining. The details of each step are as follows.

Step a: Chromosome encoding—The classification tree can be easily presented in the form of “Rule 1, Rule 2, . . ., Rule n,” where n is the total number of classification rules and is automatically determined by the GA.

Step b: GA initialization—A potential set of chromosomes is generated at the beginning. The initial population is obtained from the tree initialization module.

Step c: Fitness evaluation—The fitness value is calculated for each chromosome in the current population. The fitness function is defined as the total number of misclassifications.

Step d: Stop condition met—If the specified stopping condition is satisfied, then the entire process is terminated and the optimal classification tree is confirmed; otherwise, the GA operations continue.

Step e: GA operators—Each GA operation comprises chromosome selection, crossover, and mutation in order to produce the offspring generation based on different GA parameter settings.

Step f: Chromosome decoding—The best chromosome is then transformed into the optimal classification tree.
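The loop formed by Steps c to e can be sketched as below. The sketch assumes a chromosome is a list of rule dictionaries with keys "A" (categorical conditions), "B" (inequalities), and "C" (class), and it uses binary tournament selection, one-point crossover on the rule list, elitism, and a redraw mutation. These operator choices, the function names, and the default parameters are assumptions for illustration; the actual settings are those of Table III. The fitness follows the text: the number of misclassifications, with unclassified records counted as errors.

```python
import random

def predict(tree, record):
    """Return the class of the first rule that fires, or None (unclassified)."""
    for rule in tree:
        if all(record[a] == v for a, v in rule["A"].items()) and all(
            (record[a] <= t) if op == "<=" else (record[a] > t)
            for a, (op, t) in rule["B"].items()
        ):
            return rule["C"]
    return None

def fitness(tree, records, labels):
    """Step c: number of misclassified records (lower is better)."""
    return sum(predict(tree, r) != y for r, y in zip(records, labels))

def crossover(p1, p2, rng):
    """One-point crossover on the two parents' rule lists."""
    child = p1[:rng.randint(0, len(p1))] + p2[rng.randint(0, len(p2)):]
    return child or list(p1)  # never return an empty tree

def mutate(tree, ranges, rate, rng):
    """With probability `rate` per inequality, redraw its operator and split point."""
    new_tree = []
    for rule in tree:
        b = {}
        for a, (op, t) in rule["B"].items():
            if rng.random() < rate:
                op, t = rng.choice(["<=", ">"]), rng.uniform(*ranges[a])
            b[a] = (op, t)
        new_tree.append({"A": rule["A"], "B": b, "C": rule["C"]})
    return new_tree

def evolve(population, records, labels, ranges, generations=1000, rate=0.1, seed=0):
    """Run the GA on a population of at least two trees and return the best tree found."""
    rng = random.Random(seed)
    score = lambda tree: fitness(tree, records, labels)
    best = min(population, key=score)
    for _ in range(generations):
        if score(best) == 0:                     # Step d: stop when nothing is misclassified
            break
        offspring = [best]                       # elitism keeps the best tree so far
        while len(offspring) < len(population):  # Step e: selection, crossover, mutation
            p1 = min(rng.sample(population, 2), key=score)
            p2 = min(rng.sample(population, 2), key=score)
            offspring.append(mutate(crossover(p1, p2, rng), ranges, rate, rng))
        population = offspring
        best = min(population, key=score)
    return best
```

Because the best tree is always carried over, the returned tree never misclassifies more records than the best tree of the initial population.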

A. GA Chromosome Encoding

TABLE II
RULE REPRESENTATION FOR DRUG A

In order to further illustrate the chromosome encoding scheme, a drug recommendation example is used to explain the details. In this paper, a set of rules is encoded as a chromosome based on the several-rules-per-individual approach. This approach allows the set of rules describing all possible conditions associated with a single drug type to consist of at least one rule. Each drug type should own its set of rules, which is typically represented in the following format:

(“IF cond1,1 AND . . . AND cond1,n THEN drug = A”
or “IF cond2,1 AND . . . AND cond2,n THEN drug = A”
or “IF cond3,1 AND . . . AND cond3,n THEN drug = A”)
and
(“IF cond4,1 AND . . . AND cond4,n THEN drug = B”)
. . .
and
(“IF condm−1,1 AND . . . AND condm−1,n THEN drug = E”
or “IF condm,1 AND . . . AND condm,n THEN drug = E”)

For example, the rule set for Drug A can be encoded as the following rules, shown in Table II:

IF (Age >= 25) AND (Sex = “M”) AND (K >= 0.03) THEN Drug A
or
IF (BP = High) AND (Freq >= 3) AND (Na <= 1.2 ∗ K) THEN Drug A
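The Drug A rule set above is a disjunction of conjunctions, and its last condition (Na <= 1.2 ∗ K) is an instance of a linear multivariate relationship between two attributes. A minimal Python sketch of how such a rule set could be evaluated is given below; the record layout and the two patient records are hypothetical.

```python
# Each inner list is one rule (conditions joined by AND); the outer list joins rules by OR.
drug_a_rules = [
    [lambda r: r["Age"] >= 25, lambda r: r["Sex"] == "M", lambda r: r["K"] >= 0.03],
    [lambda r: r["BP"] == "High", lambda r: r["Freq"] >= 3, lambda r: r["Na"] <= 1.2 * r["K"]],
]

def recommends(rule_set, record):
    """True if at least one rule has all of its conditions satisfied."""
    return any(all(cond(record) for cond in rule) for rule in rule_set)

# Hypothetical patient records
patient_1 = {"Age": 40, "Sex": "F", "BP": "High", "Freq": 4, "Na": 0.05, "K": 0.05}
patient_2 = {"Age": 20, "Sex": "M", "BP": "Low", "Freq": 1, "Na": 0.10, "K": 0.02}

print(recommends(drug_a_rules, patient_1))  # True: the second rule fires
print(recommends(drug_a_rules, patient_2))  # False: neither rule fires
```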


TABLE III
GA PARAMETER SETTINGS

IV. EXPERIMENTS AND RESULTS

Before mining the hypertension data set, the heart disease data set from the UCI repository [36] was used to validate our proposed approach. For comparison purposes, SGA, C5.0, CART, and logistic regression techniques were applied to all the data sets. Generally, the association rules extracted by the Apriori algorithm vary depending on the defined support and confidence values. Different extracted association rules may have different impacts on AGA learning performance. Therefore, this research experimented with different sets of minimum support and confidence values for both the heart disease and hypertension problems. The evaluation of the classification trees generated by the SGA, AGA, C5.0, CART, and logistic regression approaches was based on ten-fold cross validation. That is, each training stage used 9/10 of the entire data records, with the remaining 1/10 used for the testing stage. The GA parameter settings for both applications are summarized in Table III. The GA executed up to 1000 generations for each data set. The longest execution time was less than 5 min. The AGA’s total execution time is the sum of the time consumed by the Apriori algorithm and the GA. The execution time consumed by the Apriori algorithm is at most about 0.001 min on a Pentium 4 2.8-GHz personal computer with 512 MB RAM.
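The ten-fold protocol described above can be sketched as an index split in which every record is used for testing exactly once. The Python code below is one possible realization; the shuffling, the seed, and the function name are assumptions, since the text does not state how the folds were drawn.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train, test) index lists; each record appears in exactly one test fold."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Example with the size of the hypertension data set (1152 records)
splits = list(k_fold_indices(1152))
```

Each training set then holds roughly 9/10 of the records and each test set the remaining 1/10.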

A. Data Set 1: Heart Disease

The 303 collected heart disease data records consisted of seven categorical and six continuous attributes for the input part. The output part is a categorical attribute indicating health or sickness with heart disease. Data of the seven categorical attributes are fed into the Apriori algorithm to produce the association rules. Among the data records, 160 are healthy, 137 are sick, and six have missing data.

After several trials with different sets of minimum support values and confidence values, the best AGA learning performance was obtained. The derived association rule sets consist of 86 rules for the “healthy” output category and 101 rules for the “sick” output category.

The testing results of the SGA, AGA, C5.0, CART, and logistic regression using these three data sets are summarized in Table IV, along with their corresponding minimum support values and confidence values. As shown in this table, the AGA’s accuracy rate for the heart disease data in the testing stage reaches 82.17%. The C5.0 accuracy rate in ten-fold testing for the heart disease data is 79.1%. The CART accuracy rate in ten-fold testing for the heart disease data is 76.8%. The logistic regression accuracy in ten-fold testing for the heart disease data is 72.1%.

TABLE IV
TEN-FOLD TESTING ACCURACY RATES AND STANDARD DEVIATION

Fig. 4. Learning progress over generations (based on ten-fold average).

In order to obtain more details about the learning progress of the SGA and AGA approaches, the learning track behavior was recorded in sessions. Fig. 4 depicts the entire learning progress monitored over generations (up to 1000 generations) and the corresponding time required for the three data sets. Table V presents AGA’s detailed classification tree notation for one of the relatively better classification trees derived from the heart disease data set.

B. Data Set 2: Hypertension

This study collected data on 1152 subjects from the Department of Health Examination, Chang Gung Memorial Hospital, from July 2000 to July 2001. Of these, 489 subjects had hypertension (abnormal) and 663 did not (normal). All subjects were local residents of Taiwan without a history of systemic disease. Standardized 3-D anthropometry scanning protocols were followed, and the data were collected by trained staff.

Subjects were instructed to fast for 12 to 14 h prior to testing, and compliance regarding fasting was determined by interview on the morning of the examination. Measurements of height (to 0.1 cm) and weight (to 0.1 kg) were performed according to the protocols specified in [37]. BMI [weight (kg)/height (m)²] was used as an indicator of obesity. We used body weight and body height measured by conventional methods. Blood pressure measurements were made on the right arm in seated, relaxed subjects. For both systolic and diastolic blood pressure, the average of six replicate mercury readings taken by two randomly assigned and trained nurses was used in the analyses. Blood pressure levels were classified according to the 1999 World Health Organization-International Society of Hypertension guidelines [38].

TABLE V
NOTATION OF CLASSIFICATION TREE (HEART DISEASE DATA)

Hypertension was defined as a systolic blood pressure (SBP) of 140 mmHg or greater and/or a diastolic blood pressure (DBP) of 90 mmHg or greater. High-normal blood pressure was defined as an SBP of 130–139 mmHg or a DBP of 85–89 mmHg. Normal blood pressure was defined as an SBP of 120–129 mmHg and a DBP of 80–84 mmHg; optimal blood pressure, as an SBP of less than 120 mmHg and a DBP of less than 80 mmHg. When SBP and DBP fell into different categories, the higher category was applied.
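The category definitions above, together with the rule that the higher of the SBP and DBP categories applies, can be written as a small lookup. The Python sketch below is an illustration of these stated cut-offs only; the function name is hypothetical.

```python
def bp_category(sbp, dbp):
    """Classify a blood pressure reading (mmHg) using the cut-offs stated in the text."""
    def level(value, cuts):
        # Number of cut-offs reached: 0 = optimal ... 3 = hypertension
        return sum(value >= c for c in cuts)
    names = ["optimal", "normal", "high-normal", "hypertension"]
    # When SBP and DBP fall into different categories, the higher one applies
    return names[max(level(sbp, (120, 130, 140)), level(dbp, (80, 85, 90)))]

print(bp_category(118, 76))  # optimal
print(bp_category(125, 82))  # normal
print(bp_category(135, 82))  # high-normal
print(bp_category(128, 92))  # hypertension
```

In the data set described here, the "abnormal" output class corresponds to the hypertension category.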

Each data record contains five categorical attributes and 11 continuous attributes for the input part. The output part is a categorical attribute whose value is either normal or abnormal. The attributes used in this data set are specified in Table I.

After several trials with different sets of minimum support values and confidence values, the best AGA learning performance was obtained. These results were based on the minimum support value (=8) and confidence value (=100). The derived association rule sets consisted of 75 rules for the “normal” output category and 51 rules for the “abnormal” output category. The testing results of the SGA, AGA, C5.0, CART, and logistic regression are summarized in Table VI. As shown in Table VI, the AGA’s accuracy rate in the testing stage reaches 92.48%, compared with 88.3% for C5.0, 85% for CART, and 75.5% for logistic regression.

The learning progress monitored up to 1000 generations for the SGA and AGA approaches was recorded in sessions, as depicted in Fig. 5. Table VII shows the detailed classification tree notation for one of the relatively better classification trees derived.

TABLE VI
TEN-FOLD TESTING ACCURACY RATES AND STANDARD DEVIATION

Fig. 5. Learning progress over generations (based on ten-fold average).

TABLE VII
NOTATION FOR EACH CLASSIFICATION TREE FOR A GIVEN DATA FOLD (HYPERTENSION DATA)

In order to compare the prediction accuracy of the classification trees based on the entire hypertension data set with that of trees based on the same data set excluding 3-D information, we conducted another experiment to construct the classification trees using data without 3-D information. The results in Table VIII indicate that the model based on data without 3-D information exhibits inferior prediction accuracy. The relatively better test result shown in Table VIII is 80.21%.

TABLE VIII
TEN-FOLD TRAINING/TESTING RESULTS (HYPERTENSION DATA WITHOUT 3-D INFORMATION)

TABLE IX
SUMMARIZED WIN–LOSS TABLES (HEART DISEASE DATA)

TABLE X
SUMMARIZED WIN–LOSS TABLES (3-D HYPERTENSION DATA)

The Wilcoxon signed rank test is also adopted for examining the classification results in each test fold across five different generations of each data test. The Wilcoxon signed rank test, also known as the Wilcoxon matched pairs test, is a nonparametric test used to test the median difference in paired data. The test is based on the magnitude of the difference between the pairs of observations. Further detail about the calculation and interpretation of the Wilcoxon signed rank test can be found in [39] and [40]. We define a win/tie/loss as the accuracy rate of a given test fold of AGA being greater than/equal to/less than that of SGA. For example, (100) denotes that AGA is greater than SGA; (010) denotes that AGA ties with SGA; (001) denotes that AGA is less than SGA. In the tests with the win/loss outcome (as shown in Tables IX and X), the p-value derived is 0.000 for the heart disease data and 0.021 for the 3-D data. Both p-values are below 0.05 and were considered significant, indicating that there are statistically significant differences in the classification results between AGA and SGA. These results support the hypothesis that AGA has better classification performance than SGA.
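For readers unfamiliar with the test, the following Python sketch computes the signed rank statistics and an exact two-sided p-value for paired fold accuracies by enumerating all sign patterns, which is feasible for ten folds. It is a simplified illustration rather than the procedure of [39] and [40]: zero differences are dropped, tied magnitudes receive average ranks, and the accuracy values are invented for the example and are not the results reported in Tables IX and X.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, exact two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):                            # average ranks over tied magnitudes
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern of the ranks is equally likely
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= stat:
            hits += 1
    return w_plus, w_minus, hits / 2 ** len(ranks)

# Hypothetical ten-fold accuracies (in percent) for AGA and SGA
aga = [92, 90, 91, 93, 89, 94, 90, 92, 91, 93]
sga = [83, 84, 82, 85, 83, 84, 82, 83, 84, 82]
w_plus, w_minus, p = wilcoxon_signed_rank(aga, sga)
# All ten differences favor AGA, so W_minus is 0 and p = 2/1024
```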

V. DISCUSSION

According to the results indicated above, AGA achieves better learning performance than SGA in terms of computation efficiency and accuracy. AGA has also demonstrated better learning performance than C5.0, CART, and logistic regression with the data sets used in this research. By applying the association rule process, partial knowledge is extracted and transformed into seeding chromosomes. As seen in Fig. 4, the initial average accuracy rates for SGA and AGA are 35% and 80%, respectively, for the heart disease data; in Fig. 5, they are 23% and 73%, respectively, for the 3-D hypertension data. The improved initial learning performance is due to the derived GA seeding chromosomes, which result in a more effective search for better solutions.
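The seeding idea can be illustrated with a short sketch. The chromosome layout here is an assumption made for illustration (one gene per attribute holding a discretized level, with `None` meaning the attribute is unused); the paper's actual encoding is not reproduced, and the names `ATTRIBUTES`, `LEVELS`, `seed_from_rule`, and `initial_population` are hypothetical.

```python
import random

# Attribute abbreviations appear in the paper; the gene layout is assumed.
ATTRIBUTES = ["AGE", "BMI", "WHR", "TSA"]
LEVELS = ["low", "mid", "high"]  # hypothetical discretized levels

def random_chromosome(rng):
    """One gene per attribute; None means the attribute is not used."""
    return {a: rng.choice(LEVELS + [None]) for a in ATTRIBUTES}

def seed_from_rule(rule, rng):
    """Copy the genes fixed by an association rule into a random chromosome."""
    chromosome = random_chromosome(rng)
    chromosome.update(rule)
    return chromosome

def initial_population(rules, size, rng):
    """Seeded chromosomes first, then random ones to fill the population."""
    seeds = [seed_from_rule(r, rng) for r in rules[:size]]
    fillers = [random_chromosome(rng) for _ in range(size - len(seeds))]
    return seeds + fillers

# Hypothetical rule antecedents, as an association rule step might produce.
rules = [{"BMI": "high", "TSA": "high"}, {"AGE": "high"}]
population = initial_population(rules, 10, random.Random(0))
```

A population built only from random chromosomes would correspond to the SGA starting point, which is one way to read the gap in initial accuracy between the two approaches.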

From these values, it can be noted that SGA takes more generations to reach a level of performance that AGA reaches in fewer generations. This can be attributed to the adoption of the Apriori algorithm, by which the GA search space is substantially reduced. It can also be seen that AGA consistently outperforms SGA over generations. Further, in terms of computation time, AGA usually takes less time than SGA to reach a given level of learning performance.

For comparison with C5.0, CART, and logistic regression, the ten-fold cross-validation testing results for the hypertension (with 3-D) data show accuracies ranging from 82.63% to 83.5% for SGA and from 84.71% to 92.48% for AGA among five different generation types.

AGA also outperforms C5.0, CART, and logistic regression, whose ten-fold cross-validation testing accuracies are 88.3%, 85%, and 75.5%, respectively. In the experiments, the standard deviations of the AGA testing results are generally larger than those of SGA. This may be mainly because AGA progresses more markedly than SGA does; that is, SGA maintains relatively stable performance over generations. In the initialization stage, AGA is fed with specific clues to influence the evolution. This process can be an unnatural intervention that disturbs regular GA evolution. Unlike the regular GA crossover and mutation operators, the enforced initial gene values derived from association rules may bring more chaos or disorder into the regular GA process. Thus, a higher standard deviation occurs and is seen across all the data sets.

When looking at the time required to obtain acceptable learning performance, it can be seen that AGA requires less time and fewer generations than SGA to achieve similar training accuracy rates.

By convention, the default class of a classification tree is the majority class in the data set. In contrast to our current definition of the default class as an unclassified error, defining the default class as the majority class will certainly result in higher classification accuracy rates. However, we hold that this type of medical disease classification problem should place more emphasis on the content of the classification rules, which helps to better explain the causes relating to the disease, than on achieving better prediction accuracy rates.
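The effect of the two default-class conventions on the accuracy rate can be shown with a small sketch. The labels and predictions below are hypothetical, and `accuracy` is an illustrative helper; `None` marks a subject that no rule in the tree covers.

```python
from collections import Counter

def accuracy(predictions, labels, default=None):
    """Accuracy where uncovered cases (None) take the `default` class.

    With default=None an uncovered case never matches its label,
    so it is counted as an unclassified error.
    """
    filled = [default if p is None else p for p in predictions]
    correct = sum(p == y for p, y in zip(filled, labels))
    return correct / len(labels)

# Hypothetical data: 1 = hypertension, 0 = no hypertension.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
predictions = [1, 0, None, 1, None, 0, 1, None, 0, 0]

majority_class = Counter(labels).most_common(1)[0][0]
strict_accuracy = accuracy(predictions, labels)
majority_accuracy = accuracy(predictions, labels, default=majority_class)
```

In this toy data the strict convention gives 0.6 and the majority-class convention gives 0.8. Filling uncovered cases with a class can only keep or raise the count of correct cases relative to treating them all as errors.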

As shown in Table VII, the results reveal clues for preventive medicine that show both distinction from and coherence with current knowledge of hypertension. Body shape played a major role in predicting hypertension in subjects without a family disease history but with very high BMI (>28.9) and relatively larger trunk surface area (>4601.9 cm²). The data also demonstrate that hypertension is found in subjects with a larger waist width (>20.5 cm) and a small body mass index relative to their age. The mechanism behind the relationship between 3-D body measures and hypertension is still imperfectly understood. Nevertheless, the findings suggest that a potential set of indicators that could extend the current knowledge boundary of diagnostic decision support systems in medicine is close at hand.
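One reading of the first rule described above can be written as a simple predicate. The field names are hypothetical, the thresholds are those quoted in the text, and the rule returns `None` when it does not fire, matching the unclassified default discussed earlier. The waist-width rule is not encoded because its BMI and age thresholds are not stated here.

```python
def rule_high_bmi_large_trunk(subject):
    """Return 'hypertension' if the rule fires, otherwise None (unclassified)."""
    if (not subject["family_history"]
            and subject["bmi"] > 28.9
            and subject["trunk_surface_area_cm2"] > 4601.9):
        return "hypertension"
    return None

# Hypothetical subjects, for illustration only.
subject_a = {"family_history": False, "bmi": 30.2, "trunk_surface_area_cm2": 4800.0}
subject_b = {"family_history": False, "bmi": 24.0, "trunk_surface_area_cm2": 4800.0}

result_a = rule_high_bmi_large_trunk(subject_a)
result_b = rule_high_bmi_large_trunk(subject_b)
```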

VI. CONCLUSION

We have introduced the AGA approach, which hybridizes the Apriori algorithm and the GA for classification tree induction. Results from predicting hypertension disease with anthropometric and 3-D measurements are promising and innovative in the field of biomedical sciences. Specifically, the significant predictors for hypertension are AGE, FHHT, WHR, WINE, WW, W, BMI, TSA, and TC.

Incorporating the associated knowledge related to the classification results is crucial for improving evolutionary-based mining tasks. By employing the ARA to acquire partial knowledge from data, our proposed approach is able to induce a classification tree more effectively owing to the Apriori algorithm. In terms of computation efficiency during the training process, AGA usually reaches a higher accuracy rate than SGA in a similar amount of execution time.

Predicting a subject’s hypertension disease from 3-D anthropometry data and other related profile information is a novel approach to medical decision support. Without the 3-D information, the rest of the hypertension data would not yield better prediction models. According to the experimental results, AGA has been shown to be a feasible way to provide an acceptable solution for predicting a subject’s hypertension disease, with about 90% classification accuracy.

REFERENCES

[1] S. R. Srinivasan, W. Bao, and G. S. Berenson, “Coexistence of increased levels of adiposity, insulin, and blood pressure in a young adult cohort with elevated very-low-density lipoprotein cholesterol: The Bogalusa heart study,” Metabolism, vol. 42, pp. 170–176, 1993.

[2] L. Mykkanen, S. M. Haffner, T. Ronnemaa, T. Bennemaa, R. N. Bergman, and M. Laakso, “Low insulin sensitivity is associated with clustering of cardiovascular disease risk factors,” Amer. J. Epidemiol., vol. 146, pp. 315–321, 1997.

[3] W. Chen, W. Bao, S. Begum, A. Elkasabany, S. R. Srinivasan, and G. S. Berenson, “Age-related patterns of the clustering of cardiovascular risk variables of syndrome X from childhood to young adulthood in population made up of black and white subjects: The Bogalusa heart study,” Diabetes, vol. 49, pp. 1042–1048, 2000.

[4] J. Jeppesen, H. O. Hein, P. Suadicani, and F. Gyntelberg, “High triglycerides and low HDL cholesterol and blood pressure and risk of ischemic heart disease,” Hypertension, vol. 36, pp. 226–232, 2000.

[5] A. M. Coombes, J. P. Moss, A. D. Linney, R. Richards, and D. R. James, “A mathematical method for the comparison of 3D changes in the facial surface,” Eur. J. Orthod., vol. 13, pp. 95–110, 1991.

[6] P. R. M. Jones, A. J. Baker, C. J. Hardy, and A. P. Mowat, “The measurement of body surface area in children with liver disease by a novel 3D body scanning device,” Eur. J. Appl. Physiol., vol. 68, pp. 514–518, 1984.

[7] K. Kroemer, “Engineering anthropometry,” Ergonomics, vol. 32, pp. 767–784, 1989.

[8] J. D. Lin, W. K. Chiou, H. F. Weng, Y. H. Tsai, and T. H. Liu, “Comparison of three-dimensional anthropometric body surface scanning to waist–hip ratio and body mass index in correlation with metabolic risk factors,” J. Clin. Epidemiol., vol. 55, pp. 757–766, 2002.

[9] P. R. M. Jones and M. Rioux, “Three-dimensional surface anthropometry: Applications to the human body,” Opt. Laser Eng., vol. 28, pp. 89–117, 1997.

[10] F. J. Meaney and L. A. Farrer, “Clinical anthropometry and medical genetics: A compilation of body measurement in genetic and congenital disorders,” Amer. J. Med. Genet., vol. 25, pp. 343–59, 1986.

[11] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.

[12] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Los Angeles, CA: Wadsworth, 1984.

[13] W. Y. Loh and Y. S. Shih, “Split selection methods for classification trees,” Statistica Sinica, vol. 7, pp. 815–840, 1997.

[14] P. L. Hsu, R. Lai, C. Chiu, and C.-I. Hsu, “The hybrid of association rule algorithms and genetic algorithms for tree induction: An example of predicting student learning performance,” Expert Sys. Appl., vol. 25, no. 1, pp. 51–62, 2003.

[15] P. Clark and T. Niblett, “The CN2 induction algorithm,” Mach. Learn. J., vol. 3, no. 4, pp. 261–283, 1989.

[16] M. K. Abdullah, “CAN: Chain of nodes approach to direct rule induction,” IEEE Trans. Sys., Man, Cybern. B, Cybern., vol. 29, no. 6, pp. 758–770, Dec. 1999.

[17] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, and C. L. Gerberich, “Application of information theory to the construction of efficient decision trees,” IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 565–577, Feb. 1982.

[18] C. Apte and S. Weiss, “Data mining with decision trees and decision rules,” Future Gener. Comput. Sys., vol. 13, no. 2–3, pp. 197–210, Nov. 1997.

[19] P. Clark and R. Boswell, “Rule induction with CN2: Some recent improvements,” in Proc. 5th Eur. Conf.: Mach. Learn., Berlin, Germany, 1991, pp. 51–163.

[20] K. J. Hunt, “Classification by induction: Application to modeling and control of non-linear dynamical systems,” Intell. Sys. Eng., vol. 24, pp. 231–245, 1993.

[21] J. Bala and K. De Jong, “Using learning to facilitate the evolution of features for recognizing visual concepts,” Evol. Comput., vol. 4, no. 3, pp. 297–312, 1995.

[22] J. A. Michael and L. Gordon, Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: Wiley, 1997.

[23] J. R. Quinlan, “Induction of decision trees,” Mach. Learn., vol. 1, pp. 81–106, 1986.

[24] P. D. Greene and S. F. Smith, “Competition-based induction of decision models from examples,” Mach. Learn., vol. 13, pp. 229–257, 1993.

[25] E. Noda, A. A. Freitas, and H. S. Lopes, “Discovering interesting prediction rules with a genetic algorithm,” in Proc. Congr. Evolution. Comput., Washington, DC, Jul. 1999, pp. 1322–1329.

[26] K. A. De Jong, W. M. Spears, and D. F. Gordon, “Using genetic algorithms for concept learning,” Mach. Learn., vol. 13, pp. 61–188, 1993.

[27] C. Z. Janikow, “A knowledge-intensive genetic algorithm for supervised learning,” Mach. Learn., vol. 13, pp. 189–228, 1993.

[28] D. R. Carvalho and A. A. Freitas, “A genetic algorithm for discovering small disjunct rules in data mining,” Appl. Soft Comput., vol. 2, no. 2, pp. 75–88, 2002.

[29] D. R. Carvalho and A. A. Freitas, “A hybrid decision tree/genetic algorithm method for data mining,” Info. Sci., vol. 163, no. 1–3, pp. 13–35, 2004.

[30] A. A. Freitas, “A genetic algorithm for generalized rule induction,” in Advanced in Soft Computing–Engineering Design and Manufacturing. New York: Springer-Verlag, 1999, pp. 340–353.

[31] R. Agrawal, T. Imielinski, and A. Swami, “Mining association between sets of items in large databases,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 1993, pp. 207–216.

[32] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proc. 20th Int. Conf. Very Large Databases, Santiago, Chile, 1994, pp. 487–499.

[33] A. Niimi and E. Tazaki, “Genetic programming with association rule algorithm for decision tree construction,” in Proc. 4th Int. Conf. Knowl.-Based Intell. Eng. Sys. Allied Technol., vol. 2, Brighton, U.K., 2000, pp. 746–749.

[34] A. Niimi and E. Tazaki, “Combined method of genetic programming and association rule algorithm,” Appl. Artif. Intell., vol. 15, pp. 825–842, 2001.

[35] K. C. Tan, Q. Yu, C. M. Heng, and T. H. Lee, “Evolutionary computing for knowledge discovery in medical diagnosis,” Artif. Intell. Med., vol. 27, no. 2, pp. 129–154, 2003.

[36] C. L. Blake and C. J. Merz, Department of Information and Computer Science, University of California, Irvine (2006, Jun. 9). UCI Repository of Machine Learning Databases, [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html

[37] G. S. Berenson, A. W. Voors, L. S. Webber, L. S. Dalferes, Jr., and D. W. Harsha, “Racial differences of parameters associated with blood pressure levels in children: The Bogalusa heart study,” Metabolism, vol. 28, pp. 1218–1228, 2007.

[38] “World Health Organization–International Society of Hypertension guidelines for the management of hypertension,” J. Hypertens., vol. 17, pp. 151–183, 1999.

[39] J. M. Bland, An Introduction to Medical Statistics, 2nd ed. Oxford, U.K.: Oxford Univ. Press, 1995, pp. 212–215.

[40] W. J. Conover, Practical Nonparametric Statistics, 2nd ed. New York: Wiley, 1980, pp. 278–292.

Chaochang Chiu received the B.S. degree from National Tsing Hua University, Hsinchu, Taiwan, R.O.C., in 1984, the M.S. degree from Arizona State University, Tempe, in 1988, and the Ph.D. degree in information systems from the University of Maryland, Baltimore County, in 1993.

Currently, he is a Professor of information management and an Associate Director in the R&D Division, Yuan Ze University, Chungli, Taiwan, R.O.C. His current research interests include forecasting and resources allocation optimization under uncertainty, including genetic algorithm, constraint-based reasoning, case-based reasoning, Bayesian network, and the hybrid of the above techniques, in the domains of airline, agriculture, energy, medicine, and marketing campaign. He has published several papers in journals of repute.

Kuang-Hung Hsu received the Bachelor’s degree from the College of Medicine, National Taiwan University, Taipei, Taiwan, R.O.C., in 1984, the Master’s degree, with concentration on epidemiology, from National Taiwan University in 1988, and the Ph.D. degree, with a major in toxicology and epidemiology and a minor in statistics, from the University of California at Los Angeles in 1995.

He is currently an Associate Professor of Health Care Management and Chair of the Department of Business Administration, Chang Gung University, Tao-Yuan, Taiwan, R.O.C. He is the Founder and presently the Director of the Center for Biotechnology Management at Chang Gung University. He has published more than 50 papers in refereed journals, including numerous top-tier journals in the fields of medicine, epidemiology, and management. His research interests include epidemiology, biostatistics, bioinformatics, and biotech industrial management. He is the corresponding author of this paper.

Pei-Lun Hsu received the B.S. degree in computer and information science from Soochow University, Taipei, Taiwan, R.O.C., in 1987 and the M.S. degree in computer science from Tamkang University, Taipei, in 1991.

He is currently a Lecturer in electronic engineering at Ching Yun University, Chungli, Taiwan, R.O.C. His current research includes hybrid algorithms of evolutionary computation and constraint-based optimization. He has published several papers in international journals of repute.

Chi-I Hsu received the B.A. degree in information systems from National Cheng Chi University, Taipei, Taiwan, R.O.C., in 1991, the M.S. degree in software systems engineering from George Mason University, Fairfax, VA, in 1993, and the Ph.D. degree in information systems from National Central University, Chungli, Taiwan, in 1999.

Currently, she is an Associate Professor of information management at Kai Nan University, Taoyuan, Taiwan, R.O.C. She has published several papers in international journals of repute. Her current research interests include intelligent technique applications, electronic commerce, and information systems outsourcing.

Po-Chi Lee received Bachelor’s degrees in computer science and information management from Providence University, Taiwan, R.O.C., in 1994 and in 1997, respectively, and the M.S. degree from the Department of Information Management, Yuan Ze University, Chungli, Taiwan, R.O.C., in 2003.

Since 1997, he has been a Technology Engineer with the Information Center, Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan, R.O.C. His research interests include data mining, algorithms, and medical information.

Wen-Ko Chiou received the M.B.A. and Ph.D. degrees in industrial management from National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C., in 1988 and 1994, respectively.

He is a Professor and Chairman of the Department of Industrial Design, Chang Gung University, Taoyuan, Taiwan, R.O.C.

Thu-Hua Liu received the M.S. degree in mechanical engineering from Stevens Institute of Technology, Hoboken, NJ, in 1983, and the Ph.D. degree in industrial engineering from the University of Iowa, Iowa City, in 1992.

He is a Professor and President of Mingchi University of Technology, Taipei, Taiwan, R.O.C. He is also affiliated with Chang Gung Memorial Hospital, Taipei, Taiwan, R.O.C.

Yi-Chou Chuang received the Master of Health Administration degree from the Department of Health Services Management, China Medical College, Taichung, Taiwan, R.O.C., in 1994.

Since 1988, he has served as the Director of the Administration Center, Chang Gung Memorial Hospital, Taipei, Taiwan, R.O.C. He was Dean of the College of Management, Chang Gung University, Taoyuan, Taiwan, R.O.C., in 1996, and the Director, Board of Directors, Commission on Accreditation of Hospitals and Enhancement of Health Care Quality, Taipei, Taiwan, R.O.C., in 1999. His research interests include health care management, strategy management, and decision sciences in health systems. He initiated the three-dimensional scanning project.

Chorng-Jer Hwang received the Bachelor’s degree from the College of Medicine, National Taiwan University, Taipei, Taiwan, R.O.C., in 1989, and the Master’s degree in health service from the University of California at Los Angeles in 1993.

He is currently a Lecturer of Health Care Management, Chang Gung University, Taoyuan, Taiwan, R.O.C. He also serves as Senior Administrator of Chang Gung Memorial Hospital, Taipei, Taiwan, R.O.C. His research interests include health care management, epidemiology, and decision sciences in health systems.
