Integration of GIS and Data Mining Technology to Enhance the Pavement Management Decision Making

Guoqing Zhou1; Linbing Wang2; Dong Wang3; and Scott Reichle4

Abstract: This paper presents a research effort undertaken to explore the applicability of data mining and knowledge discovery (DMKD) in combination with Geographic Information System (GIS) technology to pavement management to better decide maintenance strategies, set rehabilitation priorities, and make investment decisions. The main objective of the research is to utilize data mining techniques to find pertinent information hidden within the pavement database. The mining algorithm C5.0, including decision trees and association rules, has been used in this analysis. The selected rules have been used to predict the maintenance and rehabilitation strategy of road segments. A pavement database covering four counties within the state of North Carolina, which was provided by the North Carolina DOT (NCDOT), has been used to test this method. The rehabilitation strategy decisions proposed by the NCDOT were compared with those produced by the methodology presented in this paper. The experimental results show that the rehabilitation strategy derived in this paper differs from that proposed by the NCDOT. After combining with the AIRA Data Mining method, seven final rules are defined. Using these final rules, maps of several pavement rehabilitation strategies are created. When their numbers and locations are compared with those made by engineers at the Institute for Transportation Research and Education (ITRE) at North Carolina State University, the errors in number and location are found to vary across the different rehabilitation strategies. From the pilot experiment in the project, it can be concluded that: (1) use of the DMKD method for decisions on road maintenance and rehabilitation can greatly increase the speed of decision making, thus largely saving time and money and shortening the project period; (2) the DMKD technology can make consistent decisions about road maintenance and rehabilitation if the road conditions are similar, i.e., interference from human factors is less significant; (3) integration of the DMKD and GIS technologies provides a pavement management system with the capability to graphically display treatment decisions against distresses; and (4) the decisions related to pavement rehabilitation made by the DMKD technology are not completely consistent with those made by ITRE, so postprocessing for verification and refinement is necessary.

DOI: 10.1061/(ASCE)TE.1943-5436.0000092

CE Database subject headings: Pavement management; Data collection; Geographic information systems; Decision making.

Author keywords: Pavement management; Data mining; GIS; Pavement distress.

1Dept. of Civil Engineering and Technology, Old Dominion Univ., Kaufman Hall, Rm. 214, Norfolk, VA 23529; and Virginia Tech Transportation Institute (VTTI), Dept. of Civil and Environmental Engineering, Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061 (corresponding author). E-mail: [email protected]

2Virginia Tech Transportation Institute (VTTI), Dept. of Civil and Environmental Engineering, Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061. E-mail: [email protected]

3Virginia Tech Transportation Institute (VTTI), Dept. of Civil and Environmental Engineering, Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061. E-mail: [email protected]

4Dept. of Civil Engineering and Technology, Old Dominion Univ., Kaufman Hall, Rm. 214, Norfolk, VA 23529. E-mail: [email protected]

Note. This manuscript was submitted on July 25, 2008; approved on July 31, 2009; published online on August 3, 2009. Discussion period open until September 1, 2010; separate discussions must be submitted for individual papers. This paper is part of the Journal of Transportation Engineering, Vol. 136, No. 4, April 1, 2010. ©ASCE, ISSN 0733-947X/2010/4-332–341/$25.00.

Introduction

A pavement management system (PMS) is a valuable tool and one of the critical elements of the highway transportation infrastructure (Tsai et al. 2004; Kulkarni and Miller 2003). The earliest PMS concept can be traced back to the 1960s. With the rapid growth of advanced information technology, many investigators have successfully integrated the Geographic Information System (GIS) into PMS for storing, retrieving, analyzing, and reporting information needed to support pavement-related decision making. Such an integrated system is called a G-PMS (Lee et al. 1996). The main characteristic of a GIS is that it links data/information to its geographical location (e.g., latitude/longitude or state plane coordinates) instead of the milepost or reference-point system traditionally used in transportation. Moreover, the GIS can describe and analyze the topological relationships of the real world using the topological data structure and model (Goulias 2002; Lee et al. 1996). GIS technology is also capable of rapidly retrieving data from a database and can automatically generate customized maps to meet specific needs such as identifying maintenance locations. Therefore, a G-PMS can be enhanced with features and functionality by using a GIS to perform pavement management operations, create maps of pavement condition, provide cost analysis for the recommended maintenance strategies, and support long-term pavement budget programming.

With the increasing amount of pavement data collected, updated, and exchanged due to deteriorating road conditions, increasing traffic loading, and shrinking funds, many knowledge-based expert systems (KBESs) have been developed to solve various transportation problems (e.g., Abkowitz et al. 1990; Nassar 2007; Spring and Hummer 1995). A comprehensive survey of KBESs in transportation is summarized and discussed by Cohn and Harris (1992). However, only a few scholars have investigated applying data mining and knowledge discovery (DMKD) to PMSs. For example, Attoh-Okine (2002, 1997) presented an application of rough set theory to enhance the decision support of pavement rehabilitation and maintenance. Prechaverakul and Hadipriono (1995) applied KBES and fuzzy logic to minor rehabilitation projects in Ohio. Wang et al. (2003) discussed the decision-making problem of pavement maintenance and rehabilitation. Leu et al. (2001) investigated the applicability of data mining in the prediction of tunnel support stability using an artificial neural network algorithm. Sarasua and Jia (1995) explored an integration of GIS technology with knowledge discovery and an expert system for pavement management. Ferreira et al. (2002) explored the application of a probabilistic segment-linked pavement management optimization model. Chan et al. (1994) applied the genetic algorithm to road maintenance planning. Soibelman and Kim (2000) discussed the data preparation process for construction knowledge generation through knowledge discovery in databases, as well as construction knowledge generation and dissemination.

This paper presents a research effort undertaken to explore the applicability of data mining in combination with GIS technology to pavement management to better decide maintenance strategies, set rehabilitation priorities, and make investment decisions. A brief overview of data mining techniques as applied to pavement management is presented in “Data Mining and Inductive Learning.” Data collected from the Institute for Transportation Research and Education (ITRE) of North Carolina State University and pavement condition rating are presented in “Pavement Performance Measures.” Examples using data mining to reveal unknown patterns and rules in the database are presented in “Data Mining-Based Maintenance and Repair (M&R).” Finally, the findings and conclusions are discussed.

Data Mining and Inductive Learning

The goal of applying DMKD technology to a PMS is to discover any pertinent information in the pavement management database by creating a decision tree and/or inducing rules. Some information is stored in an obvious location of the database and thus can be obtained by a traditional database query operation, such as the maximum and minimum width of the roads. Other knowledge is hidden in a more remote area of the database (e.g., pavement condition patterns) and thus can only be obtained by intelligent technology, such as data mining. Several data mining techniques have been developed over the last few decades by the artificial intelligence community. Generally, the data mining techniques can be grouped into four categories, depending on their functionality: (1) classification; (2) clustering; (3) numeric prediction; and (4) association rules (Michalski 1983).

• Classification: this technique is designed essentially to generate predictive models by analyzing an existing database to determine categorical divisions or patterns in the database. Thus, this method focuses on identifying the characteristics of the group or class to which each record belongs in the database;

• Clustering: the purpose of this data mining technique is to group or classify items that seem to fall naturally together in the database when there is no preidentified classification or group;

• Numeric prediction: this is essentially a classification learning technique with the intended outcome of a numeric value (numeric quantity) rather than a category (discrete class). Thus, the numeric values are used for prediction; and

• Association rule: the purpose of this data mining technique is to find useful associations and/or correlation relationships among large sets of data items. Association rules, expressed in the form of “if-then” statements, show attribute value conditions that occur frequently together in a given data set. These rules are computed from the data, unlike the if-then logic rules, and thus are probabilistic in nature.

The main differences among the techniques discussed above lie in which algorithms and methods are used to extract knowledge and how the mined knowledge and discovered rules are expressed. The data mining technique used in this paper is an association/inductive learning rule, so that relevant and implicit knowledge, such as road condition distress patterns, can be extracted. Many inductive learning algorithms, which mainly come from machine learning, have been presented, such as AQ11 (Michalski and Larson 1975), AQ15 (Michalski et al. 1986; Hong et al. 1995), AQ19 (Kaufman and Michalski 1999), AE1 and AE9 (Hong et al. 1995), the concept learning system (CLS) (Hunt 1966), ID3, C4.5, and C5.0 (Quinlan 1979, 1986, 1993), and CN2 (Clark and Niblett 1989). CLS (Hunt et al. 1966) used a heuristic look-ahead method to construct decision trees. Quinlan extended the CLS method by using information content in the heuristic function; the result is called ID3. The ID3 method adopts a strategy called “divide and conquer” and selects classification attributes recursively on the basis of information entropy. Quinlan (1993) further developed his method into C4.5, which only dealt with strings of constant length. C4.5 can not only create a decision tree but also induce equivalent production rules, and it deals with multiclass problems with continuous attributes. C5.0 is an upgraded version of C4.5. This method requires training data, which usually consist of several tuples, each of which has several attributes. Note that if the records in the database are taken as tuples and the fields as attributes, the C5.0 algorithm is easily realized in a pavement database. Thus, the knowledge discovered by the C5.0 algorithm is a group of classification rules and a default class; furthermore, each rule has a confidence value (between 0 and 1). Moreover, the C5.0 method can generate a decision tree and associated decision rules.
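As a minimal illustration of the entropy-based attribute selection attributed above to ID3, the following Python sketch computes the information gain of two candidate attributes. It is a sketch under stated assumptions: the records, the field names (BK, RT, strategy), and the class labels are hypothetical and are not taken from the NCDOT database, and the code shows plain ID3-style information gain, not the C5.0 implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting `records` on `attribute`."""
    base = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical records: two distress ratings (N/L/M/S) and a strategy label.
records = [
    {"BK": "N", "RT": "N", "strategy": "Nothing"},
    {"BK": "N", "RT": "L", "strategy": "Nothing"},
    {"BK": "S", "RT": "N", "strategy": "PM2"},
    {"BK": "S", "RT": "L", "strategy": "PM2"},
]

# The attribute with the largest gain would be chosen for the next split.
best = max(["BK", "RT"], key=lambda a: information_gain(records, a, "strategy"))
```

In this toy data set the block-cracking rating separates the two classes completely, so it is the attribute selected for the split.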

Decision Tree

The inductive learning C5.0 method is a top-down induction algorithm that is capable of generating a decision tree model to classify data in a database. The decision tree partitions the data at each level based on a particular feature of the data. The processes of the C5.0 algorithm are: (1) a decision tree is built by choosing an attribute that best separates the classes in the database and (2) a pruning strategy is used to prune a branch when the error introduced is within one standard error of the existing error. The implementation of the C5.0 algorithm requires recursive iteration over the partitions until each partition contains only a single class or becomes too small. With such recursive iteration, the attributes with higher partitioning ability lie toward the root, and the final classification occurs at the leaves of the decision tree. At each iteration, the data are fed into a filtering method, which reduces the number of features. In this project, the decision tree categorizes each road segment according to the maintenance and rehabilitation strategy it is likely to require.
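The recursive partitioning described above can be sketched in a few lines of Python. This is an illustrative divide-and-conquer tree builder that stops when a partition holds a single class or becomes too small; it omits pruning and feature filtering and is not the C5.0 implementation. The records, field names, and the `min_size` parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def build_tree(records, attributes, target, min_size=2):
    """Recursively partition records; return a nested dict or a class label (leaf)."""
    labels = [r[target] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the partition is pure, too small, or no attributes remain.
    if len(set(labels)) == 1 or len(records) < min_size or not attributes:
        return majority

    def remainder(attr):
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r[target])
        return sum(len(g) / len(records) * entropy(g) for g in groups.values())

    # Lowest remaining entropy is equivalent to highest information gain.
    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]
        branches[value] = build_tree(subset, rest, target, min_size)
    return {best: branches}


def classify(tree, record, default=None):
    """Walk the tree from the root to a leaf for one record."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(record[attr], default)
    return tree


# Hypothetical training records (not from the NCDOT database).
records = [
    {"BK": "N", "RT": "N", "strategy": "Nothing"},
    {"BK": "N", "RT": "S", "strategy": "FDP"},
    {"BK": "S", "RT": "N", "strategy": "PM2"},
    {"BK": "S", "RT": "S", "strategy": "PM2"},
]
tree = build_tree(records, ["BK", "RT"], "strategy")
```

The attribute with the greatest partitioning ability (here BK) ends up at the root, and the class decisions sit at the leaves, which mirrors the structure described in the text.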


Association Rule

In addition to the decision tree generation, the C5.0 algorithm also induces association rules. An association rule differs from a classification rule because the association rule can be used to predict any attribute (not just the group or class) and more than one attribute's value at a time. An association rule also gives an occurrence relationship among factors. A typical association rule is an if-then sentence, i.e.,

IF Cause_1, AND/OR Cause_2

THEN Result (or consequence)

This rule means “if Cause_1 and/or Cause_2 are true, then Result (the association rule) is generated in x% of cases with a confidence of x%.” With such a rule, a piece of knowledge can be discovered from the data set. The rules induced by the C5.0 algorithm are also accompanied by support, confidence level, and capture, using statistical tests to correct the decision tree model (Chae et al. 2001).

• Support: this measures the percentage of the training data that supports the rule, i.e., how many records satisfy the left-hand side of the rule. The support is the number of cases in which the rule is found and a pattern is defined by the rule. Once this pattern is identified, it can be used to predict similar patterns in the database;

• Confidence: the confidence indicates the probability of a certain rule. It can be used to measure the accuracy of the rule; and

• Capture: this is used to measure the percentage of records that are correctly captured by this rule. For example, if the capture of a rule is close to 100%, it means that all observations with this class are closely clustered.

In this paper, the initial association rules are induced by using the C5.0 algorithm. The final rules are obtained by postprocessing the initial rules. The attributes for deductive learning are the same as those in inductive learning except for the class label attribute. The final rules are used to identify the occurrence relationship between pavement rehabilitation and various road distress conditions.
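The three rule measures defined above can be made concrete with a short Python sketch. It adopts one common reading of the definitions, stated here as an assumption: support is the share of all records that satisfy the IF part, confidence is the share of those records for which the THEN part also holds, and capture is the share of records of the result class that the rule covers. The records, the field names, and the example rule are hypothetical.

```python
def rule_metrics(records, condition, target, result):
    """Return (support, confidence, capture) for 'IF condition THEN target == result'."""
    matched = [r for r in records if condition(r)]          # left-hand side holds
    correct = [r for r in matched if r[target] == result]   # both sides hold
    in_class = [r for r in records if r[target] == result]  # all records of the class
    support = len(matched) / len(records)
    confidence = len(correct) / len(matched) if matched else 0.0
    capture = len(correct) / len(in_class) if in_class else 0.0
    return support, confidence, capture


# Hypothetical records: alligator-severe score (0-10), rutting rating, strategy.
records = [
    {"AS": 6, "RT": "S", "strategy": "FDP"},
    {"AS": 5, "RT": "M", "strategy": "FDP"},
    {"AS": 4, "RT": "S", "strategy": "PM2"},
    {"AS": 0, "RT": "N", "strategy": "Nothing"},
    {"AS": 0, "RT": "L", "strategy": "FDP"},
]


def severe_alligator(record):
    """Hypothetical IF part: at least 40% of the section is rated severe."""
    return record["AS"] >= 4


support, confidence, capture = rule_metrics(
    records, severe_alligator, "strategy", "FDP"
)
```

For this toy data the rule matches three of five records (support 0.6), is right for two of those three (confidence 2/3), and covers two of the three full-depth patch records (capture 2/3).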

Pavement Performance Measures

Data Sources

In 1983, the ITRE of North Carolina State University began working with the Division of Highways of the North Carolina DOT (NCDOT) to develop and implement a PMS for its 60,000 mi of paved state highways. At the request of several municipalities, the NCDOT has made the PMS available to North Carolina municipalities. The ITRE modified the system for municipal streets and has used it in more than 100 municipalities in North and South Carolina. The data for this experiment were provided by the ITRE, which conducted a pavement distress survey for several counties in North Carolina in January of 2007 to determine whether pavement rehabilitation was necessary. Field data collection was performed by a two-person crew comprising a “rater” and a “driver.” The rater performed the visual evaluation/inspection for each section of roadway and entered all field-collected attributes to the corresponding line feature (by using a customized ArcPad software application installed on a laptop/pentop computer). The driver handled field measurement of pavement width, as necessary. All pavement distress data were identified and quantified by means of visual inspection (windshield survey performed by trained staff). A measuring wheel was used as necessary.

The survey assessment was performed following the guidelines provided in the Pavement Condition Rating Manual (AASHTO 2001, 1999). The 1,285 records collected for use in this project were from a network-level survey covering several county roads, including U.S. Route 1. The pavement database provided by the ITRE is a spatial-based relational database (i.e., an ArcGIS software compatible database). In this database, 89 fields, including geospatial data (e.g., X, Y coordinates, central line, width of lane, etc.), pavement condition data (e.g., cracking, rutting, etc.), traffic data (e.g., shoulder, lane number, etc.), and economic data (e.g., initial cost, total cost), are recorded by engineers. They conducted these surveys by walking or driving along the roads and recording the pertinent information and the corresponding maintenance and repair (M&R) strategy. Then, the data are integrated into the database, as shown in Fig. 1. The first through 19th columns record road name, type, class, and owner; the 31st through 43rd columns record the pavement condition (distress); the 44th through 50th columns record the different types of cost information; the other columns include proposed activities and other related information. Following the suggestions of experts at the ITRE, nine common types of distresses are considered to assess the necessity for road rehabilitation (Table 1). The rating was determined by visual evaluation/inspection at each section of roadway.

Distress Rating

All the analyses and pavement condition evaluations presented in this paper are based on the pavement performance measures. A commonly accepted pavement performance measure is the pavement condition index (PCI), which was first defined by the U.S. Army. Under the PCI, the pavement condition is related to factors such as structural integrity, structural capacity, roughness, skid resistance, and rate of distress. These factors are quantified in the evaluation worksheet that field inspectors used to assess and express the local pavement condition and severity of damage. Note that inspectors primarily relied on their own judgment to assess the distress conditions. Usually, the PCI is quantified into seven levels, ranging from Excellent (over 85) to Failed (0).

Table 1 presents the nine types of distresses that are evaluated for asphaltic concrete pavements. The severity of distresses in Table 1 is rated in four categories ranging from very slight to very severe. Extent (or density) is also classified in five categories ranging from few (less than 10%) to throughout (more than 80%). The identification and description of the types of distress, severity, and density are as follows.

• The road conditions of alligator cracking are rated as a percentage of the section that falls under the categories of none, light, moderate, and severe. Percentages are shown as 1=10%, 2=20%, 3=30%, up to 10=100%. The appropriate percentages were placed under none, light, moderate, and severe. These percentages should always add up to 100%;

• The severity levels of distresses for block cracking, transverse cracking, bleeding, rutting, utility cut patching, patching deterioration, and raveling are rated at four levels: none (N), light (L), moderate (M), and severe (S), respectively; and

• The severity levels of ride quality are classified as: average (L), slightly rough (M), and rough (S).


Potential Rehabilitation Strategies

Based on its knowledge obtained in the field, the ITRE has classified flexible pavement rehabilitation needs into three main categories according to the type of problem to be corrected: (1) cracking; (2) surface defect problems; and (3) structural problems. These problems can be treated using crack treatment, surface treatment, and nonstructural overlay (one- and two-course overlay), respectively. To select an appropriate treatment for rehabilitation and maintenance, seven potential rehabilitation and maintenance strategies have been proposed by the NCDOT (see Table 2). The treatment strategy to be carried out for a pavement segment is traditionally determined by ITRE engineers, who comprehensively evaluate all types of distresses. This paper seeks to replace this decision-making process for rehabilitation strategies with data mining technology. Thus, one of the seven potential rehabilitation strategies will be determined based upon the association rules generated by the data mining technology.

Fig. 1. Spatial-based pavement database provided by NCDOT

Table 1. Nine Common Types of Distresses to Be Used in This Study

Distress                                  Rating
Alligator cracking (four types of         Percentages of 1=10%, 2=20%, 3=30%, up to
rates are given): alligator none (AN),    10=100% indicate none, light, moderate,
alligator light (AL), alligator           and severe, respectively
moderate (AM), alligator severe (AS)
Block/transverse cracking (BK)            This indicates the overall condition of the
                                          section as follows: N = none; L = light;
                                          M = moderate; and S = severe
Reflective cracking (RF)                  The same rating as BK's
Rutting (RT)                              The same rating as BK's
Raveling (RV)                             The same rating as BK's
Bleeding (BL)                             The same rating as BK's
Patching (PA)                             The same rating as BK's
Utility cut patching                      The same rating as BK's
Ride quality (RQ)                         The condition is designated as follows:
                                          L = average; M = slightly rough; and
                                          S = rough


Data Mining-Based Maintenance and Repair (M&R)

Decision Trees of M&R

Implementation of data mining-based maintenance and rehabilitation usually involves the following five basic steps: (1) problem identification; (2) knowledge acquisition; (3) knowledge representation; (4) implementation; and (5) validation and extension. The first step, problem identification, has already been discussed: it is to reveal and extract the relevant information hidden in the pavement database to predict potential pavement conditions and plan the rehabilitation. As to the second step, a significant amount of knowledge has been obtained from the ITRE in the field, such as distress ratings, PCI ratings, and potential rehabilitation strategies. As for the third step, the decision tree and induced rules derived by the C5.0 algorithm constitute the knowledge representation. The theory behind the decision tree and inductive rules has been described earlier (see “Data Mining and Inductive Learning”). The experimental results presented in this section constitute the fourth step, implementation. Finally, the fifth step, validation, is discussed as well.

ExperimentsThe CTree for Excel tool �Saha 2003� was used to create thedecision tree. This tool is based on the C5.0 algorithm, which letsuser build a tree-based classification model. The classification treecan generate the association rules. The basic steps are as follows.

Step 1: Load Pavement Database. It is first loaded the pave-ment database into the data worksheet, in which the observationswere placed in rows and the variables were placed in columns. Inaddition, for each column in the data worksheet, it is chosen anappropriate type, i.e., Omit, Class, Cont, and Cat. The implica-tions of these types are:• Omit=to drop a column from model;• Cat=to treat a column as categorical predictor;• Cont=to treat a column as continuous predictor; and• Output=to treat a column as class variable.

Because the Cont type in CTree recognizes only numeric values, whereas the distresses in Table 1 were recorded with the letters N, L, M, and S, these letters have to be quantified as 100, 75, 50, and 25, respectively. The CTree tool allows at most 50 predictor variables and only one class variable. The application treats the class variable as categorical, and each categorical predictor (including the class variable) is limited to 20 categories. When the data are loaded into the software, the class variable and the data must not contain blank rows or blank columns, because these are treated as missing values in the CTree tool.

Table 2. Potential Rehabilitation Strategies Proposed by Engineers of NCDOT

ID    Rehabilitation strategies
0     Nothing
1     Crack pouring (CP)
2     Full-depth patch (FDP)
3     1-in. plant mix resurfacing (PM1)
5     2-in. plant mix resurfacing (PM2)
6     Skin patch (SKP)
7     Short overlay (SO)
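The quantification of severity letters described in Step 1 can be illustrated with a minimal sketch. It assumes that the letters N, L, M, and S map to 100, 75, 50, and 25 in that order; the function names and the example field names (AC, RT) are hypothetical and are not taken from the pavement database or the CTree tool.

```python
# Assumed mapping: the text lists the letters N, L, M, S and the numbers
# 100, 75, 50, 25 in this order; the pairing itself is an assumption.
SEVERITY_SCORE = {"N": 100, "L": 75, "M": 50, "S": 25}


def quantify_severity(rating: str) -> int:
    """Convert one letter severity rating to its numeric score."""
    key = rating.strip().upper()
    if key not in SEVERITY_SCORE:
        raise ValueError(f"unknown severity rating: {rating!r}")
    return SEVERITY_SCORE[key]


def quantify_record(record: dict) -> dict:
    """Convert every distress rating in one road-segment record."""
    return {name: quantify_severity(value) for name, value in record.items()}


# Hypothetical record with two distress fields.
print(quantify_record({"AC": "M", "RT": "S"}))
```

Rejecting unknown letters instead of silently skipping them keeps blank or mistyped cells from turning into missing values later in the workflow.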


Step 2: Fill-Up Model Inputs. In addition, the tool requires some input parameters to optimize the process of decision tree generation. These parameters include:
1. Adjustment for the number of categories of a categorical predictor: while the tree is growing, child nodes are created by splitting parent nodes. The predictor used for each split is selected by a certain criterion. Because this criterion has an inherent bias toward choosing predictors with more categories, an adjustment factor is input to correct this bias;
2. Minimum node size criterion: while the tree is growing, the decision as to whether to stop splitting a node and declare it a leaf node is determined by criteria that need to be chosen. These criteria are:
• Minimum node size: a valid minimum node size is between 0 and 100;
• Maximum purity: an effective value is between 0 and 100. The higher the value, the larger the tree will be. Stop splitting a node if its purity is 95% or more (e.g., purity is 90% means). Also, stop splitting a node if the number of records in that node is 1% or less of the total number of records; and
• Maximum depth: a valid maximum depth is greater than 1 and less than 20. The higher the value, the larger the tree will be. Stop splitting a node if its depth is 6 or more (the depth of the root node is 1; any other node’s depth is its parent’s depth +1).
3. Pruning option: this option determines whether or not to prune the tree while it is growing, which helps in studying the effect of pruning; and
4. Training and test data: in this research, a subset of the data is used to build the model and the rest is used to study the performance of the model. The tool is set to randomly select the test set at a ratio of 10%.
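The stopping criteria listed in Step 2 can be sketched as a single check. This is an illustration only, not the CTree implementation. It assumes that purity means the percentage of records in a node that belong to the node's most common class, and it uses the thresholds quoted above (purity of 95% or more, node size of 1% or less of all records, depth of 6 or more). All function and parameter names are hypothetical.

```python
from collections import Counter


def node_purity(labels):
    """Percentage of records in the node that belong to its most common class.

    This definition of purity is an assumption made for the sketch.
    """
    if not labels:
        return 100.0
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 100.0 * most_common_count / len(labels)


def should_stop_splitting(labels, depth, total_records,
                          max_purity=95.0, min_node_pct=1.0, max_depth=6):
    """Return True if any of the three stopping criteria is met."""
    if node_purity(labels) >= max_purity:
        return True  # the node is already pure enough
    if 100.0 * len(labels) / total_records <= min_node_pct:
        return True  # the node holds too few records
    if depth >= max_depth:
        return True  # the tree is deep enough (root depth is 1)
    return False


# Hypothetical node: 10 "NO" and 10 "CP" records at depth 3 out of 459 records.
print(should_stop_splitting(["NO"] * 10 + ["CP"] * 10, depth=3, total_records=459))
```

Checking the criteria independently makes it easy to see which setting is responsible when a tree grows larger or smaller than expected.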

Experimental Results

With the aforementioned data input, a decision tree is generated. The parameters of the decision tree are listed in Table 3. As seen from Table 3, the total number of nodes is 72, of which 37 are leaf nodes. The details of each node can be obtained from the NodeView sheet, which shows the class distribution at each node. Moreover, the decision tree consists of 20 levels. In this test, the number of training observations is 555, the number of test observations is 510, and the number of predictors is 9. The percentage of misclassified data reaches 61.2% for the training data and 60.0% for the test data. The counts for the different types of rehabilitation strategies in the study area, including nothing, crack pouring (CP), full-depth patch (FDP), 1-in. plant mix resurfacing (PM1), 2-in. plant mix resurfacing (PM2), skin patch (SKP), and short overlay (SO), are calculated for the training data and the test data (Table 4).

Table 3. Decision Tree Model

Tree information                 Percentage for misclassified data
Total number of nodes    72      On training data    61.2%
Number of leaf nodes     37      On test data        60.0%
Number of levels         20

In addition, the C5.0 algorithm also generates rules. These association rules identify the rehabilitation strategies that will be carried out if the conditions are met (see Fig. 2). Because only seven rehabilitation strategies in the study area were suggested by the ITRE at the NCDOT, and the CTree software tool generated too many rules, the rules generated by the C5.0 algorithm are clustered manually. The clustering method is to merge the similar rules in each leaf and node; as a result, their support, confidence, and capture are merged as well. Finally, seven rules, together with their support, confidence, and capture, are listed in Table 5. As observed from Table 5, the confidence for the CP and PM2 rehabilitation strategies is 100%, and the support for the nothing (NO) class is 100% as well. In addition, the confidence and capture for the FDP rehabilitation strategy are the lowest (55.6% and 66.2%, respectively). This means that the decision made for FDP rehabilitation has a low reliability.
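As an illustration of how the three rule measures could be computed, the sketch below uses one common set of definitions, which is an assumption here and may differ from the definitions used by the CTree tool: support is the share of all records that match the rule condition, confidence is the share of matching records that carry the predicted class, and capture is the share of records of that class that the rule matches. The record layout, the field name "strategy", and the example rule are hypothetical.

```python
def rule_metrics(records, condition, predicted_class, class_key="strategy"):
    """Compute support, confidence, and capture (in percent) for one rule.

    The definitions are assumptions for this sketch:
      support    = matched records / all records
      confidence = correctly matched records / matched records
      capture    = correctly matched records / records of the predicted class
    """
    matched = [r for r in records if condition(r)]
    correct = [r for r in matched if r[class_key] == predicted_class]
    in_class = [r for r in records if r[class_key] == predicted_class]

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "support": pct(len(matched), len(records)),
        "confidence": pct(len(correct), len(matched)),
        "capture": pct(len(correct), len(in_class)),
    }


# Hypothetical records and rule: "IF AC <= 25 THEN FDP".
records = [
    {"AC": 25, "strategy": "FDP"},
    {"AC": 25, "strategy": "NO"},
    {"AC": 100, "strategy": "NO"},
    {"AC": 50, "strategy": "FDP"},
]
print(rule_metrics(records, lambda r: r["AC"] <= 25, "FDP"))
```

Under these definitions, a rule with high confidence but low capture is reliable where it fires but misses many segments of its class, which is one way to read the differences among the rows of Table 5.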

M&R Rules Verification

The aforementioned decision tree theoretically organizes the “hidden” knowledge in a logical order. Determining whether the tree provides a useful methodology for selecting a feasible and effective treatment strategy for rehabilitation (from the seven predetermined strategies) requires this prototype set of rules to be tested and validated, and then modified or extended if necessary. In the study area, seven rules have been created to handle operations involved in the spatial knowledge of the PMS. Upon carefully checking the rules, it was determined that these rules are not completely correct, and some of them are redundant. For this reason, the AIRA tool for Excel v1.3.3 is used to verify these rules. This tool is an add-in for MS Excel that allows the user to extract the “hidden information” (i.e., discover rules) directly from spreadsheets for small to midrange database files. With the same data set and the AIRA tool, 41 rules are generated, part of which are listed in Fig. 2. Obviously, many of these rules will result in misclassification, and thus the number of rules has to be reduced. To this end, the following schemes are adopted in this research:
1. If the attribute values simultaneously match the conditions of rules induced by both the decision tree and AIRA reasoning, the rules will be retained;
2. If the attribute values simultaneously match the conditions of several rules, those rules with the maximum confidence will be kept;
3. If the attribute values simultaneously match several rules with the same confidence values, those rules with the maximum coverage of learning samples will be kept; and
4. If the attribute values do not match any rule, this class of attributes is assigned the rule of nothing treatment.

Table 4. Number for Different Types of Rehabilitation Strategies for the Training Data and Test Data

Rehabilitation strategies    Training data    Test data
Nothing                      417              589
CP                           12               23
FDP                          13               3
PM1                          3                8
PM2                          2                5
SKP                          8                3
SO                           4                4
Total                        459              640

Fig. 2. Rules induced by AIRA algorithm
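The four reduction schemes above can be sketched as a small selection routine. This is one possible reading of the schemes, not the procedure as implemented in the research: rules confirmed by both the decision tree and AIRA are preferred when any of them match, the highest confidence wins, ties are broken by coverage, and the nothing treatment is the default. The Rule class, its fields, and the example rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    strategy: str                                # predicted rehabilitation strategy
    condition: Callable[[Dict[str, int]], bool]  # True if the segment matches
    confidence: float                            # confidence in percent
    coverage: int                                # learning samples covered
    in_both: bool = False                        # induced by both tree and AIRA


def select_strategy(segment: Dict[str, int], rules: List[Rule],
                    default: str = "Nothing") -> str:
    """Pick one strategy for a segment following Schemes 1 to 4."""
    matched = [r for r in rules if r.condition(segment)]
    if not matched:
        return default                           # Scheme 4: no rule matches
    confirmed = [r for r in matched if r.in_both]
    candidates = confirmed or matched            # Scheme 1: prefer confirmed rules
    # Schemes 2 and 3: highest confidence, then highest coverage.
    best = max(candidates, key=lambda r: (r.confidence, r.coverage))
    return best.strategy


# Hypothetical rules on quantified distress scores.
rules = [
    Rule("FDP", lambda s: s["AC"] <= 50, 55.6, 13),
    Rule("SKP", lambda s: s["AC"] <= 75, 66.7, 8),
    Rule("SO", lambda s: s["RT"] <= 25, 66.7, 4),
]
print(select_strategy({"AC": 50, "RT": 25}, rules))
```

Sorting on the pair (confidence, coverage) expresses Schemes 2 and 3 in one step, because coverage is consulted only when confidence values are equal.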

With the schemes proposed earlier for rule reduction, 14 rules were still retained. However, only seven rehabilitation strategies in the study area were suggested by the ITRE. Thus, the following method is suggested to further reduce the number of rules:
1. Reduce the attribute data sets of the alligator cracking family through

AC = {(AN), (AL), (AM), (AS)}

2. Reduce the attribute data by checking the PCI values. The principles are:
a. If AC=M or S, reduce other attributes; and
b. If RT=S, reduce other attributes.

With the aforementioned reduction methods, the rules are finally refined to seven (see Fig. 3).

Mapping of Data Mining-Based Decision of M&R

With the rules induced earlier, the rehabilitation strategies can be predicted and decided for each road segment in the database. However, the operation using DMKD occurs only in the database, and thus the results cannot be visualized and displayed on either a map or a screen. A GIS, by contrast, is capable of rapidly retrieving data from the database and automatically generating customized maps to meet specific needs such as identifying maintenance locations, visualizing spatial and nonspatial data, and linking data/information to its geographical location. Thus, this paper employed ArcGIS software in combination with the earlier induced results to create the decision-making map for maintenance and rehabilitation. The basic operation consists of taking each of the above rules as a logic query in the ArcGIS software and then displaying the queried results in the ArcGIS layout map. The results are shown in Figs. 4–9, one for each of the preset rehabilitation strategies. The rehabilitations suggested by the ITRE of North Carolina State University are superimposed on the decisions made under the analysis presented in this paper. As seen from Figs. 4–9, each rehabilitation strategy derived in this paper can be located by its geographical coordinates and visualized with its spatial and nonspatial data in different colors.

Table 5. Support, Confidence, and Capture for Each Generated Rule

Rule ID    Classes    Support    Confidence    Capture
0          NO         100.0%     86.7%         93.0%
1          CP         60.7%      100.0%        75.6%
2          SKP        60.5%      66.7%         82.5%
3          FDP        71.6%      55.6%         66.2%
4          PM2        80.2%      100.0%        73.1%
5          PM1        81.3%      71.4%         85.3%
6          SO         81.6%      66.7%         73.5%
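To illustrate what taking a rule as a logic query could look like, the sketch below turns the conditions of one rule into an SQL-style selection expression of the kind that GIS attribute queries generally accept. It is an assumption-laden example: the helper name, the field names (AC, RT), and the threshold values are hypothetical and are not the rules of Fig. 3, and no ArcGIS function is called.

```python
ALLOWED_OPERATORS = {"=", "<>", "<", "<=", ">", ">="}


def rule_to_where_clause(conditions):
    """Join (field, operator, value) conditions of one rule with AND.

    Only numeric values are handled, which matches the quantified
    distress scores; field names are wrapped in double quotes.
    """
    parts = []
    for field, operator, value in conditions:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        if not isinstance(value, (int, float)):
            raise TypeError("only numeric values are supported in this sketch")
        parts.append(f'"{field}" {operator} {value}')
    return " AND ".join(parts)


# Hypothetical rule: alligator cracking at or below 50 and rutting equal to 25.
print(rule_to_where_clause([("AC", "<=", 50), ("RT", "=", 25)]))
# prints: "AC" <= 50 AND "RT" = 25
```

Keeping each rule as a list of conditions means the same rule definition can drive both the database prediction and the map query, so the two cannot drift apart.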

Comparison Analysis and Discussion

Comparison Analysis

To verify the results of the proposed method for decision making in road maintenance and rehabilitation, the pavement segment treatments recommended in this paper were compared with those suggested by the ITRE at the NCDOT. All pavement treatments derived by this paper and by NCDOT are displayed in Fig. 10. A comparison of both the numbers and the locations of each treatment strategy derived by this paper and by NCDOT in the study area is listed in Table 6. As seen from Table 6, the number of crack pouring (CP) treatments derived by this paper and by NCDOT is the same, i.e., three, but the locations of the three roads are not all the same: the location of one road derived by the method presented in this paper differs from that derived by NCDOT. The number of full-depth patch (FDP) treatments suggested by NCDOT is 34, versus 29 under the method employed in this paper, a difference of five; moreover, the locations of three road segments for FDP treatment differ between the two methods. For the 1-in. plant mix resurfacing (PM1) strategy, NCDOT suggested six roads for treatment, whereas this paper indicates seven roads using data mining, and one road location differs between the two methods. For the 2-in. plant mix resurfacing (PM2) strategy, Table 6 lists three roads for NCDOT and four for this paper, with one road location differing. For the skin patch (SKP) rehabilitation strategy, 65 roads are suggested by NCDOT for treatment, but 56 roads are identified using data mining, and 13 road locations differ between the two methods. For the short overlay (SO) rehabilitation strategy, three roads are suggested by NCDOT for treatment, but five roads are recommended by the method employed in this paper, and the locations of two roads differ between the two methods.

Fig. 3. Final rules after verification and postprocessing

Fig. 4. Comparison for the CP road rehabilitation made by the proposed method and by NCDOT (ITRE)

Fig. 5. Comparison for the FDP road rehabilitation made by the proposed method and by NCDOT (ITRE)

Fig. 6. Comparison for the PM1 road rehabilitation made by the proposed method and by NCDOT (ITRE)
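The two kinds of difference compared above can be sketched with set operations. The sketch assumes that the difference in number is the absolute difference between the two counts and that the difference in location is the number of segments selected by the data mining method that are not in the NCDOT selection; the text does not spell out these definitions, so both are assumptions. The segment identifiers are hypothetical.

```python
def compare_treatment_sets(reference_ids, predicted_ids):
    """Compare two selections of road segments for one treatment strategy.

    reference_ids: segments suggested by the reference (e.g., NCDOT)
    predicted_ids: segments selected by the data mining rules
    """
    reference = set(reference_ids)
    predicted = set(predicted_ids)
    return {
        "reference_count": len(reference),
        "predicted_count": len(predicted),
        "difference_in_number": abs(len(reference) - len(predicted)),
        # Assumed definition: predicted segments absent from the reference.
        "difference_in_location": len(predicted - reference),
    }


# Hypothetical segment IDs arranged to mirror the crack pouring row:
# three segments each, one of which is at a different location.
print(compare_treatment_sets({"R1", "R2", "R3"}, {"R1", "R2", "R9"}))
```

With this reading, equal counts do not imply agreement, which is the situation reported for crack pouring.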

Discussion

The decision trees are based on the knowledge acquired from pavement management engineers for rehabilitation strategy selection. A decision tree is used to organize the obtained knowledge in a logical order. Thus, the decision trees can determine a technically feasible rehabilitation strategy for each road segment. Different decision trees can be built to reflect changes. For example, the decision trees in this paper were based on the severity levels of individual distresses. If the pavement layer thickness and material type, or the work history, pavement type, and ride data, are taken as the knowledge for generating decision trees, the resulting decision trees will be different. This means that the association rules generated from different knowledge are different as well.

Fig. 7. Comparison for the PM2 road rehabilitation made by the proposed method and by NCDOT (ITRE)

Fig. 8. Comparison for the SKP road rehabilitation made by the proposed method and by NCDOT (ITRE)


On the other hand, our experiment shows that the rules induced from the decision tree are probably inexact, so verification of the rules is needed. This project, in fact, tested different tools (e.g., SPSS, CART, RSES, CHAID) in an effort to generate an “optimum” or “ideal” decision tree and rules. Unfortunately, the decision trees and rules generated by these tools are not completely the same. Therefore, postprocessing for verification of the rules is still needed. Through the project, it has been realized that:
1. The decision tree and rules induced by the data mining method are not completely exact, i.e., the DMKD method is not as exact as individual inspection and observation. Therefore, postprocessing for refining the rules is still essential; and
2. The advantages of applying the DMKD method for road maintenance and rehabilitation are: (a) it can greatly increase the speed of choosing a rehabilitation strategy, thus saving considerable time and money and shortening the project period; and (b) road segments for which the DMKD method proposes the same treatment at the road network level have similar pavement conditions; thus, this method avoids human factors such as engineers’ visual assessment of distress data.

Fig. 9. Comparison for the SO road rehabilitation made by the proposed method and by NCDOT (ITRE)

Fig. 10. All decisions for the road rehabilitation made by the proposed method and by NCDOT

Conclusions

This paper presents research and analysis on applying data mining technology in combination with GIS to pavement management. The main purpose of the research is to utilize data mining techniques to find useful knowledge hidden in the pavement database. The C5.0 algorithm has been employed to generate decision trees and association rules. The induced rules have been used to predict which maintenance and rehabilitation strategy should be selected for each road segment. A pavement database covering four counties in the state of North Carolina, which was provided by the ITRE at NCDOT, has been used to test the proposed method. The rehabilitation treatments suggested by NCDOT have been compared with those derived by the methodology presented in this paper. From the experimental results, it was found that the rehabilitation strategies derived by the rules, i.e., by the C5.0 method, are different from those suggested by NCDOT. After combining other technologies, e.g., the AIRA method, and postprocessing, seven rules are finally refined. Using the final rules, maps of the different types of pavement rehabilitation strategies are created using ArcGIS v. 9.1. When compared with the results from NCDOT, the number and location of the suggested road rehabilitations are different. The maximum error for the number of suggested road rehabilitations is nine, and the maximum error for the location is 13 out of 65 (see Table 6).

Through this project, it has been concluded that:
1. The DMKD technology can make consistent decisions regarding road maintenance and rehabilitation for the same road conditions, with less interference from human factors;
2. The DMKD technology can greatly increase the speed of decision making because the technology automatically generates a decision tree and rules once the expert knowledge is given. This can save significant time and money for the PMS;
3. Integration of the DMKD and GIS can provide the PMS with the capabilities of graphically displaying treatment decisions, visualizing the attribute and nonattribute data, and linking data and information to geographical coordinates; and
4. The DMKD method is not quite as intelligent as individual observation. In other words, the decisions and rules determined by DMKD are not exact, and thus postprocessing for verification and refinement is necessary.

Table 6. Comparison Analysis for Each Treatment Strategy Made by the Writers and NCDOT

ID    Proposed treatment strategies        From          Number    Difference in number    Difference in location
1     Crack pouring (CP)                   NCDOT         3         0                       1
                                           This paper    3
2     Full-depth patch (FDP)               NCDOT         34        5                       3
                                           This paper    29
3     1-in. plant mix resurfacing (PM1)    NCDOT         6         1                       1
                                           This paper    7
4     2-in. plant mix resurfacing (PM2)    NCDOT         3         1                       1
                                           This paper    4
5     Skin patch (SKP)                     NCDOT         65        9                       13
                                           This paper    56
6     Short overlay (SO)                   NCDOT         3         2                       2
                                           This paper    5

Acknowledgments

The pavement database was provided by Greg Ferrara, GIS Program Manager at the Institute for Transportation Research and Education (ITRE) of North Carolina State University. John Roklevi at the ITRE provided essential help in data delivery, data interpretation, and explanation related to the metadata. The writers sincerely thank both of them. The writers also thank the project administrators for granting permission to use their data.

References

AASHTO. (1999). AASHTO guidelines for pavement management system, AASHTO, Washington, D.C.

AASHTO. (2001). Pavement management guide, AASHTO, Washington, D.C.

Abkowitz, M., Walsh, S., Hauser, E., and Minor, L. (1990). “Adaptation of geographic information systems to highway management.” J. Transp. Eng., 116(3), 310–327.

Attoh-Okine, N. O. (1997). “Rough set application to data mining principles in pavement management database.” J. Comput. Civ. Eng., 11(4), 231–237.

Attoh-Okine, N. O. (2002). “Combining use of rough set and artificial neural networks in doweled-pavement-performance modeling—A hybrid approach.” J. Transp. Eng., 128(3), 270–275.

Chae, Y. M., Ho, S. H., Cho, K. W., Lee, D. H., and Ji, S. H. (2001). “Data mining approach to policy analysis in a health insurance domain.” Int. J. Med. Inf., 62, 103–111.

Chan, W. T., Fwa, T. F., and Tan, C. Y. (1994). “Road maintenance planning using genetic algorithms. I: Formulation.” J. Transp. Eng., 120(5), 693–709.

Clark, P., and Niblett, T. (1989). “The CN2 induction algorithm.” Mach. Learn., 3, 261–283.

Cohn, L. F., and Harris, R. A. (1992). Knowledge-based expert system in transportation. NCHRP synthesis 183, TRB, National Research Council, Washington, D.C.

Ferreira, A., Antunes, A., and Picado-Santos, L. (2002). “Probabilistic segment-linked pavement management optimization model.” J. Transp. Eng., 128(6), 568–577.

Goulias, D. G. (2002). “Management systems and spatial data analysis in transportation and highway engineering.” Proc., Management Information Systems, 2002: Incorporating GIS and Remote Sensing, Vol. 2002, Univ. of Illinois at Urbana-Champaign, Urbana, Ill., 321–327.

Hong, J., Mozetic, I., and Michalski, R. S. (1995). “AQ15: Incremental learning of attribute-based descriptions from examples, the method and user’s guide.” Rep. Prepared for Intelligent Systems Group, Univ. of Illinois at Urbana-Champaign, Urbana, Ill.

Hunt, E. B., Marin, J., and Stone, P. T. (1966). Experiments in induction, Academic, San Diego.

Kaufman, K. A., and Michalski, R. S. (1999). “Learning in an inconsistent world: Rule selection in AQ19.” Rep. No. MLI 99-2, George Mason Univ., Fairfax, Va.

Kulkarni, R. B., and Miller, R. W. (2003). “Pavement management systems: Past, present, and future.” Transp. Res. Rec., 1853, 65–71.

Lee, H. N., Jitprasithsiri, S., Lee, H., and Sorcic, R. G. (1996). “Development of geographic information system-based pavement management system for Salt Lake City.” Transp. Res. Rec., 1524, 16–24.

Leu, S.-S., Chen, C.-N., and Chang, S.-L. (2001). “Data mining for tunnel support stability: Neural network approach.” Autom. Constr., 10(4), 429–441.

Michalski, R. S. (1983). “A theory and methodology of machine learning.” Machine learning: An artificial intelligence approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds., Univ. of Illinois at Urbana-Champaign, Urbana, Ill., 83–134.

Michalski, R. S., and Larson, J. (1975). “AQVAL/1 (AQ7) user’s guide and program description.” Rep. No. 731, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Urbana, Ill.

Michalski, R. S., Mozetic, I., Hong, J., and Lavrac, N. (1986). “The multi-purpose incremental learning system AQ15 and its testing application to three medical domains.” Proc., 5th National Conf. on Artificial Intelligence (AAAI-86), AAAI Press, 1041–1045.

Nassar, K. (2007). “Application of data-mining to state transportation agencies, projects databases.” ITcon, 12, 139–149.

Prechaverakul, S., and Hadipriono, F. C. (1995). “Using a knowledge based expert system and fuzzy logic for minor rehabilitation projects in Ohio.” Transp. Res. Rec., 1497, 19–26.

Quinlan, J. R. (1979). “Discovering rules from large collections of examples: A case study.” Expert systems in the microelectronic age, D. Michie, ed., Edinburgh University Press, Edinburgh, U.K.

Quinlan, J. R. (1986). “Induction of decision trees.” Mach. Learn., 1, 81–106.

Quinlan, J. R. (1993). C4.5: Programs for machine learning, Morgan Kaufmann, San Mateo, Calif.

Saha, A. (2003). “CTree software for Excel.” Classification tree in Excel, <http://www.geocities.com/adotsaha/> (Feb. 24, 2003).

Sarasua, W. A., and Jia, X. (1995). “Framework for integrating GIS-T with KBES: A pavement management system example.” Transp. Res. Rec., 1497, 153–163.

Soibelman, L., and Kim, H. (2000). “Generating construction knowledge with knowledge discovery in databases.” Computing in Civil and Building Engineering, 2, 906–913.

Spring, G. S., and Hummer, J. (1995). “Identification of hazardous highway locations using knowledge-based GIS: A case study.” Transp. Res. Rec., 1497, 83–90.

Tsai, Y., Gao, B., and Lai, J. S. (2004). “Multiyear pavement-rehabilitation planning enabled by geographic information system: Network analyses linked to projects.” Transp. Res. Rec., 1889, 21–30.

Wang, F., Zhang, Z., and Machemehl, R. B. (2003). “Decision-making problem for managing pavement maintenance and rehabilitation projects.” Transp. Res. Rec., 1853, 21–28.
