

Combining case-based reasoning and statistical method for proposing solution in RICAD¹

J. Daengdejᵃ,*, D. Lukoseᵃ, E. Tsuiᵇ, P. Beinatᵇ, L. Prophetᵇ

ᵃ Distributed Artificial Intelligence Centre (DAIC), Department of Mathematics, Statistic and Computing Science, University of New England, Armidale, NSW 2351, Australia

ᵇ Expert Systems Group, Continuum (Australia) Ltd., 201 Miller Street, North Sydney, NSW 2060, Australia

Received 15 May 1997; accepted 29 May 1997

Abstract

Most case-based reasoning (CBR) systems concentrate on retrieving cases which are most similar to a case at hand. When a similar case is found, the system proceeds to adapt (or modify) its solution to solve the case at hand. This method of problem solving cannot be easily applied in our real-world problem domain (i.e. insurance). In this domain, a sufficient number of similar cases has to be retrieved so that the system can confidently calculate the final solution. More than one similar case must be retrieved because the cases which are similar to the one at hand almost always contain inconsistent results. This paper describes a CBR system called risk cost adviser (RICAD) which applies a statistical function in order to propose a reliable answer. RICAD differs from other CBR systems in that, in most cases, in addition to using the statistical function, it has to repeat its reasoning process until an adequate number of cases has been collected to calculate the answer. © 1997 Elsevier Science B.V.

Keywords: Case-based reasoning; Central limit theorem

1. Introduction

Case-based reasoning (CBR) systems deal with new cases by using (or adapting) solutions of previously encountered cases [1,2]. Since CBR systems deal with their problems by using available historical cases rather than other sources of knowledge (e.g. knowledge from human experts), they can be applied to problem domains where there is a lack of clearly defined knowledge [2,3,4]. The CBR approach has been adapted in various ways. For example, one can build a CBR system which only retrieves relevant cases. Systems of this type do not construct a new solution. MetVUW [5], a weather forecasting system, does not construct a new solution, but tries to retrieve only the cases most similar to the problem at hand. A similar approach is used in BankXX [6], another case retrieval system, which is used in the legal domain.

The complexity of how CBR systems construct their solutions varies from simply using the solution of a single matched case to constructing a solution by merging and/or adapting solutions of various historical cases [7]. Case-based planning (CBP) systems such as CaPER [8] (a CBP system for package delivery) and CHEF [9] (a system which suggests an appropriate recipe when cooking) also have complex mechanisms for merging or adapting different plans. Similar mechanisms can be found in CBR systems that are used in design [10]. On the other hand, a number of CBR systems are used for classification tasks, such as finding whether a particular credit card number should be approved or rejected [11]. In this type of CBR system, a single-answer solution indicates to which category a new case should belong (e.g. approve or reject [11]).

The most important aspect in any CBR system is to ensure that the system retrieves only similar cases. This involves using the correct set of indices.² A number of CBR systems built to date predefine the set of indices to be used by means of an inductive method [13,14]. On the other hand, similarity between a new case and historical cases in the case base can be measured by using various similarity measurement techniques [11,14]. No matter which similarity measurement method is chosen, typically, if one or more similar cases are found, the system can confidently propose solutions.

Knowledge-Based Systems 10 (1997) 153–159

0950-7051/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0950-7051(97)00027-0

* Corresponding author. e-mail: [email protected]

¹ Revised version of the article originally published in Knowledge-based Computer Systems: Research and Applications (Eds K.S.R. Anjaneyulu, M. Sasikumar and S. Ramani), Narosa Publishing House, New Delhi, 1997.

² In CBR, indices are used to label all cases in the case base in order to distinguish a case (or a group of similar cases) from others (see Ref. [12] for extensive details on what indices are). This is similar to the way a supermarket assigns bar-codes to all of its products. The bar-code allows a record of a particular product to be quickly retrieved from a database. In our case, indices are represented using attribute-value pairs.

In our case, the real-world data set consists of 2 million cases. Each case contains 30 attributes, and some attributes take up to 2000 different values. As a result, predefining all possible indices in advance is very difficult. Moreover, the traditional method of immediately proposing a solution based on the solutions of matched cases cannot be applied to our real-world data set. In this particular data set, there will be a number of cases that are similar (or even identical) to a given new case. Furthermore, all these similar cases are associated with widely differing solutions.

The purpose of this research is to find an efficient method that can be used for identifying a set of indices when RICAD is dealing with a large number of cases. In addition, since there will be a number of similar cases found for each new case, finding a method that enables the system to create a reliable solution is also an important issue. In this case, RICAD should propose a solution together with its level of confidence in the solution. As a result, two experimental mechanisms have been implemented in RICAD: the dynamic index creation mechanism (DICM), which is used to identify the set of indices at run time; and the confidence identification mechanism (CIM), which enables RICAD to propose its solution with an optimum level of confidence. This paper concentrates only on the second issue, that is, how RICAD is able to propose its solution by using CIM. Details of the method RICAD uses for identifying a set of indices (DICM) can be found in Ref. [15].

First, we describe the main problem under examination (Section 2). This is followed, in Section 3, by a description of RICAD’s architecture. Section 4 discusses the CIM in detail and provides an example of how this mechanism is used to calculate the solution for RICAD. Finally, we summarize the main contribution of this paper and outline the direction of future research.

2. Problem description

Risk reduction is commonly practised in all kinds of organizations, including insurance companies. The insurance business deals directly with different kinds of risk (e.g. personnel, home, car, etc.). In general, the more risk insurers can reduce, the more profit they can gain. In vehicle insurance, a customer has to pay a certain premium to insure their motor vehicle. Usually, the premium for a particular customer is calculated by summing the risk cost together with the approximate management cost and the required profit. Risk cost is the estimated amount of money that is expected to be claimed by a particular customer.

It is quite easy to approximate the management cost and required profit. However, there is no set of rules that can be used for calculating the risk cost of a particular customer, because no one can predict when an accident will occur. Historical records (or cases) of customers seem to be the only source of information that can help insurance companies to predict how much risk cost should be charged to a particular customer. The problem is how to construct an expert system that can help an insurance company deal with the risk cost calculation problem.

Recall that there are 30 attributes in each case. Some of these attributes are: car model, driver age, residential area of the driver, financial type, and claim cost. The value of financial type indicates the type of financial institution from which a customer borrowed money when purchasing a vehicle (i.e. a bank, a finance company, or cash). The value of claim cost represents the amount of money claimed by a particular customer. The risk cost required by the insurance companies is equal to the average claim cost of a particular group of similar cases.

For example, in Table 1, if a new case involves a man who is 19 years old, drives a Toyota Corolla, and lives in a suburb called Newtown, then the risk cost for all similar cases is equal to ($3120 + $12 000)/11 ≈ $1374.55. Note that for this example, 11 historical cases were identified from the case base, but only two of them have a non-zero claim cost (i.e. case numbers 2 and 15, with $3120 and $12 000, respectively).

Table 1. Example of cases

Case number   Vehicle model    Driver age   Suburb     Sum insured   Claim cost
1             Toyota Corolla   19           Newtown    18 000        0
2             Toyota Corolla   19           Newtown    18 000        3120
3             Toyota Corolla   19           Newtown    18 000        0
4             Nissan Pulsar    20           Newtown    18 000        700
5             Toyota Corolla   19           Newtown    18 000        0
6             Toyota Corolla   19           Newtown    18 000        0
7             Ford XR-6        32           St James   60 000        0
8             Toyota Corolla   19           Newtown    18 000        0
9             Nissan Pulsar    34           St James   18 000        13 600
10            Toyota Corolla   19           Newtown    18 000        0
11            Toyota Corolla   19           Newtown    18 000        0
12            Ford XR-6        32           St James   60 000        0
13            Toyota Corolla   19           Newtown    18 000        0
14            Toyota Corolla   19           Newtown    18 000        0
15            Toyota Corolla   19           Newtown    18 000        12 000
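The risk cost calculation for the Table 1 example can be reproduced with a short script. This is only an illustrative sketch: the tuple layout and the function name `risk_cost` are assumptions made here, and exact matching on three attributes stands in for RICAD's actual index-based retrieval.

```python
from statistics import mean

# Rows of Table 1: (case number, vehicle model, driver age, suburb, sum insured, claim cost).
COROLLA = ("Toyota Corolla", 19, "Newtown", 18000)
TABLE_1 = [
    (1, *COROLLA, 0), (2, *COROLLA, 3120), (3, *COROLLA, 0),
    (4, "Nissan Pulsar", 20, "Newtown", 18000, 700),
    (5, *COROLLA, 0), (6, *COROLLA, 0),
    (7, "Ford XR-6", 32, "St James", 60000, 0),
    (8, *COROLLA, 0),
    (9, "Nissan Pulsar", 34, "St James", 18000, 13600),
    (10, *COROLLA, 0), (11, *COROLLA, 0),
    (12, "Ford XR-6", 32, "St James", 60000, 0),
    (13, *COROLLA, 0), (14, *COROLLA, 0), (15, *COROLLA, 12000),
]


def risk_cost(cases, model, age, suburb):
    """Return (average claim cost, number of matches) for exactly matching cases.

    Hypothetical helper: returns (None, 0) when nothing matches.
    """
    claims = [
        claim
        for (_, m, a, s, _, claim) in cases
        if (m, a, s) == (model, age, suburb)
    ]
    if not claims:
        return None, 0
    return mean(claims), len(claims)


cost, matches = risk_cost(TABLE_1, "Toyota Corolla", 19, "Newtown")
print(matches, round(cost, 2))
```

For the Toyota Corolla query this finds the 11 matching cases and averages their claim costs, zeros included, which is why two non-zero claims are divided by 11 rather than by 2.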

One important characteristic of this data set, which led to the implementation of CIM, is that the cases in our case base are very noisy. That is, for a given new case, there will be a number of cases in the case base which are similar to it, but only about 10% of them have a claim cost greater than zero. These 10% of cases also have inconsistent claim costs. In other words, approximately 90% of matched cases will have a claim cost equal to zero. In the worst situation, all similar cases may have a claim cost equal to zero. For example, in Table 1, if the new case involves a man who is 32 years old and drives a Ford XR-6, there are only two matched cases (i.e. case numbers 7 and 12), and both of these cases have a claim cost equal to zero. For this example, if we use the traditional CBR approach (i.e. propose the solution based on the solutions of similar cases), then the system would suggest that all men who are 32 years old and drive a Ford XR-6 should be treated as carrying no risk at all. However, we cannot let the system simply suggest such an answer. Further analysis has to be made by the system, even though its first retrieval may suggest that the risk cost should be equal to zero. This further analysis or reasoning involves retrieving additional similar cases into the reasoning process.

The system retrieves these cases by relaxing the values of its indices. For example, in this case, the system may retrieve additional cases of men who are 31 and 33 into its reasoning process. The value relaxation process may be performed for up to three iterations for each attribute. In this example, the first relaxation adjusts the original value of age (i.e. 32 years old) by ±1. If the number of retrieved cases is still not adequate (Section 4 explains how the system calculates the number of cases required), then the system has to perform the second relaxation, which relaxes the original value of age by ±2. If after ±3 the system still cannot retrieve enough cases, then the values of other attributes will be relaxed instead. This is because we want the system to retrieve an adequate number of cases by making small relaxations on the values of a number of attributes rather than a large relaxation on only one attribute. The system uses two sources of similarity knowledge: human expert knowledge and statistically derived information. Fig. 1 shows an example of the similarity measurement knowledge used in RICAD. The details of this knowledge and how it is derived can be found in Ref. [15].
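The iterative relaxation of a numeric index such as age can be sketched as follows. The function names and the toy list of ages are hypothetical, and the sketch relaxes a single attribute only; how RICAD moves on to other attributes is described in Ref. [15] and is not modelled here.

```python
def relaxed_age_ranges(age, max_steps=3):
    """Return the (low, high) ranges for relaxations of +/-1 up to +/-max_steps."""
    return [(age - step, age + step) for step in range(1, max_steps + 1)]


def retrieve_with_relaxation(ages_in_case_base, age, required, max_steps=3):
    """Widen the age range until `required` cases match or the limit is reached.

    Returns (matched ages, number of relaxation steps used); zero steps
    means that exact matching already returned enough cases.
    """
    matched = [a for a in ages_in_case_base if a == age]
    steps = 0
    for low, high in relaxed_age_ranges(age, max_steps):
        if len(matched) >= required:
            break
        steps += 1
        matched = [a for a in ages_in_case_base if low <= a <= high]
    return matched, steps


# Toy data for illustration only, not taken from the RICAD case base.
ages = [19, 20, 32, 32, 34, 31, 33, 30]
print(relaxed_age_ranges(32))
print(retrieve_with_relaxation(ages, 32, required=4))
```

The `max_steps` argument plays the role of the termination point: once it is exhausted, the caller would have to relax a different attribute or accept fewer cases.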

Part of Fig. 1 depicts the hierarchical structure of four-door cars. The Toyota Starlet, Mazda 121, and Daihatsu Charade are classified as small cars, while the Toyota Camry, Mitsubishi Magna, and Ford Falcon are classified as family cars. Here, the Toyota Camry is more similar to the Ford Falcon than to the Toyota Starlet. The spatial structure in Fig. 1 is used to represent the values of all postcodes used in RICAD (here assume that A, B, ..., I are postcodes). For example, if the value of the postcode of a query case is 2010, then postcodes 2011, 2016, 2021, 2033 and 2000 are closer to the specified value than postcodes 2041, 2044 and 2034.
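One simple way to make the taxonomic part of this knowledge concrete is to store the hierarchy as a child-to-parent map and count the edges between two car models. The edge-count distance below is an assumption chosen for illustration, not the similarity measure RICAD is documented to use; only the six models and two categories named in the text are included.

```python
# Child -> parent links for the part of the hierarchy described in the text.
PARENT = {
    "Toyota Starlet": "small car",
    "Mazda 121": "small car",
    "Daihatsu Charade": "small car",
    "Toyota Camry": "family car",
    "Mitsubishi Magna": "family car",
    "Ford Falcon": "family car",
    "small car": "four-door car",
    "family car": "four-door car",
}


def ancestors(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def taxonomic_distance(a, b):
    """Count the edges between a and b via their closest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("nodes do not share a common ancestor")


print(taxonomic_distance("Toyota Camry", "Ford Falcon"))
print(taxonomic_distance("Toyota Camry", "Toyota Starlet"))
```

Under this measure the Camry is closer to the Falcon (same category) than to the Starlet (different category), which agrees with the ordering stated in the text.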

3. RICAD’s architecture

Since all historical cases of customers contain confidential information, they have to be securely stored in the database. RICAD has to retrieve similar cases from this database by constructing an appropriate SQL statement at run time. Fig. 2 shows RICAD’s architecture.

There are seven main components in RICAD:

1. Pre-analysis mechanism. This mechanism is used by a knowledge engineer when analysing all cases in the case base in order to update the domain knowledge (i.e. as shown in Fig. 1) and the heuristics used by the system.

2. Dynamic index creation mechanism (DICM). When the user enters a new case into RICAD, DICM is the first component which deals with the input. DICM performs three sub-tasks:

   2.1 It heuristically selects important attributes from the new case that should be used as indices.

   2.2 It assigns an appropriate range which is used to retrieve partially matched cases from the case base (i.e. depending on how the system relaxes the value of each attribute). This range is used when the system constructs its SQL statement. For example, an appropriate range for values of age might be 31–33 if it results from the first iteration. The range might then be changed to 30–34 if the number of cases retrieved is not adequate and the system performs the second iteration.

   2.3 It creates an appropriate SQL statement for the new case. An example of an automatically generated SQL statement for a case of a man who is 32 years old, lives in a suburb called Redfern, drives a Nissan 300ZX, borrowed money from the bank, and commercially bought the car is shown in the bottom left corner of Fig. 3. The resultant SQL statement from DICM is then passed to the next mechanism.

3. Case retrieval mechanism. This mechanism consists of an SQL processor. It uses the SQL statements produced by DICM and retrieves all matching cases from the case base.

4. Confidence identification mechanism (CIM). This mechanism is used to calculate the mean and standard deviation of the claim cost from all retrieved cases. In addition, it also calculates the number of cases required to propose a reliable and valid solution. This is discussed in more detail in Section 4.

5. Case ranking mechanism. If the number of matched cases is greater than what is required by the system, RICAD will use this mechanism to rank all matched cases, from the most to the least similar, based on their similarity score. Starting from the most similar case, only a certain number of cases will be selected. For example, if the system requires 100 cases, but 116 cases were retrieved, then after ranking only cases numbered 1–100 will be passed to the next process.

6. Risk cost calculation mechanism. Recall that the risk cost is equal to the average value of the claim cost for a particular group of cases. This mechanism calculates the risk cost for all cases that are passed from the previous process. This risk cost value will then be proposed as the final recommendation.

7. Case base update mechanism. After the risk cost is known, this mechanism adds the new case, with this risk cost, to the case base for future reference.

Fig. 1. Example of the taxonomic and spatially structured knowledge in RICAD.

Fig. 2. Architecture of RICAD.

Fig. 3. A sample run of RICAD.
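Sub-task 2.3 of DICM, building an SQL statement from exact index values and relaxed ranges, might look roughly like the sketch below. The table name `cases`, the column names, and the function `build_query` are all hypothetical; the statement actually generated by RICAD appears only in Fig. 3 and is not reproduced in the text. The `?` placeholders follow the "qmark" parameter style used by, for example, Python's `sqlite3` module.

```python
def build_query(exact, ranges, table="cases"):
    """Build a parameterised SELECT from exact-match and range indices.

    `exact` maps column -> required value; `ranges` maps column -> (low, high).
    Column and table names are interpolated directly, so they must come from
    a trusted, predefined list of attributes; values are passed as parameters.
    """
    clauses, params = [], []
    for column, value in exact.items():
        clauses.append(f"{column} = ?")
        params.append(value)
    for column, (low, high) in ranges.items():
        clauses.append(f"{column} BETWEEN ? AND ?")
        params.extend([low, high])
    where = " AND ".join(clauses) if clauses else "1 = 1"
    return f"SELECT * FROM {table} WHERE {where}", params


# First relaxation of age for the 32-year-old Ford XR-6 driver.
sql, params = build_query(
    exact={"vehicle_model": "Ford XR-6"},
    ranges={"driver_age": (31, 33)},
)
print(sql)
print(params)
```

Keeping the relaxed range as data rather than as text makes the second iteration a matter of calling `build_query` again with `(30, 34)`.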

4. Confidence identification mechanism

As stated earlier, for a given new case, about 90% of matched cases have a claim cost equal to zero. Furthermore, in the worst case, all matched cases can have zero claim cost. For example, consider the sample cases listed in Table 1, particularly the two historical cases which contain the Ford XR-6 (i.e. case numbers 7 and 12). If one tries to calculate the risk cost for customers driving this type of car by retrieving historical cases from the case base, the result will be zero. On the other hand, there are also two historical cases which contain the Nissan Pulsar (i.e. case numbers 4 and 9), and the approximate risk cost associated with customers driving this type of car is equal to $7150 [($13 600 + $700)/2].

Such a conclusion will certainly not be appropriate for any new customer who drives a Nissan Pulsar. It is also not correct to judge that a Ford XR-6 will never have any accidents in the future. However, in most CBR systems that propose a single numeric answer, once all similar cases are selected, the system simply calculates the mean of the solutions of the selected cases. In addition, such systems do not provide any indication of how much confidence they have when proposing their solutions. For example, in the case of customers who drive a Ford XR-6, what is the confidence of the system when proposing that there will be no claims against any customer who drives this particular car? As a result, a mechanism that allows RICAD to propose a more reliable solution is required.

In CBR, ensuring that the system retrieves only the most relevant cases is a necessary feature. For RICAD, similarity between historical cases and the case at hand is measured by using similarity measurement knowledge as shown in Fig. 1. As mentioned previously, RICAD retrieves more similar cases by relaxing the content of the indices. This relaxation may be performed for up to three iterations for each attribute. According to the hierarchical structure of car models shown in Fig. 1, when relaxing the value of this particular attribute, if there is only a small number of cases which contain the Ford Falcon, then RICAD has to adjust its indices (the value of the attribute car model) to the next closest or most similar car models (i.e. the Toyota Camry and Mitsubishi Magna). If the number of similar cases after using a value of car model equal to the Ford Falcon, Toyota Camry and Mitsubishi Magna is still small, RICAD has to readjust the value of car model to the next level (i.e. small car or high-performance car).

As mentioned previously, in order to retrieve an adequate number of cases when solving a problem, the system may need to relax the values of a number of attributes. The decision as to which attribute values should be adjusted first or last (i.e. the order of relaxation) is based on the average claim cost associated with each particular value. For example, consider the case of a 22-year-old man who drives a Ford Falcon. Out of all cases in the case base, if the average claim cost of 22-year-old men is higher than that of Ford Falcon drivers, then the value of the attribute car model will be relaxed first. This is because an attribute whose value has a higher average claim cost than the others is considered to be an important attribute. We want to retrieve similar cases from the case base while avoiding relaxing the value of the important attribute, because we want to concentrate on that particular attribute. In this example, we want to retrieve all cases which contain 22 as the value of driver age, while the value of the attribute car model may be relaxed to include all car models which are in the same category as the Ford Falcon. Examples and details of how to find the order of relaxation can be found in Ref. [15].
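The ordering rule described above can be sketched in a few lines: compute the average claim cost of all cases sharing each query value, then relax the attribute with the lowest average first. The dictionary layout, the key names, and the three toy cases are assumptions for illustration; the full procedure is given in Ref. [15].

```python
from statistics import mean


def relaxation_order(cases, query, claim_key="claim_cost"):
    """Return the query attributes ordered from first-relaxed to last-relaxed.

    An attribute whose query value has a lower average claim cost is treated
    as less important and is therefore relaxed earlier.
    """
    averages = {}
    for attribute, value in query.items():
        claims = [case[claim_key] for case in cases if case[attribute] == value]
        averages[attribute] = mean(claims) if claims else 0.0
    return sorted(averages, key=averages.get)


# Toy cases for illustration only.
cases = [
    {"driver_age": 22, "car_model": "Ford Falcon", "claim_cost": 1000},
    {"driver_age": 22, "car_model": "Toyota Camry", "claim_cost": 3000},
    {"driver_age": 40, "car_model": "Ford Falcon", "claim_cost": 0},
]
query = {"driver_age": 22, "car_model": "Ford Falcon"}
print(relaxation_order(cases, query))
```

In the toy data, 22-year-old drivers average a higher claim cost than Ford Falcon drivers, so car model is relaxed first and driver age is kept fixed for as long as possible.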

Recall that the noise associated with our data set mainly comes from differences in claim cost. Apart from having to retrieve an adequate number of cases before proposing its answer, RICAD also takes the value of the claim cost into consideration when solving a problem. The value of the claim cost plays a very important role when RICAD is proposing its answer. In RICAD, the CIM is responsible for identifying a reliable risk cost. A reliable answer is achieved by calculating the answer based on a reasonable number of cases (how RICAD calculates the number of required cases is discussed in the next section). Furthermore, the cases that are retrieved should also have minimal differences in their claim cost. In order to identify the number of cases which should be retrieved, and to judge whether those cases have the least differences in their claim cost, CIM applies the closeness factor [16]. The closeness factor is a statistical function which is adapted from the central limit theorem [17]. The following explains the objective of using the closeness factor.

For a given new case, there will be a number of similar cases that can be retrieved from the case base. Even though the attribute values of these cases may be similar to each other, their claim costs are always (largely) different from each other. We want to retrieve the cases whose values are closest to the ones appearing in the query case, but we also want only those cases that have a similar amount of claim cost attached to them. Recall that the system may perform a number of iterations in order to retrieve an adequate number of cases. Let us assume that the average claim cost from the first case retrieval is $400. Then, if a second retrieval is required due to an inadequate number of cases, we want these additional cases to have a similar average claim cost, say within ±10% of the average claim cost of the first retrieval. The closeness factor is used to control the case retrieval and ensure that the system retrieves enough cases which have a similar amount of claim cost attached to them.

4.1. Statistical closeness

As noted previously, RICAD bases its answer on a number of similar cases with the least difference in their claim cost. The closeness factor is used to indicate how close we want the claim cost of additional cases to be to the average claim cost arising from the first retrieval. For example, consider the case of a man who is 32 years old. Let us assume that the average claim cost of all men in this age group is $400. If we want the system to retrieve additional cases which have their average claim cost within 10% of the actual mean ($400), then the closeness value is equal to 0.1. This also means that we want the system to retrieve additional cases which have their average claim cost between $360 and $440. Consequently, the system has to adjust the values of its indices and repeat the case-retrieval process until it can find enough cases which have their average claim cost within this range. However, the boundaries of relaxation have to be predefined. For example, we may not allow the system to adjust the value of age by more than ±3 years from the original value. In RICAD, the termination points are predefined by a human expert.

A closeness factor is based on the statistical properties of the retrieved cases. An initial closeness factor is calculated based on the cases which result from the first case retrieval, and it is then recalculated once the indices are adjusted and more cases are retrieved.

The central limit theorem states that whenever the sample size \(n\) is large (\(n > 30\)), the sampling distribution of \(\bar{X}\) can be approximated by a normal probability distribution [17]. Symbolically this can be written as

\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0,1) \]

where \(\bar{X}\) is the sample mean, \(\mu\) is the population mean, \(\sigma\) is the population standard deviation of the claim cost, and \(n\) is the sample size (i.e. the number of retrieved cases). Accordingly, assuming that we want to be 95% confident that the estimated claim cost is within \(C\mu\) of the population mean, then

\[ P\left( \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < \frac{C\mu}{\sigma / \sqrt{n}} \right) = 0.95 \;\Rightarrow\; \frac{C\mu}{\sigma / \sqrt{n}} = 1.96 \]

(Note: 1.96 is the 97.5th percentile of \(N(0,1)\).)

where \(C\) is the closeness factor.

According to Beinat [16], the population parameters are unknown, but they can be estimated from the sample. That is, \(\mu\) and \(\sigma\) can be replaced by \(\hat{\mu}\) (sample mean) and \(\hat{\sigma}\) (sample standard deviation), respectively. As a result, the value of \(C\) is equal to:

\[ C = \frac{1.96\,\hat{\sigma}}{\hat{\mu}\sqrt{n}} \tag{1} \]

where \(n\) is the number of cases that should be retrieved in order to achieve 95% confidence, while \(\hat{\mu}\) and \(\hat{\sigma}\) are the mean and standard deviation of the claim cost after adding the additional cases, respectively.

To judge whether the additional cases are good enough to be included when proposing the answer, RICAD calculates the closeness factor for all exactly matched cases (\(C_a\)) and then recalculates it with the additional cases included (\(C_{a+b}\)).

\[ \nabla C = \frac{C_a}{C_{a+b}} \tag{2} \]

According to Eq. (2), if \(\nabla C > 1\), the additional cases improve the closeness factor. In this case, the additional similar cases will be used when the system is calculating its answer. However, if \(\nabla C < 1\), the system has to try to adjust the value(s) of other attribute(s) and perform a further retrieval process.
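Eqs. (1) and (2) translate directly into code. The sketch below makes several assumptions that the paper does not state: it uses the sample standard deviation with an n − 1 denominator (`statistics.stdev`), it returns infinity when the mean claim cost is zero (as for the two Ford XR-6 cases), and it accepts the additional cases outright when the combined closeness factor is zero. The function names are hypothetical.

```python
import math
from statistics import mean, stdev

Z_95 = 1.96  # normal quantile used in Eq. (1)


def closeness_factor(claims, z=Z_95):
    """Eq. (1): C = z * s / (m * sqrt(n)) for a list of claim costs."""
    n = len(claims)
    if n < 2:
        raise ValueError("at least two cases are needed to estimate the spread")
    m = mean(claims)
    if m == 0:
        # Assumed convention: a zero mean gives no usable relative bound.
        return math.inf
    return z * stdev(claims) / (m * math.sqrt(n))


def improves_closeness(exact_claims, additional_claims):
    """Eq. (2): True when C_a / C_(a+b) > 1, i.e. the extra cases tighten C."""
    c_a = closeness_factor(exact_claims)
    c_ab = closeness_factor(exact_claims + additional_claims)
    if c_ab == 0:
        # Assumed convention: identical claim costs count as an improvement.
        return True
    return c_a / c_ab > 1


# Toy claim costs for illustration only.
print(improves_closeness([300, 500], [400, 400, 400, 400]))
print(improves_closeness([300, 500], [5000]))
```

The first call adds cases close to the existing mean and so reduces C; the second adds a single outlying claim, which inflates the spread faster than the larger sample size can compensate.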

4.2. How closeness is applied in RICAD

It can be deduced from Eq. (1) that the larger the number of similar cases (n), the smaller the C value will become. The value of C indicates how close we want the average claim cost of additional cases to be to the average claim cost of cases from the first retrieval. For example, if we want the final solution to be within 3% of the mean value of the claim cost of the cases which are initially retrieved, then C is equal to 0.03. Consider the following example:

The mean and the standard deviation of all claim costs from the first retrieval are equal to 850 and 190, respectively. Furthermore, a knowledge engineer requires RICAD to propose its solution with 95% confidence and the average claim cost of additional cases to be within 3% of the actual mean value. Then, the number of cases that have to be retrieved is given by

\[ 0.03 = \frac{1.96 \times 190}{850\sqrt{n}} \;\Rightarrow\; n \approx 213 \text{ cases} \]

From the above example, RICAD needs to retrieve at least 213 similar cases from the case base into its reasoning process before it can calculate the final solution (risk cost). If RICAD can find 213 similar cases (or more), it can then immediately compute the risk cost by calculating the average value of claim cost from all retrieved cases. In this case, RICAD is 95% confident that its answer will be within ±3% of the overall mean value. That is, RICAD is 95% confident that the risk cost should be between $824.50 and $875.50. On the other hand, if the number of similar cases that RICAD has found after reaching the termination points is less than 213, then its confidence level will decrease accordingly.
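The worked example can be checked by solving Eq. (1) for n, which gives n = (1.96 × s / (C × m))². The helper names below are hypothetical. Note that the expression evaluates to roughly 213.3, which the text reports as 213; rounding up to the next whole case would give 214.

```python
import math


def required_cases(mean_claim, std_claim, closeness, z=1.96):
    """Solve Eq. (1) for n: the (fractional) number of cases needed."""
    return (z * std_claim / (closeness * mean_claim)) ** 2


def achieved_closeness(mean_claim, std_claim, n, z=1.96):
    """Eq. (1) evaluated for the number of cases actually retrieved."""
    return z * std_claim / (mean_claim * math.sqrt(n))


def confidence_bounds(mean_claim, closeness):
    """Return the (lower, upper) bounds mean * (1 -/+ closeness)."""
    return mean_claim * (1 - closeness), mean_claim * (1 + closeness)


n = required_cases(850, 190, 0.03)
low, high = confidence_bounds(850, 0.03)
print(round(n, 2), math.ceil(n))
print(round(low, 2), round(high, 2))
# With only 100 cases, the achievable closeness is looser than the 3% target.
print(round(achieved_closeness(850, 190, 100), 4))
```

`achieved_closeness` illustrates the final remark in the text: when fewer cases are available at the termination point, the same 95% level only supports a wider band around the mean.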

5. Conclusion

Producing a reliable solution is one of the problems which needs to be considered when building CBR systems. We have briefly described the architecture of RICAD, and discussed how RICAD proposes its answer by applying a statistical method called the closeness factor. The use of the closeness factor indirectly forces RICAD to retrieve a certain number of historical cases in order to achieve a user-selected confidence level. As a result, RICAD has to iteratively adjust its indices and perform the case-retrieval process until either the required number of similar cases is found, or no further improvement can be achieved. The results of our experiments suggest that the closeness factor is a better approach for CBR systems which propose a single numerical answer.

The problem solving approach in RICAD can easily be applied to other domains (especially in the financial sectors, e.g. banking). Currently, we are in the process of refining and evaluating RICAD. This also includes trying to identify the limitations of the current architecture, especially the limitations of DICM and CIM.

References

[1] J. Kolodner, Case-Based Reasoning, Morgan Kaufmann, 1993.

[2] I. Watson, An introduction to case-based reasoning, in: I. Watson (Ed.), Proceedings of the First United Kingdom Workshop: Progress in Case-Based Reasoning, Springer, New York, 1995.

[3] S. Rougegrez, A case-based reasoning system that avoid the problem of the case identification, in: Proceedings of International Conference on System, Man, and Cybernetics: System Engineering in the Service of Humans, Vol. 3, 1993, pp. 182–186.

[4] C. Bento, P. Machado, E. Costa, Evaluation of RECIDEpsy—an adviser in the domain of psychology, in: D.W. Aha (Ed.), Proceedings of AAAI Workshop on Case-Based Reasoning, AAAI Press, 1994.

[5] E.K. Jones, A. Roydhouse, Iterative design of case retrieval systems, in: D.W. Aha (Ed.), Proceedings of AAAI Workshop on Case-Based Reasoning, AAAI Press, 1994.

[6] E.L. Rissland, D.B. Skalak, M.T. Friedman, Evaluating BankXX: heuristic harvesting of information for case-based argument, in: D.W. Aha (Ed.), Proceedings of AAAI Workshop on Case-Based Reasoning, AAAI Press, 1994.

[7] K. Hanney, M. Keane, B. Smyth, P. Cunningham, What kind of adaptation do CBR systems need? A review of current practice, in: AAAI-95 Fall Symposium Series, Workshop on Adaptation of Knowledge for Reuse, MIT, Cambridge, MA, 1995.

[8] B. Kettler, J. Hendler, Evaluating a case-based planning system, in: D.W. Aha (Ed.), Proceedings of AAAI Workshop on Case-Based Reasoning, AAAI Press, 1994.

[9] K. Hammond, CHEF: a model of case-based planning, in: Proceedings of AAAI-86, MA: AAAI Press/MIT Press, Cambridge, MA, 1986.

[10] B. Ungureanu, D. Rusu, I. Ziman, Case-based assistance in CAD, in: M. Keane, J-P. Haton, M. Manago (Eds.), Proceedings of the Second European Workshop on Case-Based Reasoning (EWCBR 94), Springer, New York, 1994.

[11] E.B. Reategui, J.A. Campbell, Classification system for credit card transactions, in: M. Keane, J-P. Haton, M. Manago (Eds.), Proceedings of the Second European Workshop on Case-Based Reasoning (EWCBR 94), Springer, New York, 1994.

[12] J. Kolodner, D.B. Leake, A tutorial introduction to case-based reasoning, in: D.B. Leake (Ed.), Case-Based Reasoning: Experience, Lessons, and Future Directions, AAAI Press/MIT Press, Menlo Park, CA, 1996.

[13] K. Goss, Preselection strategies for case-based classification, in: Proceedings of the 18th German Annual Conference on Artificial Intelligence, Springer, Berlin, 1994.

[14] R.A. Barletta, Hybrid indexing and retrieval strategy for advisory CBR systems built with ReMind, in: M. Keane, J-P. Haton, M. Manago (Eds.), Proceedings of the Second European Workshop on Case-Based Reasoning (EWCBR 94), Springer, New York, 1994.

[15] J. Daengdej, D. Lukose, E. Tsui, P. Beinat, L. Prophet, Dynamically creating indices for 2 million cases: a real world problem, in: Proceedings of the European Workshop in Case-Based Reasoning (EWCBR’96), Springer, Berlin, Germany, 14–16 November, 1996.

[16] P. Beinat, L. Prophet, L. Tranquille, Mining data with enzymes (submitted).

[17] D.R. Anderson, D.J. Sweeney, T.A. Williams, Introduction to Statistics: An Applications Approach, West Publishing, 1981.

J. Daengdej et al./Knowledge-Based Systems 10 (1997) 153–159