
Knowledge and Information Systems (2003) 5: 183–200. DOI 10.1007/s10115-003-0098-5. © 2003 Springer-Verlag London Ltd.

Discovering Similar Patterns for Characterizing Time Series in a Medical Domain

Fernando Alonso1, Juan P. Caraça-Valente1, Loïc Martínez1 and César Montes2

1Department of Languages and Systems, Polytechnic University Madrid, Boadilla del Monte, Spain
2Department of Artificial Intelligence, Polytechnic University Madrid, Boadilla del Monte, Spain

Abstract. In this article, we describe the process of discovering similar patterns in time series and creating reference models for population groups in a medical domain, and particularly in the field of physiotherapy, using data mining techniques on a set of isokinetic data. The discovered knowledge was evaluated against the expertise of a physician specialized in isokinetic techniques, and applied in the I4 (Intelligent Interpretation of Isokinetic Information) project developed in conjunction with the Spanish National Center for Sports Research and Sciences for muscular diagnosis and rehabilitation, injury prevention, training evaluation and planning, etc., of elite athletes and ordinary people.

Keywords: Data mining; Isokinetics; Time series characterization

1. Introduction

An important domain for the application of data mining (DM) in the medical field is physiotherapy and, more specifically, muscle function assessment based on isokinetic data. Physicians collect this data using a mechanical instrument called an isokinetics machine (Fig. 1a). This machine can be described as a piece of apparatus on which patients perform strength exercises (in this case, knee extensions and flexions). This machine has the peculiarity of limiting the range of movement and the intensity of effort at a constant velocity (which explains the term isokinetic). Data concerning the strength exerted by the patient throughout the exercise is recorded and stored in the machine so that physicians can visually analyze the results using specialized computer software. This data takes the form of a strength curve, with additional information on the angle of the knee (Fig. 1b).

Received 22 Feb 2002. Revised 25 Jul 2002. Accepted 14 Aug 2002.


Fig. 1. Diagram of isokinetics machine and resultant data.

The top half of the curve represents extensions (knee angle from 90° to 0°) and the bottom half, flexions (knee angle from 0° to 90°).

The isokinetic data is processed by the I4 system (Intelligent Interpretation of Isokinetic Information), which has built-in knowledge elicited from an expert in isokinetics. It cleans and pre-processes the data and conducts an intelligent analysis of the parameters and morphology of the isokinetic curves. DM techniques on time series were applied to analyze isokinetic exercises in order to discover new and useful information for later use in a range of applications. Patterns discovered in isokinetic exercises performed by injured patients are very useful, in particular for monitoring injuries, detecting potential injuries early or discovering fraudulent sickness leave. These patterns are also useful for creating reference models for population groups that share certain characteristics.

Figure 2 shows the functional structure of the I4 system. First, data collected from the isokinetic machine is transformed, formatted and stored in a database (test database). Then data is pre-processed to build a clean and normalized database. Once the data has been prepared, it can be either directly displayed or automatically processed using either of the two main analysis components. The first one (KBS) is a knowledge-based system used to analyze individual exercises (Alonso et al, 2001a). The second one is a knowledge discovery in databases (KDD) system, which is the focus of this paper. The results of both modules are shown to the user by the visualization component.

The article describes the process of discovering similar patterns in time series and how to create reference models for population groups in the field of isokinetic physiotherapy, using data mining techniques.

Section 1.1 presents the state of the art regarding work related to time series. Section 2 deals with the discovery of similar patterns in time series. It describes isokinetic data pre-processing and the algorithm used to discover similar patterns. This algorithm is suitable for searching a large set of time series of non-homogeneous lengths and has interesting features as regards execution time depending on the number of series values and the number of series.

Section 3 describes the automatic creation of reference models for population groups, using the discrete Fourier transform to transfer the exercises from the time domain to the frequency domain and the fast Fourier transform to characterize curve trends.

Section 4 presents the discovered knowledge evaluation process (pattern-based injury detection and model-based population characterization), addressing the inherent difficulties of there being no background knowledge regarding the behavior of most of the evaluated populations. Finally, Section 5 discusses the conclusions of this research.

Fig. 2. Overview of the I4 system.

1.1. Current Related Work

Medical applications are not, of course, the only ones in which data mining in time series can be extremely useful. The analysis of time-ordered data sets is essential in many other fields, including engineering, meteorology or the business world. The future behavior of a variable can be predicted by studying how it behaved before. Moreover, time series could be characterized to improve diagnostic tasks. Again, determining what other values have behaved similarly can be an aid for deciding what actions to take either to continue or reverse a trend. Data mining techniques are very useful tools for doing this.

There is a growing need to search databases for similar time series data. For example, we might want to find companies with a similar growth pattern or discover products with similar sales patterns. One important question is to decide what similarity means. The simplest solution is to calculate some sort of distance, like the Euclidean or Manhattan distance, between two time series. These series are considered similar if the above distance is less than a given threshold value. In an attempt at reducing the time taken to calculate this distance, some authors (Agrawal and Srikant, 1994; Faloutsos et al, 1994) suggest that the Fourier transform (FT) be used to transfer the series from the time to the frequency domain, using only the first coefficients to filter dissimilar series.
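The threshold test described above can be sketched in a few lines. This is a minimal illustration, not code from the I4 system; the function names (`euclidean`, `manhattan`, `is_similar`) and the threshold values are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def is_similar(a, b, threshold, distance=euclidean):
    """Two series are similar if their distance is below the threshold."""
    return distance(a, b) < threshold
```

Note that both distances require series of the same length, which is one reason the total and partial comparison problems are treated separately.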

Agrawal and Srikant (1994) divide queries concerning similarity between time series into two categories: total comparison, where the sequences for comparison have the same length, or partial comparison, which involves checking whether a sequence appears as part of another.

Most of the work on total comparison has focused on the search for a particular sequence within a set of time series or on searching for all the sequences that are similar to a given one. The basic technique is to index all the time series, using some sort of spatial access method, like R* trees (Beckman et al, 1990). This index will contain the first coefficients of the Fourier transform of the series. Thus, it will be possible to find similar patterns by running through sequences that are close together in the R* tree rather than through all the sequences. More recently, other transform types (mainly the wavelet transform, WT) have proved to perform better than the Fourier transform in some domains or applications (Povinelli, 1999). Thus, whereas the FT stores the general curve trend in its first coefficients, the WT encodes a coarser resolution of the original time sequence (Chan and Fu, 1999).

Another question to be taken into account is the similarity metric to be used. The Euclidean distance is the most commonly used. However, it is beset by a series of problems, as it is not invariant to transformations like time shifts, changes of scale, etc. Other distance types, like the warping distance, which provides for changes and shifts on the time scale, can be better suited for a variety of problems (Park et al, 2000). The proposal by Perng et al (2000) is more original, and proposes a different similarity model that is based on the identification of the singular points of each data sequence (points of inflection, maxima and minima, etc.) rather than on the use of a metric.
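To illustrate how a warping distance tolerates shifts on the time scale, the sketch below implements the textbook dynamic time warping recurrence with an absolute-difference local cost. It is an assumption that this formulation is close to the one used in the cited work; the function name `dtw_distance` is chosen here for illustration only.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two series.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j];
    each step may advance in a, in b, or in both.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

For instance, `[0, 1, 2, 3]` and its delayed copy `[0, 0, 1, 2, 3]` have different lengths, so the Euclidean distance is not even defined for them, whereas the warping distance aligns the repeated value at no cost.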

Partial comparison, or subsequence matching, that is, the search for subseries that are repeated throughout a particular series, is a tougher problem that has gained in importance recently. Here, the query sequence is potentially shorter than the sequences in the database. For example, one might ask for companies that at some point behaved similarly to company x in the spring of the year 2000. The a priori property was of fundamental importance in the early approaches to this problem: Agrawal and Srikant (1995) developed an a priori algorithm-based technique to discover sequential patterns; Mannila et al (1995) address the problem of recognizing frequent episodes in event collections.

An interesting line of work is the indexing of the traces that represent subsequences of a longer sequence, which we owe to Faloutsos et al (1994). The basic technique involves indexing the n-dimensional points output by applying the Fourier transform to the subsequences of length l within a sequence L (L ≫ l) using an R-tree. These subsequences are output by means of small shifts, which means that they are very similar to each other. The subsequent search will be performed in the transform domain by means of the R-tree. Kahveci and Singh (2001) refine this technique, proposing a matrix of R-trees as the indexing structure, where the node T_ij of the matrix stores the R-tree for the subsequences of length 2^i of the sequence S_j of the DB. Accordingly, they store different search resolutions and speed up the search by sequence, providing for the use of parallelism.

A third category in time series comparison is pattern search. This is not, strictly speaking, a partial comparison problem, as the aim is not to locate the presence of a known subsequence in a set of sequences: the problem involves finding subsequences that are frequently repeated in a set of time series, about which there is, however, no background knowledge. There are several examples of pattern searching in symbolic series. Han et al (1998) try to find subsequences that are periodically repeated within a symbolic sequence. The a priori property is also used, in this case to prune infrequent patterns. Their algorithm detects patterns of any length, while the pruning of infrequent patterns succeeds in providing high efficiency. The time series classification problem, addressed by the machine learning community, resembles pattern searching in many ways. Geurts (2001) tries to classify a series of objects on the basis of variables that are time series. For this purpose, he defines a set of tests to give the measure of the presence of a pattern in a series, and these tests are the decision nodes of a future classification tree.


Our case differs significantly from most of the above-mentioned work. The DB is basically composed of one time series per patient (which provides the values of the muscular strength exerted by a joint) and the aim is to find a pattern (or subsequence) in this series that is characteristic of any given muscular or joint dysfunction. A priori there is no expert knowledge of what the patterns characteristic of any injury type are like. Therefore, if we are to find a means of characterizing patients with a given injury X, we have to look for patients that we know have this injury, get a series of respective time series and search for subsequences that are repeated in all these series and that do not occur in healthy patients. Only subsequences that are repeated in a sufficient number of data series can be considered patterns. Blind search, that is, without background knowledge, is the main obstacle to be overcome by the method proposed later and is the main contribution of this paper.

Han et al’s algorithm has proved to be useful for the above-mentioned problem. However, this algorithm only discovers patterns that are repeated exactly (although some points of the sequence can be ignored) in different series (symbolic values), which means that it is not suitable for searching for similar patterns in more than one time series (real values). This has obliged us to design a new algorithm, based on Han’s, to deal with this problem.

2. Discovering Similar Patterns in Time Series

One of the most important potential applications of DM algorithms for this sort of time series is to detect parts of the graph that are representative in order to characterize the series. As far as isokinetic exercises are concerned, the presence of this sort of pattern could be representative of some kind of injury, and the correct identification of the deviation could be an aid for detecting the injury in time. So, the identification of patterns (similar portions of data that are repeated in more than one exercise graph of different patients) is of vital importance for being able to establish criteria by means of which to classify the exercises and, therefore, patients.

Isokinetic exercises have a series of characteristics that cannot be overlooked when designing a pattern identification algorithm. Owing to the special characteristics of the individuals who complete isokinetic exercises, the patterns may have different amplitudes and be distant in time. Therefore, some sort of distance has to be used to take into account not only the parts that are repeated exactly but also any that are more or less the same.

Another particularity to be considered in pattern search is that there is no expert knowledge about the possible patterns and their length. Therefore, all the exercises have to be run through to get patterns of any length. The memory consumption and execution time of this process can be very high, and these are both factors that were considered in algorithm design.

The process of developing a DM method for identifying patterns that potentially characterize some sort of injury was divided into two phases:

1. Develop an algorithm that detects similar patterns in exercises.

2. Use the algorithm developed in point 1 to detect any patterns that appear in exercises done by patients with injuries and do not appear in exercises completed by healthy individuals.


The algorithm developed is capable of detecting similar sequential patterns in a set of time series. It reuses some state-of-the-art ideas (Agrawal et al, 1993; Faloutsos et al, 1994b; Han et al, 1998), like the R-tree for indexing patterns, the a priori property to prune the search tree, etc. Owing to the special features of our problem, however, major changes had to be made to state-of-the-art algorithms in order to consider pattern similarity using the Euclidean distance, as the above-mentioned papers either search for identical patterns in the series or consider only patterns of a given length. In identical pattern-searching algorithms, each pattern matches a branch of the tree. In the similar pattern-searching algorithm in question, however, a pattern can match several branches. For example, if the patterns (12, 14, 16, 18) and (12, 14, 16, 19) are considered similar (depending on the algorithm parameters), this must be taken into account when calculating the frequency of the two patterns.
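As a worked check of the example above, and assuming the Euclidean distance named in the text, the two patterns differ only in their last value:

```latex
d(p_1, p_2) = \sqrt{(12-12)^2 + (14-14)^2 + (16-16)^2 + (18-19)^2} = 1
```

The two patterns therefore count as similar whenever the distance threshold of the algorithm is greater than 1, in which case every series containing one of them also supports the other.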

2.1. Data Cleaning and Pre-processing

An optimum, or at least good, preparation of the initial data is crucial for achieving useful results in any DM or discovery task. However, this is often an extremely hard endeavor for a variety of reasons (the sort of task, the problem type, the amount and quality of the data, the noise or even the validity of the data collection procedures used, to name but a few). This means that no standard universally valid procedure can be designed for this stage, so solutions vary substantially from one problem to another.

From a data preparation point of view, we can classify DM problems and applications into three major groups: task driven, goal driven and data driven. Task-driven applications are generically problems for which the initial data sets are collected and built specifically for the target application. The problem here is how to find and collect rather than how to prepare the data, as the entire structures are designed from scratch to fit the objectives. Goal-driven applications are applications that make use of existing and not necessarily the best-suited data sets to achieve a known goal. In this problem group, the researcher knows what to look for but the data is often unsuitable, either semantically or structurally, for the intended task. They have the slight advantage, always from the pre-processing perspective, that expert knowledge can be useful for improving the structure and the content of the data, as the goal is well known. Finally, data-driven applications are problems in which there are no predefined goals and the researcher has to perform some sort of blind search to find any interesting relationship within the data. Jokingly, the best thing about these applications is that if we do not find the best solution, nobody will notice, because nobody knows what to search for.

I4 belongs to the second problem group. The available isokinetic test data sets have been used to assess the physical capacity and injuries of top competition athletes since the early 1990s. An extensive collection of tests and exercises has been gathered since then, albeit unmethodically. Hence, we had a set of heterogeneous, unclassified data files in different formats, some of which were incomplete. On the positive side, the quality of the data within each exercise was unquestionable, as the protocols had been respected in the vast majority of cases, the isokinetic system used was of proven quality and the operating personnel had been properly trained.

A series of tasks, summarized in Fig. 3, had to be carried out before the available data set could be used. The first one involved decoding, as the input came from a commercial application (the isokinetic system) that has its own internal data format.

Fig. 3. Data pre-processing tasks.

Next, the curves had to be evaluated to identify any that were invalid and to remove any irregularities introduced by mechanical factors of the isokinetic system. Two data cleaning tasks were performed, using expert knowledge in both cases:

• Removal of incorrect tests. The goal of this task is to determine that the isokinetic test protocol has been correctly applied. All the exercises defined in the protocol must have been completed successfully in the correct order. Additionally, the strength values must demonstrate that patients exerted themselves during the exercises and, therefore, tired to some degree.

• Elimination of incorrect extensions and flexions. Even if the isokinetic protocol has been correctly implemented, some of the extensions and/or flexions within an exercise may be of no use, owing mainly to lack of concentration by the patient during the exercise. I4 detects extensions and flexions that are invalid because much less effort was employed by the patient than was the case in others.

Once all the exercises have been validated as a whole and each exercise individually, they have to be filtered to remove noise introduced by the machine itself. Again, expert knowledge had to be used to automatically distinguish the irregularities caused by the strength employed by patients from any that are due to the isokinetic machine. So, the strength curves are pre-processed in order to eliminate flexion peaks, that is, maximum peaks produced by machine inertia.


Fig. 4. Format of the similar pattern search tree.

The result of this process is a DB in which tests are homogeneous, consistent and noise free.

2.2. Algorithm for Discovering Similar Patterns

A pattern search tree was built to speed up the pattern-searching algorithm. Each depth level of this tree coincides with the length of each pattern, that is, a branch of depth 2 corresponds to a given pattern of length 2. In algorithms that search for identical patterns (Han et al, 1998), it suffices to store the pattern and a counter of appearances in each node. As there are similar patterns in our case, the list of series in which the pattern appears (SA) and the list of series in which a similar pattern appears (SSA) have to be stored (Fig. 4). This is because pattern similarity is not transitive (i.e., we can have p1 similar to p2, p2 similar to p3 and p1 not similar to p3).
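One possible in-memory shape for such a tree is sketched below. The class and function names (`PatternNode`, `insert`) and the use of sets for SA and SSA are assumptions made for illustration; the source only states which lists each node has to hold.

```python
from dataclasses import dataclass, field


@dataclass
class PatternNode:
    """Node at depth k: the path from the root spells a pattern of length k."""
    value: object = None                          # last value of the pattern
    sa: set = field(default_factory=set)          # series containing the pattern
    ssa: set = field(default_factory=set)         # series containing a similar one
    children: dict = field(default_factory=dict)  # next value -> child node

    def confidence(self, total_series):
        """Share of series that support the pattern, exactly or by similarity."""
        return len(self.sa | self.ssa) / total_series


def insert(root, pattern, series_id):
    """Record that `pattern` (and hence each of its prefixes) occurs in a series."""
    node = root
    for value in pattern:
        node = node.children.setdefault(value, PatternNode(value))
        node.sa.add(series_id)
    return node
```

Keeping SSA per node, rather than merging similar patterns into one node, follows from the non-transitivity noted above: with one-value patterns 10, 12 and 14 and a threshold of 3, for example, 12 is similar to both of the others, yet 10 and 14 are not similar to each other.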

The problem defined in phase 1 of the method of injury identification is set out as follows:

• Given:

– A collection S of time series, composed of sequences of values (usually real numbers or integers), of variable length, where the length of the longest is max-length.

– The value (supplied by the user) of minimum confidence min-conf (number of series in which the pattern appears divided by the total number of series).

– The maximum distance between patterns to be considered similar (max-dist).

• Find:

– All patterns of length i, 0 ≤ i ≤ max-length, that is, identical or similar sequences that appear in S with a confidence greater than or equal to min-conf.

First, patterns that appear in the time series are built in the same manner as an identical pattern-searching algorithm would do. However, it is not enough just to store the number of times the pattern appears in the series to calculate its confidence. It is important to find out in which series the pattern appears in order to be able to analyze its similarity to other patterns. Then the algorithm has to run through the patterns to take into account the appearances attributed to similar patterns. For each pattern p, all the patterns of the same length in any series that are at a lesser distance than threshold max-dist from pattern p are considered similar patterns.
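The two passes described above can be sketched as a brute-force search for a single pattern length. This is an illustrative simplification under stated assumptions: it uses no search tree and no pruning, handles one length at a time, and the function names are invented here; it is not the published algorithm.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def subsequences(series, length):
    """All distinct contiguous subsequences of the given length."""
    return {tuple(series[i:i + length])
            for i in range(len(series) - length + 1)}


def similar_patterns(collection, length, min_conf, max_dist):
    """Return {pattern: confidence} for patterns reaching min_conf."""
    # Pass 1: record in which series each pattern appears exactly (SA).
    sa = {}
    for series_id, series in enumerate(collection):
        for pattern in subsequences(series, length):
            sa.setdefault(pattern, set()).add(series_id)

    # Pass 2: add the series in which a similar pattern appears (SSA).
    result = {}
    for pattern, exact in sa.items():
        ssa = set()
        for other, where in sa.items():
            if other != pattern and euclidean(pattern, other) < max_dist:
                ssa |= where
        confidence = len(exact | ssa) / len(collection)
        if confidence >= min_conf:
            result[pattern] = confidence
    return result
```

With three series, of which one contains (12, 14, 16, 18) and another (12, 14, 16, 19), neither pattern reaches a confidence of 0.6 alone, but with max-dist set to 2 each is credited with the other's series and both reach 2/3.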

Special care has to be taken in the pruning phase not to prune patterns which, although not frequent themselves, play a role in making another pattern frequent. If this sort of pattern were pruned, the algorithm would not be complete, that is, would not find all the possible patterns. Only patterns that are infrequent and whose minimum distance from the other patterns is further than the required distance (max-dist) will be pruned. Having completed the tree-pruning phase, the next level of the tree is generated using the longest patterns. For a full description of this algorithm, see Caraça-Valente and López-Chavarrías (2000).

Fig. 5. Execution time over number of values in the series.

Fig. 6. Execution time over number of series.

This algorithm is suitable for searching a large set of time series of non-homogeneous lengths, finding the patterns (time subsequences of undetermined length) that are repeated in any position of a significant number of series. Therefore, the algorithm will be useful for finding significant patterns that are likely to characterize a set of non-uniform time series, even though important characteristics of these patterns, like length or position within the time series, are unknown.
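The pruning condition stated earlier (a pattern is pruned only if it is infrequent and no other pattern lies within max-dist of it) can be written as a small predicate. The sketch assumes that "infrequent" refers to the confidence computed from exact appearances only, and the name `can_prune` is hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def can_prune(pattern, exact_conf, same_length_patterns, min_conf, max_dist):
    """True only if the pattern is infrequent AND has no similar neighbour.

    An infrequent pattern with a neighbour closer than max_dist is kept,
    because its appearances may still make that neighbour frequent.
    """
    if exact_conf >= min_conf:
        return False
    return all(euclidean(pattern, other) >= max_dist
               for other in same_length_patterns
               if other != pattern)
```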

To test pattern discovery algorithm operation and effectiveness, a series of random curves were generated to simulate isokinetic strength curves. Figure 5 shows how long it takes the algorithm to execute depending on the number of values in the series used.

Algorithm execution time grows linearly as the number of elements in the time series increases. Thus, the number of values within the series would not appear to be too significant as regards algorithm execution time, as it takes the algorithm 15 seconds to execute with series that have 50 values and approximately 40 seconds with series containing 5000 elements.

Algorithm efficiency falls as the number of series increases (as shown in Fig. 6). When the number of series is over 12, the time increases considerably. Thus, an increase in the number of series (as of 12) used to execute the algorithm raises algorithm execution time substantially. Therefore, it is the number of series used and not the number of elements in the series that has the biggest impact on algorithm execution time. However, this is not a serious problem in our domain, as more than 10 series for the same injury and group of patients are seldom found.

Fig. 7. Eight isokinetic exercises performed by injured patients (knee cartilage disease).

Fig. 8. Pattern possibly characteristic of cartilage disease.

It is clear that execution time also grows as max-dist increases, because more patterns will be discovered when the distance is greater. If this distance is increased considerably, the tree cannot be retained in memory, and the time would rise to unexpectedly high levels. On the other hand, execution time is lower when the parameter min-conf increases. This is because the search space is smaller, as more branches are pruned.

A real example of similar pattern-searching algorithm application to pattern detection is shown below. In this case, we had a set of eight exercises completed by injured female patients (knee cartilage disease). This is a feasible number, because it is difficult to find more patients with the same sort of injury for evaluation in a given environment (in this case, Spanish top-competition athletes). The graphs of the exercises used are shown in Fig. 7.

The problem is to detect patterns symptomatic of knee cartilage disease. The similar pattern-searching algorithm was able to identify patterns depending on the parameters used. For example, if threshold min-conf is 0.8 and the max-dist between similar patterns is 50, the similar pattern-searching algorithm finds a number of patterns, the most promising of which is shown in Fig. 8(a). This pattern corresponds to the lower part of the curves, as shown in Fig. 8(b).


Fig. 9. Process for creating reference models.

We then tried to match this pattern against a set of healthy patients’ exercises, and this pattern did not show up. After a positive expert evaluation, we will be able to use this pattern as a symptom of knee cartilage disease.
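The check against healthy patients' exercises can be sketched as a sliding-window scan. The function names and the toy values used to exercise the sketch are illustrative assumptions, not data or code from the study.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def occurs_in(pattern, series, max_dist):
    """True if some window of the series lies within max_dist of the pattern."""
    n = len(pattern)
    return any(euclidean(pattern, series[i:i + n]) < max_dist
               for i in range(len(series) - n + 1))


def is_discriminative(pattern, injured, healthy, min_conf, max_dist):
    """Frequent among injured exercises and absent from all healthy ones."""
    hits = sum(occurs_in(pattern, s, max_dist) for s in injured)
    if hits / len(injured) < min_conf:
        return False
    return not any(occurs_in(pattern, s, max_dist) for s in healthy)
```

A pattern that passes this test is only a candidate symptom; as noted above, it still requires a positive expert evaluation.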

3. Creating Reference Models for Population Groups

One of the most common tasks involved in the assessment of isokinetic exercises is to compare a patient’s test against a reference model created beforehand. These models represent the average profile of a group of patients sharing common characteristics. Figure 9 summarizes the process for creating reference models.

All the exercises done by individuals with the desired characteristics of weight, height, sport, sex, etc., must be selected to create a reference model for a particular population. However, there may be some heterogeneity even among patients of the same group. Some will have a series of particularities that make them significantly different from the others. Take a sport like American football, for instance, where players have very different physical characteristics. Here, it will make no sense to create a model for all the players, and individual models would have to be built for each subgroup of players with similar characteristics. Therefore, exercises have to be discriminated and the reference model has to be created using exercises among which there is some uniformity.

An expert in isokinetics used to be responsible for selecting the exercises that were to be part of the model. It is not easy to manually discard exercises that differ considerably from others, so in practice this was seldom done. The idea we aim to implement is to automatically discard all the exercises that are different and create a model with the remainder.

The problem of comparing exercises can be simplified using the discrete Fourier transform to transfer the exercises from the time domain to the frequency domain. The fact that most of the information is concentrated in the first components of the discrete Fourier transform will be used to discard the remainder and simplify the problem. The advantage of the discrete Fourier transform is that there is an algorithm, known as the fast Fourier transform, that can calculate the required coefficients in a time of the order of O(n log n) when the number of data items is a power of 2. In our case, we restricted the strength values of the series to 256 values, which is roughly two leg flexions and extensions and, therefore, sufficient to characterize each exercise.
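A minimal sketch of this idea follows: a recursive radix-2 fast Fourier transform for power-of-2 lengths, and a distance computed on the first k coefficients only. The function names and the 1/sqrt(n) scaling are assumptions made for the example. With that scaling, Parseval's theorem guarantees that the truncated distance never exceeds the Euclidean distance in the time domain, which is what makes it usable as a cheap filter.

```python
import cmath
import math


def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def first_coefficients(x, k):
    """First k Fourier coefficients, scaled so that energy is preserved."""
    scale = math.sqrt(len(x))
    return [c / scale for c in fft(x)[:k]]


def prefix_distance(a, b, k=4):
    """Distance between two exercises using only their first k coefficients."""
    fa, fb = first_coefficients(a, k), first_coefficients(b, k)
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(fa, fb)))
```

For a 256-value exercise the comparison then involves a handful of complex numbers instead of 256 strength values.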


The time it takes to make the comparisons is drastically reduced using this technique, a very important factor in this case, since there are a lot of exercises for comparison in the database and comparison efficiency is important.

Once the user has selected all the tests of the patient population to be modeled, there is a preliminary pre-processing phase, also affecting the other system modules, which removes the irregularities in the graph caused by system inertia rather than the patient’s strength. Furthermore, any full extensions and flexions incorrectly performed by the patient are also removed. Then the actual process for creating a new reference model begins, which is as follows:

1. Calculate the fast Fourier transform of all the 256-value exercises.

2. Group the first four coefficients of the Fourier transform of these exercises using a clustering algorithm. These four coefficients are enough to represent the curve trends. Thus, the groups of similar exercises are clearly identified as they are grouped into different classes.

3. Normally, users intend to create a reference model for a particular group, and there is clearly a majority group of similar exercises, which represents the standard profile that the user is looking for, and a scattered set of groups of one or two exercises. The former are used to create a reference model, in which all the common characteristics of the exercises are unified.

The first step for creating the actual model is normalization. This step levels out the size of the different isokinetics curves and adjusts the times when patients exert zero strength (switches from flexion to extension and vice versa), as these are singular points that should coincide in time. The second step is to calculate the mean value of the curves point to point. Finally, we apply the pattern discovery algorithm to the group of exercises and add every pattern found as an attribute of the model. This last step is taken because, when calculating the average curve, patterns may be smoothed if they appear at different moments in time in the exercises.

4. An isokinetic exercise for a patient, or a set of isokinetic exercises, can later be compared with the models stored in the database. Thus, we will be able to determine the group to which patients belong or should belong and identify what sport they should go in for, what their weaknesses are with a view to improvement or how they are likely to progress in the future.
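The core of steps 1 to 3 can be sketched as follows. Several simplifications are assumptions of this example rather than details from the source: the clustering algorithm is not named in the text, so a greedy single-pass scheme with a hypothetical `radius` parameter stands in for it; the first four coefficients are computed with a direct transform instead of the fast Fourier transform; and the normalization and pattern discovery steps are omitted.

```python
import cmath
import math


def dft_prefix(x, k=4):
    """First k discrete Fourier coefficients, computed directly."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, v in enumerate(x)) / math.sqrt(n)
            for f in range(k)]


def feature_distance(fa, fb):
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


def cluster(features, radius):
    """Greedy clustering: join the first cluster whose seed is within radius."""
    clusters = []  # each cluster is a list of indices; index 0 is the seed
    for i, feat in enumerate(features):
        for members in clusters:
            if feature_distance(feat, features[members[0]]) < radius:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def reference_model(exercises, radius):
    """Return (indices of the majority group, its point-to-point mean curve)."""
    features = [dft_prefix(e) for e in exercises]
    majority = max(cluster(features, radius), key=len)
    members = [exercises[i] for i in majority]
    mean_curve = [sum(vals) / len(vals) for vals in zip(*members)]
    return majority, mean_curve
```

A new exercise could then be assigned to the stored model whose first coefficients are closest to its own, which corresponds to the comparison described in step 4.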

Figure 10 shows an example of how the models can be used by sport physicians to draw useful conclusions in their everyday practice. Particularly, it shows the results of the comparison between an isokinetic exercise (right part of the screenshot) and a given reference model (left part of the screenshot). Similar regions are highlighted for rapid location. Again, we use the Fourier transform for comparison of the exercise and the model. As explained earlier, the comparison is more efficient in this way, and it is easier to make partial or total comparisons.

4. Evaluation of the Discovered Knowledge

The discovered knowledge was rather difficult to evaluate as there was no background knowledge concerning the behavior of most of the populations under study. Furthermore, there are very few experts in this field, and even acknowledged experts experience great difficulty in assessing the quality of a model, as there are no well-known widely accepted models against which a comparison can be made.

Fig. 10. Comparison of graphs.

Therefore, the evaluation process focused on the following goals:

• Goal 1: Verify whether the models obtained actually represented the populations for which they had been created and validate how representative the models were.

• Goal 2: Validate their fitness for achieving the selected goals: pattern-based injury detection and model-based population characterization.

The sources of knowledge for both evaluations were confined to the experts, the cases database, the few existing models and everyday practice. So, the whole process had to be carefully planned as a long-term evaluation, which is still ongoing. A five-step procedure was used for both evaluation goals:

1. Subjective appraisal of the results by the expert. The aim of this step was to get a rough idea of the possible quality of the results.

2. Statistical tests comparing the results with previous known cases. These tests were very limited because the only available sources for this task were the cases themselves and the few existing models.

3. Turing test-based validation tests, in which the effectiveness of the discovered knowledge was compared against the expert. This task was planned to get a clear idea of the strength of the results when applied in everyday practice.

4. Continuous daily evaluation with real-life cases. This is a corrective stage and will continue throughout the research project life cycle.

5. Evaluation of satisfaction. This is an important part of any applied knowledge discovery process evaluation, which is often overlooked. Its goal is to gain an understanding of the feelings of practitioners when the new technology derived from DM is transferred to everyday practice. The information obtained here is indispensable for defining new tasks or research lines and for getting accurate data about the potential of technology of this sort within the target domain.

The first evaluation step took place during the actual knowledge discovery in the database process. The goal was to detect any significant deviation of the partial results obtained during DM from the results expected by the practitioners. This evaluation step turned out to be much more important than expected due to the results obtained, and enabled us to identify new tasks to correct the deviations at an early stage.

Regarding the creation of population reference models, the absence of background knowledge made it very difficult for the experts to evaluate how representative the models were. Therefore, the efforts focused on testing the quality of the partial models using a small set of carefully selected cases. These contained typically representative cases of each population, plus a small group of atypical occurrences and some instances that did not even belong to the population. The results were as follows:

• Typical cases – they were all well characterized by the reference models.

• Atypical cases – the results were inconclusive, as 50% of the cases were similar to a reference model and the remainder were not.

This early evaluation yielded even better information for the injury detection problem. The method involved the experts, who were asked for a justified evaluation of each injury pattern. If positive, they should provide at least five examples that confirmed the pattern; if negative, they had to provide the same number of negative examples.

As a result, we were able to identify some strengths and weaknesses of the DM methods. The first outcome was related to the detection process as a whole. The planned procedure involved matching any new case against the discovered injury patterns. The problem with this kind of operation is that not every single injury has a pattern, so a few injury cases were detected as perfectly normal. There are some possible explanations for this effect, but the most critical interpretation is that the patterns for these injuries belong to rather atypical cases (minorities) that will never be discovered by the system.

This is an extremely tricky matter in medical domains and cannot be ignored. The solution was to change the actual pattern application procedure. Apart from finding patterns for cases with injuries, another disjoint set of patterns for normal cases was to be found, and any new case was to be compared with both sets. If the case matched neither, a possible exceptional case would be identified. This procedure met the needs of the experts.
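The revised procedure can be sketched as follows. This is a hedged illustration rather than the project's code: the sliding-window Euclidean matcher, the use of a single distance threshold, and the rule that an injury match takes precedence when a case matches both sets are all assumptions made for the example, and the function names are hypothetical.

```python
import math


def contains_pattern(series, pattern, max_dist):
    """True if some window of `series` lies within max_dist of `pattern`."""
    m = len(pattern)
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - p) ** 2 for w, p in zip(window, pattern)))
        if dist <= max_dist:
            return True
    return False


def classify(series, injury_patterns, normal_patterns, max_dist):
    """Compare a new case with both pattern sets.

    Assumption: an injury match takes precedence over a normal match.
    A case matching neither set is flagged as a possible exceptional case.
    """
    if any(contains_pattern(series, p, max_dist) for p in injury_patterns):
        return "injured"
    if any(contains_pattern(series, p, max_dist) for p in normal_patterns):
        return "normal"
    return "exceptional"


if __name__ == "__main__":
    injury = [[5, 9, 5]]
    normal = [[1, 2, 3]]
    for case in ([0, 1, 2, 3, 0], [0, 5, 9, 5, 0], [7, 7, 7, 7]):
        print(case, classify(case, injury, normal, max_dist=0.5))
```

The point of the third outcome is that a rare injury with no discovered pattern is no longer reported as normal by default; it is passed to the physician as a case that fits neither set.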

Another potential problem that was detected early on was a consequence of the DM method used and had to do with the two parameters max-dist and min-conf. As no background knowledge was available for use, the optimal values for both parameters in this domain had to be determined empirically, and the quality of the results would depend on this. However, it was not possible to overcome this obstacle at this stage.

The second evaluation step was intended to empirically check the results and, at the same time, determine the best values for the parameters max-dist and min-conf. Five scenarios were defined using the same database. Twenty per cent of the cases in each scenario were randomly chosen for testing, while the other 80% were used to create the patterns. For each scenario, 25 different sets of patterns were generated, one for each combination of values of min-conf (0.60, 0.70, 0.75, 0.80 and 0.90) and max-dist (100, 150, 175, 200 and 250) for series of period 500. The best combinations proved to be those where min-conf was 0.75 or 0.80 and max-dist was 175 or 200 (for period 500). In these cases, the results were very similar, and about 85% of the test cases were found to contain the respective patterns.
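The parameter sweep for one scenario can be sketched as below. The grid values and the 80/20 split come from the text; everything else is an assumption for illustration. In particular, `discover` and `contains` are hypothetical callables standing in for the pattern discovery algorithm and the pattern matcher, which are not reproduced here.

```python
import itertools
import random

MIN_CONF = (0.60, 0.70, 0.75, 0.80, 0.90)
MAX_DIST = (100, 150, 175, 200, 250)


def split_cases(cases, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the cases for testing."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training cases, held-out cases)


def grid_search(train, held_out, discover, contains):
    """Score every (min_conf, max_dist) pair by the share of held-out cases
    that contain at least one pattern discovered on the training cases."""
    scores = {}
    for min_conf, max_dist in itertools.product(MIN_CONF, MAX_DIST):
        patterns = discover(train, min_conf, max_dist)      # hypothetical callable
        hits = sum(
            1 for case in held_out
            if any(contains(case, p, max_dist) for p in patterns)  # hypothetical callable
        )
        scores[(min_conf, max_dist)] = hits / len(held_out) if held_out else 0.0
    return scores
```

Running this once per scenario and comparing the score tables is one way to see which region of the grid is stable across random splits, which is the kind of evidence the paragraph above relies on.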


Table 1. Evaluation of injury detection

Test cases            System                                 Expert                        Novice
15 common injuries    15 OK                                  15 OK                         Failed 2
5 uninjured           4 OK and 1 don’t know (no mistakes)    5 OK                          3 OK (no mistakes)
5 rare injuries       Detected as rare cases (no mistakes)   2 mistakes and 1 don’t know   3 don’t knows

Table 2. Evaluation of reference models creation

Test cases         System        Expert        Novice
30 members         4 mistakes    9 mistakes    21 mistakes
10 non-members     No mistakes   No mistakes   No mistakes
10 unclassified    No mistakes   2 mistakes    2 mistakes

For the reference model domain, tests were run in two stages. Cross-validation was used in stage 1. During stage 2, the models were compared against those built by the knowledge-based system that was developed using expert knowledge. The results were extremely good in both cases. During cross-validation, a precision of 0.98 was achieved. The 2% of misclassifications happened to be cases that were not initially classed as belonging to the population but shared all the characteristics typical of the population. This happens because many features of athletes from different sports are the same (for instance, a volleyball player may be very similar to a basketball player). About 4% of the cases that were not recognized as belonging to the population were in fact members of it. They were all representative of minorities within the population and did not share its typical characteristics. These results far surpass the preliminary models.
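The two kinds of error mentioned above correspond to different measures: cases wrongly accepted as members lower precision, whereas members that go unrecognized lower recall. The sketch below shows how both are computed from membership decisions. The toy labels are purely illustrative and are not data from the study, and the function name is hypothetical.

```python
def precision_recall(predicted, actual):
    """Precision and recall for boolean membership decisions.

    predicted[i] is True if the model accepts case i as a member;
    actual[i] is True if case i really belongs to the population.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)   # wrongly accepted
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)   # missed members
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative labels only (not taken from the paper).
    predicted = [True, True, True, False, False]
    actual = [True, True, False, True, False]
    print(precision_recall(predicted, actual))
```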

The third step of the evaluation process was to compare the effectiveness of the discovered knowledge against expert knowledge. The most straightforward means of carrying out such a comparison is to use an approach based on the Turing test (Gupta, 1991).

The information supplied to the system, the expert and the novice physician was exactly the same (an isokinetic test) in order to ensure that the test was meaningful. This test was repeated for 25 occurrences in the injury problem (15 with common injuries, 5 with rare injuries and 5 without injuries at all), and 50 in the reference model problem (30 belonged to different kinds of population, 10 did not and 10 had not been classified at all). The results are shown in Tables 1 and 2.
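As a small worked example, the mistake counts reported in Table 2 can be turned into overall error rates over the 50 reference-model cases. The counts are copied from the table; the dictionary layout and the function name are just illustrative choices.

```python
# Mistakes per group, copied from Table 2: (members, non-members, unclassified).
MISTAKES = {
    "System": (4, 0, 0),
    "Expert": (9, 0, 2),
    "Novice": (21, 0, 2),
}
TOTAL_CASES = 30 + 10 + 10


def error_rate(mistakes, total=TOTAL_CASES):
    """Share of the test cases on which mistakes were made."""
    return sum(mistakes) / total


if __name__ == "__main__":
    for actor, counts in MISTAKES.items():
        print(f"{actor}: {sum(counts)} mistakes, error rate {error_rate(counts):.2f}")
```

On these figures the system errs on 8% of the cases, the expert on 22% and the novice on 46%.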

Although the information provided to each of the actors was exactly the same (the isokinetic data of the individuals), physicians use other sources in everyday practice, which explains some of the mistakes made by the experts.

The project has produced two applications (Caraca-Valente et al., 2000; Alonso et al., 2001b): ES for Isokinetics Interpretation (ISOCIN), intended for any kind of patient, and ES for Interpreting Isokinetics in Sport (ISODEPOR), which is being used at the National High Performance Center to evaluate the muscle strength of Spanish top-competition athletes.


5. Conclusions

Some important conclusions can be drawn from the results after the evaluation:

• The pattern-searching algorithm has proved to be useful for injury characterization and, therefore, has the potential of becoming an effective tool in routine medical practice in the near future. Also, the model creation module considerably improves model quality, permitting an improved assessment of athletes and a better characterization of their potential.

• Early evaluation is fundamental. The main problems were detected in a phase in which they could be easily prevented, corrected or, at least, identified for further testing and research.

• The minorities problem is crucial for diagnosis domains. Although the I4 approach finds a way around the problem, a DM method able to detect and find patterns or models within minorities would be an important advance.

• The values of the above-mentioned empirical parameters are valid for this particular domain, but if the method is used for other similar problems their values should be recalculated in a similar way to what we did for the injury problem.

• The results are not error free, but the number of mistakes found during testing is considerably lower than the threshold that is acceptable to practitioners.

• Finally, physicians are now using the two applications of the system (ISOCIN and ISODEPOR) as a powerful tool in everyday practice. This is the best proof of the quality of the final results.

References

Agrawal R, Srikant R (1994) Mining sequential patterns. In Proceedings of the 1994 international conference on very large data bases, Santiago, Chile, pp 487–499

Agrawal R, Srikant R (1995) Fast algorithms for mining association rules. In Proceedings of the 11th international conference on data engineering. IEEE Computer Society Press, Taiwan, pp 3–14

Agrawal R, Faloutsos C, Swami A (1993) Efficient similarity search in sequence databases. In Proceedings of the 4th international conference on foundations of data organisations and algorithms (FODO), Chicago, IL, October 13–15, 1993

Alonso F, Caraca-Valente JP, Lopez-Illescas A, Martínez L, Montes C (2001a) Analysis of strength data based on expert knowledge. LNCS 2199 Medical Data Analysis, Springer, Berlin, pp 35–41

Alonso F, Caraca-Valente JP, Martínez L, Montes C (2001b) Discovering similar patterns for characterizing time series in a medical domain. In Proceedings of data mining, IEEE International Conference, San Jose, CA, pp 577–579

Beckman N, Kriegel HP, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust method for points and rectangles. ACM SIGMOD, pp 322–331

Caraca-Valente JP, Lopez-Chavarrías I (2000) Discovering patterns in time series. In Proceedings of the 6th international conference on knowledge discovery and data mining (KDD), Boston, MA, pp 497–505

Caraca-Valente JP, Lopez-Chavarrías I, Montes C (2000) Functions, rules and models: three complementary techniques for analyzing strength data. In Proceedings of the 15th annual symposium on applied computing (SAC 2000), Vol 1, Villa Olmo, Como, Italy, 19–21 March, pp 60–64

Chan KP, Fu A (1999) Efficient time series matching by wavelets. In Proceedings of the international conference on data engineering, Sydney, Australia, pp 126–133

Faloutsos C, Ranganathan M, Manolopoulos Y (1994b) Fast subsequence matching in time series databases. In Proceedings of SIGMOD ’94, Minneapolis, MN, pp 419–429

Geurts P (2001) Pattern extraction for time series classification. In LNAI no. 2168 (Proceedings of PKDD 2001), pp 115–127


Gupta U (ed) (1991) Validating and verifying knowledge based systems. IEEE Computer Society Press, Los Alamitos, CA

Han J, Dong G, Yin Y (1998) Efficient mining of partial periodic patterns in time series database. In Proceedings of the 4th international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, CA, pp 214–218

Kahveci T, Singh A (2001) Variable length queries for time series data. In Proceedings of the international conference on data engineering, Heidelberg, Germany, pp 273–282

Mannila H, Toivonen H, Verkamo A (1995) Discovering frequent episodes in sequences. In Proceedings of the 1st international conference on knowledge discovery and data mining, Montreal, Canada, pp 210–215

Park S, Chu W, Yoon J, Hsu C (2000) Efficient searches for similar subsequences of different lengths in sequence databases. In Proceedings of the international conference on data engineering, San Diego, CA, pp 23–32

Perng C, Wang H, Zhang S, Parker DS (2000) Landmarks: a new model for similarity-based pattern querying in time series databases. In Proceedings of the international conference on data engineering, San Diego, CA, pp 33–44

Povinelli R (1999) Times series data mining: identifying temporal patterns for characterization and prediction of time series. PhD thesis, Milwaukee, WI

Author Biographies

Fernando Alonso is Professor of Computer Science and Artificial Intelligence at the Universidad Politecnica de Madrid (UPM), from which he received his PhD. At present he is R&D director at the UPM’s Center of Computing and Communications Technology Transfer (CETTICO). He has previously held several management posts at the Spanish Ministry of Education and Science’s Data Processing Center. He is the author of several books on programming methodology as well as over 40 papers on software and knowledge engineering. His research interests lie in the design and application of knowledge and software engineering methodologies, techniques and benchmarks to improve the quality of life of the disabled, especially of visually impaired people.

Juan Pedro Caraca-Valente is Associate Professor of Databases at the Universidad Politecnica de Madrid. He received his PhD from the Department of Artificial Intelligence. He has been the principal investigator of the I4 project for the last three years. His research interests are intelligent systems and, chiefly, data mining applied to the analysis of time series.

Loïc Martínez is an Assistant Professor of Computer Science at the Universidad Politecnica de Madrid (UPM), from which he received his computer science degree. He is currently working on his PhD thesis on the analysis phase in software development methodologies. He is a researcher at CETTICO’s Department of Computer Technology Transfer for the Disabled (SETIAM) based at the UPM, working on the development of software systems for the disabled. His research interests lie in the development of software adapted for the disabled, man–machine interaction model, and methodologies in software and knowledge engineering.


Cesar Montes holds a PhD in Computer Science from the Universidad Politecnica de Madrid. He is Associate Professor of the Artificial Intelligence Department and since 1990 has led CETTICO’s Department of Computer Technology Transfer for the Disabled. His research interests include machine learning and data analysis, intelligent interface models and adaptive models. Author of three books and over 40 scientific publications, he is a member of the European Network of Excellence for Intelligent Information Interfaces, an initiative of the European Community Esprit program for Long-Term Research Projects.

Correspondence and offprint requests to: F. Alonso, Department of Languages and Systems, Polytechnic University Madrid, Campus de Montegancedo, 28660 Boadilla del Monte, Spain. Email: [email protected]
