Discovering hybrid temporal patterns from sequences consisting of point- and interval-based events

Shin-Yi Wu a,*, Yen-Liang Chen b

a Industrial Technology Research Institute, Hsinchu 310, Taiwan, ROC
b Dept. Information Management, National Central University, Chung-Li 320, Taiwan, ROC

* Corresponding author. Tel.: +886 3 5914010; fax: +886 3 5820085. E-mail address: [email protected] (S.-Y. Wu).

Data & Knowledge Engineering 68 (2009) 1309–1330. Journal homepage: www.elsevier.com/locate/datak
0169-023X/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2009.06.010

Article history: Received 2 April 2008. Received in revised form 25 June 2009. Accepted 26 June 2009. Available online 5 July 2009.

Keywords: Data mining; Hybrid temporal pattern; Temporal pattern; Sequential pattern; Hybrid event sequences

Abstract

Previous sequential pattern mining studies have dealt with either point-based event sequences or interval-based event sequences. In some applications, however, event sequences may contain both point-based and interval-based events. These sequences are called hybrid event sequences. Since the relationships among both kinds of events are more diversiform, the information obtained by discovering patterns from these events is more informative. In this study we introduce a hybrid temporal pattern mining problem and develop an algorithm to discover hybrid temporal patterns from hybrid event sequences. We carry out an experiment using both synthetic and real stock price data to compare our algorithm with the traditional algorithms designed exclusively for mining point-based patterns or interval-based patterns. The experimental results indicate that the efficiency of our algorithm is satisfactory. In addition, the experiment also shows that the predicting power of hybrid temporal patterns is higher than that of point-based or interval-based patterns.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Sequential pattern mining is an important data mining technique that can be used to help make decisions in a variety of applications [17,16,14]. This technique can be utilized to discover the sequential patterns that occur frequently in a huge sequence database [2,27]. For example, a typical sequential pattern that shows up in video rentals [2] is that customers will rent a series of movies in a certain order, for example "Star Wars", followed by "The Empire Strikes Back", and finally "Return of the Jedi." As defined in sequential pattern mining, this pattern is supported by a customer sequence when the customer rents the items in the above-mentioned order (although not necessarily consecutively). When a pattern is supported by at least a min_sup (minimum support, a user-specified threshold) percentage of customer sequences, we say that this pattern is frequent (or large). In other words, frequent sequential patterns are patterns which have occurred frequently in past experience. Sequential patterns can be helpful in making crucial decisions and used to predict future events.

Since sequential pattern mining is so valuable, it has been studied by many researchers. Recent studies include: (1) improved algorithms [31,34,18,42,23]; (2) constraint-based sequential pattern mining [15,28,37,6,33,10]; (3) incremental sequential pattern mining [36,45,26,8,41]; (4) mining variants of sequential patterns, including maximum sequential patterns [2,40], similar sequential patterns [30,4], closed sequential patterns [39,35,9,41,5], iterative patterns [25], and fuzzy sequential patterns [21,19,24,7]; (5) mining sequential patterns from different sources [46]; (6) storage and querying




methods for sequential patterns [29]; (7) mining patterns from interval-based event sequences [22,38,20]; (8) mining patterns from sequences with point- and interval-based events [12,11] and this work; and many others.

The eight types of research mentioned above can be further divided into three categories: (1) mining patterns from point-based event sequences; (2) mining patterns from interval-based event sequences; and (3) mining patterns from hybrid event sequences (sequences consisting of point- and interval-based events). The point-based category includes types (1) through (6). In this category, researchers deal with point-based event data, wherein a sequence is a series of events or items that occur at specific time points.

The second category, the interval-based category, includes type (7). In this category, the patterns are discovered from interval-based event sequences. Events in this domain last for periods of time instead of happening at specific points in time, and the starting and ending times of these interval-based events are known and stored in databases. Temporal patterns, which are frequent subsequences consisting of interval events, can be discovered from these interval-based event sequences [22,38]. For example, in a hospital database, each disease a certain patient suffers from can be viewed as an interval-based event, and a temporal pattern may be that patients frequently start a "fever" when they start to "cough" and all these symptoms occur when they catch the flu.

Hoppner [20] also worked on interval-based event sequences, but aimed at discovering temporal pattern rules rather than the temporal patterns targeted in Refs. [22,38]. These rules answer the question of how often pattern B will occur in a given sequence if pattern A has already occurred. This method has been shown to be useful in time series problems.

Type (8) is included in the third category. As mentioned previously, sequential patterns are discovered using either point-based or interval-based approaches. In some applications, however, events are neither purely point-based nor purely interval-based; data sequences may include both kinds of events. Meteorological phenomena are one example: thunder and lightning are point-based events, while rain, snow, and sunshine are interval-based events. A common example of a meteorological hybrid temporal pattern is "lightning (point-based event) followed by thunder (point-based event), both of which happen during a rain storm (interval-based event)". This pattern, consisting of point- and interval-based events, can be called a hybrid temporal pattern. If it is supported by sufficient hybrid event sequences, it is a frequent hybrid temporal pattern. In this study we develop an algorithm that can be used to discover all frequent hybrid temporal patterns from a set of hybrid event sequences given a threshold min_sup. These patterns are more informative than the patterns discovered by either point-based methods or interval-based methods alone.

There has been some previous research on hybrid event sequences. For example, Amo et al. proposed a constraint-based method to discover hybrid temporal patterns [12,11]. Constraint-based methods, such as SPIRIT [15] in the point-based category and MILPRIT* [12,11] in the hybrid category, allow the user to reduce the search space by setting regular expression constraints on the patterns. In our study, we focus on discovering hybrid temporal patterns by employing the embedding store technique, which has been used in other mining problems [1,42,43]. Since doing so can reduce the number of database scans, our proposed method is expected to perform better than MILPRIT* when no pattern constraint is imposed. Further comparisons between MILPRIT* and our method are discussed in Section 2.4.

1.1. Applications of hybrid temporal patterns

Hybrid temporal patterns have many applications. In meteorology, we can use the discovered hybrid temporal patterns to predict typhoons, earthquakes, or even tsunamis. In the financial domain, the relationships among similar, contiguous subsequences were established in a recent study [13]. In this paper, fluctuations of stock indexes are treated as interval-based events, while announcements of cash dividends and stock splits are treated as point-based events. The financial hybrid temporal patterns are discovered to help people determine when to buy or sell stocks. Similarly, in the world of medicine, most diseases are interval-based, while treatments are often point-based. Medical hybrid temporal patterns describe the relations between diseases and treatments. Without hybrid temporal pattern mining, we can only find the relations among point-based events or among interval-based events, which may lead to incorrect decisions due to incomplete knowledge. Clearly, hybrid temporal pattern mining is useful and necessary in diverse applications.

1.2. Paper organization

Although mining hybrid temporal patterns is a significant problem, to the best of our knowledge, very few studies have considered it. In Section 2, related work is introduced, and the differences between the problems addressed there and the hybrid temporal pattern mining problem are explained. In Section 3, we formally define the hybrid temporal pattern mining problem. Based on the definitions, we propose the algorithm, named HTPM, for mining hybrid temporal patterns in Section 4. Performance evaluations using both synthetic and real data sets are given in Section 5, while conclusions are drawn in Section 6.

2. Related works

The sequential pattern mining (point-based), temporal pattern mining (interval-based), and hybrid temporal pattern mining (hybrid) problems are quite different. In this section, we discuss these mining problems and the main methods used to resolve them and point out the differences among them.


2.1. Sequential pattern mining

In sequential pattern mining, patterns of point-based events that frequently occur in databases are discovered given a set of sequences and a threshold, the minimum support. This mining problem was first proposed by Agrawal and Srikant in 1995 [2]. GSP [34], PrefixSpan [31] and SPADE [42], which are probably the most popular methods for solving this problem, are introduced in this subsection.
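As a concrete illustration of support and the minimum support threshold, the following minimal Python sketch counts how many sequences contain a pattern as a (not necessarily consecutive) subsequence. The function names and the toy rental data are assumptions made for this illustration and do not come from any of the cited algorithms.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if the items of `pattern` occur in `sequence` in the same order,
    not necessarily consecutively."""
    it = iter(sequence)
    # `item in it` advances the iterator, so later items must appear later.
    return all(item in it for item in pattern)


def support(pattern: Sequence[str], database: List[Sequence[str]]) -> float:
    """Fraction of sequences in `database` that support `pattern`."""
    if not database:
        return 0.0
    hits = sum(1 for seq in database if is_subsequence(pattern, seq))
    return hits / len(database)


def is_frequent(pattern: Sequence[str], database: List[Sequence[str]],
                min_sup: float) -> bool:
    """A pattern is frequent (large) when its support reaches min_sup."""
    return support(pattern, database) >= min_sup


# Hypothetical customer sequences, in rental order.
rentals = [
    ["Star Wars", "Alien", "The Empire Strikes Back", "Return of the Jedi"],
    ["Star Wars", "The Empire Strikes Back", "Return of the Jedi"],
    ["Alien", "Star Wars"],
    ["The Empire Strikes Back", "Star Wars", "Return of the Jedi"],
]
trilogy = ["Star Wars", "The Empire Strikes Back", "Return of the Jedi"]

print(support(trilogy, rentals))           # 0.5
print(is_frequent(trilogy, rentals, 0.5))  # True
```

Only the first two customers rent the three films in the required order, so the pattern is frequent for any min_sup up to 50%.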

2.1.1. GSP

GSP (Generalized Sequential Pattern) [34] is an Apriori-based algorithm for mining sequential patterns. The basic steps in GSP are: (1) candidate pattern generation; and (2) frequent pattern generation. In the candidate pattern generation phase, $C_k$, the set of candidate patterns with length $k$, is generated by combining two promising large $(k-1)$-patterns in $L_{k-1}$. In the frequent pattern generation phase, all candidate patterns in $C_k$ are examined to see if their frequencies are larger than or equal to the specified minimum support threshold. To determine the supports of these candidate patterns, one database scan is needed. If a candidate $k$-pattern satisfies the minimum support threshold, it becomes a large $k$-pattern. The two steps are executed iteratively until no further patterns can be generated.
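The two alternating phases can be sketched as follows. This is a simplified illustration rather than the GSP algorithm of [34]: it assumes every event is a single item (no itemsets, no time constraints), and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def contains(seq: Sequence[str], pattern: Pattern) -> bool:
    """True if `pattern` is an ordered (possibly non-contiguous) subsequence of `seq`."""
    it = iter(seq)
    return all(item in it for item in pattern)


def count_supports(candidates: List[Pattern],
                   database: List[Sequence[str]]) -> Dict[Pattern, int]:
    """One pass over the database counts the support of every candidate."""
    counts = {c: 0 for c in candidates}
    for seq in database:
        for c in candidates:
            if contains(seq, c):
                counts[c] += 1
    return counts


def generate_candidates(large: List[Pattern]) -> List[Pattern]:
    """Join p and q when p without its first item equals q without its last item."""
    return sorted({p + (q[-1],) for p in large for q in large if p[1:] == q[:-1]})


def level_wise_mine(database: List[Sequence[str]], min_sup: float) -> Dict[Pattern, int]:
    """Alternate candidate generation and support counting, one length at a time."""
    min_count = min_sup * len(database)
    candidates = sorted({(item,) for seq in database for item in seq})
    frequent: Dict[Pattern, int] = {}
    while candidates:
        counts = count_supports(candidates, database)   # one scan per level
        large = [c for c in candidates if counts[c] >= min_count]
        frequent.update({c: counts[c] for c in large})
        candidates = generate_candidates(large)
    return frequent


db = [list("abc"), list("abc"), list("acb"), list("ab")]
for pattern, count in sorted(level_wise_mine(db, 0.5).items()):
    print(pattern, count)
```

The loop performs one database pass per pattern length, which is the cost that the embedding store technique mentioned in the Introduction is meant to avoid.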

2.1.2. PrefixSpan

PrefixSpan [31] uses a divide-and-conquer strategy to solve the sequential pattern mining problem. First, the database is scanned to find the frequent 1-patterns $(L_1)$. Second, suppose there are $|L_1|$ patterns in $L_1$. The original database is divided into $|L_1|$ partitions, where each partition is the projection of the sequence database with respect to the corresponding 1-pattern. Third, similar to the first step, each partition is treated as the original one and all large 1-patterns in this partition are found. Appending these large 1-patterns, say $b$, to the original prefix, say $a$, will generate frequent patterns $a' = a + b$, with the length increased by one. In this way, the prefixes are successfully extended. Finally, steps two and three are run recursively until the prefixes can no longer be extended. In this way all frequent sequential patterns are obtained.
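The recursion over projected databases can be sketched as below. This is a simplified illustration, not the implementation of [31]: events are assumed to be single items, projections are materialised as suffix copies rather than pointers, and the names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def prefixspan(database: List[Sequence[str]], min_count: int) -> Dict[Pattern, int]:
    """Return every frequent pattern with its support count."""
    results: Dict[Pattern, int] = {}

    def mine(prefix: Pattern, projected: List[Tuple[str, ...]]) -> None:
        # Count each item once per projected sequence (frequent 1-patterns).
        counts: Dict[str, int] = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            new_prefix = prefix + (item,)          # extended prefix
            results[new_prefix] = count
            # Project: keep what follows the first occurrence of `item`.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(new_prefix, new_projected)

    mine((), [tuple(s) for s in database])
    return results


db = [list("abc"), list("abc"), list("acb"), list("ab")]
for pattern, count in sorted(prefixspan(db, 2).items()):
    print(pattern, count)
```

Each recursive call only looks at the suffixes that follow the current prefix, so no candidate patterns are generated and tested against the full database.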

2.1.3. SPADE

In the above-mentioned sequential pattern mining methods a horizontal database is used to store event sequences. The horizontal database consists of a set of point-based event sequences, where each event is represented by three attributes, SID (sequence id), EID (event id) and time (event occurring time). SPADE [42], in contrast, uses a vertical database, in which every entry stores the id-list of an item, that is, the ids of the sequences containing the item. The vertical database enables SPADE to check support via simple id-list joins, which are similar to set intersections.

The main steps of SPADE include: (1) generating frequent 1-patterns $(L_1)$ and frequent 2-patterns $(L_2)$; (2) decomposing the original search space (lattice) into prefix-based parent equivalence classes; and (3) enumerating all other frequent patterns via the depth first search method or the breadth first search method. In the third step, patterns are generated based on lattice theory: two $k$-patterns with the same prefix (their first $(k-1)$ items are of the same EID) are joined to generate $(k+1)$-patterns. For example, if we join $\langle a,b \rangle$ and $\langle a,c \rangle$, there are three possible outcomes: $\langle a,(b,c) \rangle$, $\langle a,b,c \rangle$, and $\langle a,c,b \rangle$. When computing the new id-list of $\langle a,(b,c) \rangle$, it is only necessary to check whether the SID and EID are equal. When computing the new id-list of $\langle a,b,c \rangle$ or $\langle a,c,b \rangle$, it is also necessary to check whether the occurrence times recorded for $\langle a,b \rangle$ and $\langle a,c \rangle$ are in the right order.
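The two kinds of id-list join can be sketched as follows. This is a simplified illustration under stated assumptions, not SPADE's actual data structures: an id-list is assumed to be a list of (SID, EID) pairs giving the sequence and the event time of the pattern's last item, and the function names and sample id-lists are hypothetical.

```python
from typing import Dict, List, Tuple

IdList = List[Tuple[int, int]]  # (SID, EID of the pattern's last item)


def equality_join(left: IdList, right: IdList) -> IdList:
    """Id-list of <a,(b,c)>: both last items occur in the same event."""
    return sorted(set(left) & set(right))


def temporal_join(left: IdList, right: IdList) -> IdList:
    """Id-list of the pattern whose last item (from `right`) occurs
    strictly after the last item of `left` in the same sequence."""
    earliest: Dict[int, int] = {}
    for sid, eid in left:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted({(sid, eid) for sid, eid in right
                   if sid in earliest and earliest[sid] < eid})


def support(idlist: IdList) -> int:
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


# Hypothetical id-lists of two 2-patterns sharing the prefix <a>.
ab: IdList = [(1, 20), (2, 30), (3, 15)]
ac: IdList = [(1, 20), (2, 10), (3, 40)]

print(equality_join(ab, ac))   # <a,(b,c)> -> [(1, 20)]
print(temporal_join(ab, ac))   # <a,b,c>   -> [(3, 40)]
print(temporal_join(ac, ab))   # <a,c,b>   -> [(2, 30)]
```

With point-based events, only these three outcomes need to be checked for each pair of patterns in an equivalence class.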

The lattice-based method is not suitable for dealing with temporal pattern mining or hybrid temporal pattern mining problems, because the relations among patterns of interval-based events or patterns of both interval- and point-based events are too complicated. The number of possible relations between two $(k-1)$-patterns is larger than that in sequential pattern mining. Adopting the lattice-based approach to resolve temporal or hybrid temporal pattern mining problems would lead to poor performance.

2.2. Temporal pattern mining

Temporal pattern mining is a variant of the sequential pattern mining problem. It discovers interval-based event patterns rather than point-based event patterns from sequence databases. In the temporal pattern mining problem, there are 13 possible relations between two interval-based events (Table 3), but in the traditional sequential pattern mining problem, there are only three possible relations between two point-based events. Possible relations among more than two events are much more complicated in temporal pattern mining than in sequential pattern mining.
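To make the count of 13 concrete, the sketch below classifies the relation between two intervals using the standard names from Allen's interval algebra. The names are an assumption for this illustration and may differ from the labels used in Table 3. Intervals are assumed to be proper (start strictly before end).

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def allen_relation(x: Interval, y: Interval) -> str:
    """Name the relation of interval x to interval y."""
    xs, xe = x
    ys, ye = y
    if xe < ys:
        return "before"
    if ye < xs:
        return "after"
    if xe == ys:
        return "meets"
    if ye == xs:
        return "met-by"
    # From here on the two intervals share more than a single point.
    if xs == ys and xe == ye:
        return "equal"
    if xs == ys:
        return "starts" if xe < ye else "started-by"
    if xe == ye:
        return "finishes" if xs > ys else "finished-by"
    if ys < xs and xe < ye:
        return "during"
    if xs < ys and ye < xe:
        return "contains"
    return "overlaps" if xs < ys else "overlapped-by"


def point_relation(p: float, q: float) -> str:
    """The three possible relations between two time points."""
    return "before" if p < q else "after" if p > q else "equal"


print(allen_relation((5, 10), (6, 12)))  # overlaps
print(allen_relation((8, 12), (6, 12)))  # finishes
print(point_relation(6, 8))              # before
```

Enumerating all pairs of intervals with end points in {0, 1, 2, 3} produces exactly 13 distinct relation names, against three for pairs of points.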

To the best of our knowledge, there have been only two studies on the temporal pattern mining problem: Kam and Fu's method (designated the KF method) [22] and TPrefixSpan [38].

2.2.1. Kam and Fu's method

The KF method is an Apriori-based algorithm. Similar to Apriori and GSP, patterns are generated length by length. The difference between KF and GSP is the candidate pattern generation phase. In the KF method, the candidate pattern generation phase is adjusted to handle the more complicated relations among interval-based events. The major drawback of the KF method is the ambiguity of its pattern representations, which is discussed in detail in [38].


2.2.2. TPrefixSpan

TPrefixSpan [38] is a PrefixSpan-based approach to tackle the temporal pattern mining problem. In addition to handling the pair-wise relation between the two event points of each interval-based event, the main difference between TPrefixSpan and PrefixSpan is the pattern appending operation in the third step. In PrefixSpan, a 1-pattern $b$ generated from each projected database can simply be appended to the prefix; however, this process is more complicated in TPrefixSpan since the 1-pattern $b$ can be appended to the prefix in multiple ways. Handling the complicated relations among interval-based events makes the performance of TPrefixSpan worse than that of PrefixSpan.

2.3. Hybrid temporal pattern mining

Hybrid temporal pattern mining is a method for discovering patterns in sequences containing both point- and interval-based events. The relations between two hybrid temporal patterns are even more complicated than those discussed in Section 3.2, which covers only relations between events (point- or interval-based). When designing methods to resolve this mining problem, efficiency is an important issue, and the hybrid temporal pattern mining method should also be able to find either sequential patterns or temporal patterns.

2.4. MILPRIT*

MILPRIT* [12,11] is a constraint-based hybrid temporal pattern mining method. It allows users to specify pattern constraints with a defined regular expression. For example, in the medical domain, users can discover patterns such as "patients take some medicine during a certain period of time and present some symptom at the end of this period" by specifying a pattern constraint. In MILPRIT*, a hybrid temporal pattern is represented by the triple (K,D,T). For example, we can set K = Patient(x) (where x is a registered patient), D = {Med(x,penicillin,e), Symp(x,dizziness,f), Hist(x,st.surgery,t)} (representing the events which take place in the pattern), and T = {before(e,f), during(t,f)} (representing the temporal relations among these events; the temporal relation representations are based on Allen's First Order Interval Logic [3]). The meaning of this pattern is: the patient takes penicillin during a certain period of time e; during a period of time f after taking the medicine, he feels dizzy and undergoes a stomach surgery at some time t during f (the examples are quoted directly from [11]). For the details, please refer to [12,11].

Since both MILPRIT* and HTPM are proposed to discover hybrid temporal patterns, it is necessary to discuss the differences between them. Table 1 summarizes the comparisons.

The first difference between the two methods is that MILPRIT* allows users to specify pattern constraints but HTPM does not. Although MILPRIT* can work without any pattern constraints, it is expected to perform worse in that setting because of its strategy for pattern generation and support counting. Thus, when users need to find all frequent patterns, HTPM would be a better choice than MILPRIT*. The second difference is that HTPM employs an embedding store technique, so it needs only one database scan. However, the embedding store technique requires more memory. Another difference is the types of temporal relation handled by the two methods. HTPM can handle all 21 temporal relations listed in Section 3.2, while MILPRIT* can handle 19 types (all except "equal" and "p-equal"). Finally, the pattern formats adopted by the two methods are different. Sample representations of the same pattern in both formats are given in Table 1.

2.5. Hybrid temporal pattern mining cannot be resolved by point-and interval-based methods

The hybrid temporal pattern mining problem cannot be resolved by any of the existing point- or interval-based methods mentioned before. The reasons are two-fold. First, the temporal pattern mining problem cannot be reduced to a sequential pattern mining problem. Second, the hybrid temporal pattern mining problem cannot be reduced to a temporal pattern mining problem.

Table 1. Comparison between HTPM and MILPRIT*.

| | HTPM | MILPRIT* |
|---|---|---|
| Pattern constraint | No | Yes |
| Pattern generation and support counting | Embedding store technique | Classical level-wise approach (Apriori-based) |
| Database scan | One time | Multiple times |
| Memory requirement | More | Less |
| Types of temporal relations | 21 | 19 |
| Pattern representation | Time ordering of event points. Ex. $(e^{+} < e^{-} < f^{+} < t < f^{-})$, where e is an interval-based event that the patient takes penicillin, f is an interval-based event that the patient feels dizzy, and t is a point-based event that the patient undergoes a stomach surgery | First-order linear temporal logic. Ex. (K,D,T), where K = Patient(x), D = {Med(x,penicillin,e), Symp(x,dizziness,f), Hist(x,st.surgery,t)}, T = {before(e,f), during(t,f)} |


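[Figure: timeline drawings of two interval-based event sequences over events a and b.] $s_x = \langle a^{+}, b^{+}, a^{-}, b^{-} \rangle$, $s_y = \langle a^{+}, a^{-}, a^{+}_{1}, b^{+}, a^{-}_{1}, b^{-} \rangle$

Fig. 1. Representing interval-based event sequences by point-based methods.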


The temporal pattern mining problem cannot be reduced to a sequential pattern mining problem, even if each interval-based event is transformed into two point-based events. For example, let $e^{+}$ and $e^{-}$ be the starting and ending points of an interval-based event $e$, so that an interval event $a$ is represented by the two event points $a^{+}$ and $a^{-}$. There are three reasons why the reduction fails. First, traditional point-based methods discover patterns without considering the pair-wise relation between the two points transformed from the same interval event. Therefore, when we calculate the support of pattern $\langle (a^{+}), (a^{-}) \rangle$, we cannot tell whether the two points come from the same event occurrence. As a result, incorrect patterns that are not really frequent are generated. Second, some useless patterns will be generated. For example, point-based methods may generate patterns such as $\langle (a^{+}), (b^{-}) \rangle$, $\langle (b^{-}), (a^{+}), (b^{+}) \rangle$, and $\langle (a^{+}, b^{+}), (c^{-}) \rangle$, which cannot completely describe the relations of interval events. Third, some frequent patterns may be lost. A simple way to handle pair-wise relations of events is to add an event index to each occurrence of the event. In this way, two occurrences of the same interval-based event will be treated as different events. Unfortunately, this may cause the support of each event to be underestimated, and thus some frequent patterns will be lost. For example, in Fig. 1, we see that $s_x$ is a part of $s_y$. However, in traditional point-based sequential pattern mining methods, $\langle a^{+}, b^{+}, a^{-}, b^{-} \rangle$ ($s_x$) and $\langle a^{+}_{1}, b^{+}, a^{-}_{1}, b^{-} \rangle$ (the right part of $s_y$) will be treated as two different patterns. When counting the support of $\langle a^{+}, b^{+}, a^{-}, b^{-} \rangle$, $s_x$ will be counted but $s_y$ will not be. Since the pattern support cannot be counted correctly, some frequent patterns will be lost.
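The first of these three problems can be reproduced with a small sketch. The toy sequence and all function names below are assumptions made for this illustration. The sequence contains two occurrences of a, neither of which overlaps b, yet a plain point-based subsequence test still matches the end-point pattern for "a overlaps b" by pairing the start of one occurrence of a with the end of the other.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, int, int]  # (event name, start time, end time)


def to_endpoints(intervals: List[Interval]) -> List[str]:
    """Flatten intervals into a time-ordered list of end-point labels."""
    points = []
    for name, start, end in intervals:
        points.append((start, name + "+"))
        points.append((end, name + "-"))
    return [label for _, label in sorted(points)]


def naive_match(pattern: Sequence[str], labels: Sequence[str]) -> bool:
    """Point-based subsequence test that ignores which a+ pairs with which a-."""
    it = iter(labels)
    return all(p in it for p in pattern)


def paired_overlap(intervals: List[Interval], first: str, second: str) -> bool:
    """True only if a single occurrence of `first` really overlaps one of `second`."""
    return any(
        n1 == first and n2 == second and s1 < s2 < e1 < e2
        for n1, s1, e1 in intervals
        for n2, s2, e2 in intervals
    )


# Hypothetical sequence: a=[1,2] lies before b=[3,8]; a=[5,6] lies during b.
seq = [("a", 1, 2), ("b", 3, 8), ("a", 5, 6)]
labels = to_endpoints(seq)

print(labels)                                        # ['a+', 'a-', 'b+', 'a+', 'a-', 'b-']
print(naive_match(["a+", "b+", "a-", "b-"], labels)) # True (a false positive)
print(paired_overlap(seq, "a", "b"))                 # False
```

The naive test counts this sequence as supporting the pattern, so its support is overestimated. This is why the pair-wise relation between the two end points of each occurrence must be tracked.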

Hybrid temporal pattern mining problems cannot be reduced to temporal pattern mining problems. Although a point-based event seems similar to a zero-time interval-based event, whose starting and ending times are the same, the semantics of these two kinds of events are different. If we represent a point-based event as a zero-time interval-based event, some problems will occur. First, this may degrade the efficiency of mining hybrid patterns, because sequences become longer when a single point is replaced by an interval event with two points. In addition, our experiments show that the algorithm developed in this study is much more efficient than TPrefixSpan, so even if TPrefixSpan were adapted to deal with the hybrid temporal pattern mining problem, it would not perform efficiently. Second, some unexpected patterns may be generated if an event is of double type, i.e., both point-based and interval-based. Since point-based events are treated as zero-time interval-based events, we cannot distinguish between these two event types; thus, the support of each event will be overestimated and some patterns which are not really frequent will be generated. Third, the discovered patterns may have semantic ambiguity problems. For example, when seeing a pattern such as $(a^{+} = a^{-} = c^{+} = c^{-})$, we have no idea whether this pattern contains two interval events, two point events, or one interval event and one point event. The semantic ambiguity problem may lead to difficulties when making crucial decisions.

3. Problem definition

3.1. Notations

Often, a group of time sequences composed of both point- and interval-based events is collected. From this sequence set, we would like to know which events frequently occur together and in what order they appear. For example, in meteorology, several weather stations collect a bundle of meteorological data. After summarizing, these data can be represented as in Table 2, where the ID is the station identifier, each point-based event occurs at a time point (Tp), and each interval-based event occurs during a time interval ([Ts,Te]). From such a hybrid event sequence database, the hybrid temporal pattern mining algorithm will discover all frequent hybrid temporal patterns. The following definitions formally describe this mining problem.

In what follows, let $E = \{ty_1, ty_2, \ldots, ty_u\}$ be the set of all event types that may occur in the point-based and interval-based events.

Definition 1 (Point-based event). A point-based event (poE) is an event occurring at a certain time point. A poE is stored in the form $(et, Tp)$ in a hybrid event sequence database, where $et \in E$ and $Tp$ is the time at which $et$ occurs. In a hybrid temporal pattern (defined in Definition 5), a poE is denoted as $et$.

Definition 2 (Interval-based event). An interval-based event (inE) is an event occurring over a time period. An inE is stored in the form $(et, [Ts, Te])$ in a hybrid event sequence database, where $et \in E$, and $Ts$ and $Te$ are the starting time and ending time of $et$, respectively. In a hybrid temporal pattern, an inE consists of two event nodes $et^{+}_{num}$ and $et^{-}_{num}$, called inE+ and inE−, respectively, where $et \in E$ and $num \in \{0, 1, 2, \ldots\}$ is the occurrence mark used to identify the pair-wise relation of end points for each inE, which is described in detail in Definition 5. Combining the two end points, an inE is written as $et^{+}_{num} < et^{-}_{num}$ or $et^{+}_{num} = et^{-}_{num}$.

Table 2. Database D with both point- and interval-based events.

| ID | Event | Time |
|---|---|---|
| 1 | C | 6 |
| 1 | C | 8 |
| 1 | A | [5,10] |
| 1 | B | [6,12] |
| 1 | A | [8,12] |
| 2 | C | 6 |
| 2 | C | 8 |
| 2 | B | [6,11] |
| 2 | A | [8,11] |
| 3 | C | 4 |
| 3 | A | [4,10] |
| 3 | B | [4,12] |
| 3 | A | [9,12] |

Example 1 (Point-based and Interval-based events). For database D in Table 2, we have event set E = {a, b, c}, where c is a point-based event, and a and b are interval-based events. According to Definitions 1 and 2, events a and b should be represented as $(a^{+} < a^{-})$ and $(b^{+} < b^{-})$, and event c should be simply represented as c.

In database D, the events can be divided into three event sets according to their IDs. Since the events in each set can be ordered by time, these three event sets can be treated as three hybrid event sequences. Note that an event may have multiple occurrences in each hybrid event sequence. Definition 3 defines the formation of a hybrid event sequence. The occurrence of an event is defined and explained in Definition 4 and Example 2.

Definition 3 (Hybrid event sequence). A hybrid event sequence is composed of a series of point-based and/or interval-based events. Thus, a hybrid event sequence can be represented as $s_i = \{(SID_i, e^{i}_{0}), (SID_i, e^{i}_{1}), (SID_i, e^{i}_{2}), \ldots, (SID_i, e^{i}_{n_i})\}$, where $SID_i$ is the sequence id of $s_i$ and each $e^{i}_{j}$ $(0 \le j \le n_i)$ is either a poE or an inE.

Definition 4 (Occurrence of event). In a hybrid event sequence with ID $= s_j$, the occurrence of event $e_i$ is recorded as $occur(e_i, s_j) = \{Tp_1, Tp_2, \ldots\}$ if $e_i$ is a poE, and as $occur(e_i, s_j) = \{[Ts_1, Te_1], [Ts_2, Te_2], \ldots\}$ if $e_i$ is an inE. The operation occur outputs all occurrence time values for the given event in a certain hybrid event sequence.

Example 2 (Occurrence of event). In database D, event c occurs twice in the hybrid event sequence with ID = 1; therefore, occur(c,1) = {6,8}. Event a occurs once in the hybrid event sequence with ID = 2 and twice in the hybrid event sequence with ID = 3; therefore, occur(a,2) = {[8,11]} and occur(a,3) = {[4,10], [9,12]}.
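The occur operation can be sketched directly over the rows of Table 2. The in-memory layout used below (a list of (ID, event, time) tuples with lower-case event names, where an interval is a (Ts, Te) pair) is an assumption made for this illustration.

```python
from typing import List, Tuple, Union

Time = Union[int, Tuple[int, int]]  # a time point Tp, or an interval (Ts, Te)
Row = Tuple[int, str, Time]         # (sequence ID, event type, time)

# Rows of Table 2.
D: List[Row] = [
    (1, "c", 6), (1, "c", 8), (1, "a", (5, 10)), (1, "b", (6, 12)), (1, "a", (8, 12)),
    (2, "c", 6), (2, "c", 8), (2, "b", (6, 11)), (2, "a", (8, 11)),
    (3, "c", 4), (3, "a", (4, 10)), (3, "b", (4, 12)), (3, "a", (9, 12)),
]


def occur(event: str, sid: int, database: List[Row] = D) -> List[Time]:
    """All occurrence times of `event` in the hybrid event sequence `sid`."""
    return [when for s, e, when in database if s == sid and e == event]


print(occur("c", 1))  # [6, 8]
print(occur("a", 2))  # [(8, 11)]
print(occur("a", 3))  # [(4, 10), (9, 12)]
```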

In Definitions 1 and 2, point-based events and the end points of interval-based events are called event nodes. A hybrid temporal pattern is composed of n event nodes (poE, inE+, or inE-) and (n - 1) order relations ($\odot$), each either "<" or "=", which describe the time relationship between two adjacent event nodes. The order relation is "<" if the time value of the preceding event node is smaller than that of the succeeding one, and "=" if they have an equal time value. The formal representation of a hybrid temporal pattern is given in Definition 5. Additionally, the arrangement of event nodes in a hybrid temporal pattern is regulated (in Definition 6) to ensure that each pattern has a unique expression.

Definition 5 (Hybrid temporal pattern). A hybrid temporal pattern htp is represented as $htp = (N_0 \odot_0 N_1 \odot_1 N_2 \odot_2 \cdots \odot_{(n-1)} N_n)$, where $N_i \in \{poE, inE^{+}, inE^{-}\}$ $(0 \le i \le n)$, and $\odot_i \in \{<, =\}$ $(0 \le i \le n - 1)$. In this representation, two event nodes, inE+ and inE-, coming from the same inE occurrence must be assigned the same occurrence mark (refer to Definition 2). Since there may be multiple occurrences of an inE in a hybrid pattern, it is necessary to distinguish which two event nodes, an inE+ and an inE-, represent the same inE occurrence. Throughout this paper, the occurrence marks of an inE are omitted when this inE has only one occurrence in the pattern.


Definition 6 (Arrangement of event nodes in htp). An event node $N_x$ is arranged before event node $N_y$ in a hybrid temporal pattern if the following conditions are satisfied:

(1) Timing: if $time(N_x) < time(N_y)$, where time(N) is the occurrence time of N.

(2) Alphabet: if $time(N_x) = time(N_y)$, but the event name of $N_x$ alphabetically precedes that of $N_y$.

(3) Event node type: if criteria 1 and 2 are tied, but one of the following holds: (a) $N_x$ is an inE+ and $N_y$ is an inE-; (b) $N_x$ is an inE+ and $N_y$ is a poE; (c) $N_x$ is a poE and $N_y$ is an inE-.

(4) Occurrence mark: if the above criteria are tied, but $num(N_x) < num(N_y)$, where num(N) is the attached occurrence mark of N.

Since these definitions are somewhat involved, we use Example 3 to explain Definitions 5 and 6 in detail.

Example 3 (Hybrid temporal patterns). A hybrid temporal pattern htp1 consists of five events: (a+ < a-) during [3,8]; (b+ < b-) during [3,5]; (b) at time 3; (a+ < a-) during [3,9]; and (a+ < a-) during [4,9], as shown in Fig. 2. Notice that an event type such as b is not restricted to a unique form (point or interval). Based on Definitions 5 and 6, this pattern should be represented as htp1 = (a+0 = a+1 = b+ = b < a+2 < b- < a-0 < a-1 = a-2).

In htp1, we determine the part b < a+2 < b- < a-0 < a-1 according to rule 1. Based on rule 2, we obtain another part, a+1 = b+. Furthermore, by rule 3 we assign b+ = b. Finally, according to rule 4, we derive a+0 = a+1 and a-1 = a-2.

The length of a hybrid temporal pattern is the number of event occurrences in a pattern, rather than the number of event nodes k. A hybrid temporal pattern with length l is called an l-events hybrid temporal pattern. For example, the length of htp1 in Fig. 2 is 5, not 9.
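The arrangement rules of Definition 6 amount to sorting event nodes by a four-part key. The sketch below reproduces htp1 from Example 3 under one assumption that the text does not state explicitly: occurrence marks are assigned to the occurrences of an interval-based event in order of (start, end) time. The function name `canonical_pattern` and the tuple layout are hypothetical.

```python
# Criterion 3 of Definition 6: inE+ precedes poE, which precedes inE-.
TYPE_RANK = {"+": 0, "": 1, "-": 2}


def canonical_pattern(intervals, points):
    """Build the pattern string and length for a non-empty set of events.

    intervals: list of (name, start, end); points: list of (name, time).
    """
    # Assumption: occurrence marks follow the (start, end) order per event name.
    by_name = {}
    for name, start, end in sorted(intervals):
        by_name.setdefault(name, []).append((start, end))

    nodes = []  # each node is (time, name, type rank, mark, label)
    for name, occs in by_name.items():
        for mark, (start, end) in enumerate(occs):
            # The mark is omitted when the event has a single occurrence.
            suffix = str(mark) if len(occs) > 1 else ""
            nodes.append((start, name, TYPE_RANK["+"], mark, f"{name}+{suffix}"))
            nodes.append((end, name, TYPE_RANK["-"], mark, f"{name}-{suffix}"))
    for name, time in points:
        nodes.append((time, name, TYPE_RANK[""], 0, name))

    # Sort by timing, alphabet, event node type, then occurrence mark.
    nodes.sort(key=lambda n: n[:4])

    parts = [nodes[0][4]]
    for prev, cur in zip(nodes, nodes[1:]):
        parts.append("=" if prev[0] == cur[0] else "<")
        parts.append(cur[4])
    # The length counts event occurrences, not event nodes.
    return "(" + " ".join(parts) + ")", len(intervals) + len(points)


pattern, length = canonical_pattern(
    intervals=[("a", 3, 8), ("b", 3, 5), ("a", 3, 9), ("a", 4, 9)],
    points=[("b", 3)],
)
print(pattern, length)
```

Because the sort key is a total order on the nodes, any input listing of the same five events leads to the same expression, which is the uniqueness property that Definition 6 is meant to guarantee.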

Now that we understand the formal expression of a hybrid temporal pattern, we explain how to recognize a hybrid temporal pattern as frequent. Similar to the original sequential pattern mining problem, a frequent hybrid temporal pattern should be supported by at least min_sup percent of the sequences. The following definitions formally state this idea.

Definition 7 (Occurrence of pattern). The occurrence of a hybrid temporal pattern $htp_i$ with k event nodes in a sequence $s_j$ is denoted as $occur(htp_i, s_j) = \{ot_1, ot_2, \ldots\}$, where each $ot_a$ is a sequence of k time values, which are the occurrence times of the k event nodes of $htp_i$ in $s_j$. $occur(htp_i, s_j)$ returns all occurrences of $htp_i$ in $s_j$. If $occur(htp_i, s_j)$ returns $\emptyset$, we say that $s_j$ does not support $htp_i$; otherwise, we say that $s_j$ supports $htp_i$, noted as $htp_i \;\mathrm{IN}\; s_j$.

Example 4 (Occurrence of pattern). Suppose we are given four hybrid temporal patterns: htp2 = (c), htp3 = (a+ < a-), htp4 = (a+ < b+ < a- < b-), and htp5 = (a+ = b+ < a- < b-). s1 is the hybrid event sequence with ID = 1 in hybrid event sequence database D in Table 2. According to Definition 7, occur(htp2, s1) = {(6), (8)}, occur(htp3, s1) = {(5,10), (8,12)}, occur(htp4, s1) = {(5,6,10,12)}, and occur(htp5, s1) = ∅. Therefore, s1 supports htp2, htp3, and htp4, but does not support htp5, noted as $htp_2 \;\mathrm{IN}\; s_1$, $htp_3 \;\mathrm{IN}\; s_1$, $htp_4 \;\mathrm{IN}\; s_1$, and $\lnot(htp_5 \;\mathrm{IN}\; s_1)$.

Definition 8 (Support). The support of a hybrid temporal pattern $htp_i$ in a hybrid event sequence database D is defined as Eq. (1).

$$Support(htp_i, D) = \frac{|\{s_j \mid htp_i \;\mathrm{IN}\; s_j,\ s_j \in D\}|}{|D|}, \qquad (1)$$

where $|D|$ is the number of sequences in D.

Given a threshold, min_sup, and a database D, a hybrid temporal pattern $htp_i$ is called frequent if $Support(htp_i, D)$ is no less than min_sup.
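Eq. (1) can be evaluated directly from the per-sequence occurrences of a pattern. The following sketch uses the occurrences shown in Fig. 3 for two 2-event patterns; the function name `support` and the dictionary layout (sequence ID mapped to a list of occurrences) are illustrative assumptions.

```python
def support(occurrences_by_sequence, num_sequences):
    """Fraction of sequences in which the pattern occurs at least once (Eq. (1))."""
    supporting = sum(1 for occs in occurrences_by_sequence.values() if occs)
    return supporting / num_sequences


# Occurrences taken from Fig. 3; an empty list means the sequence does not
# support the pattern.
occ_a_c_a = {1: [(5, 6, 10), (5, 8, 10)], 2: [], 3: []}          # (a+ < c < a-)
occ_c_a_a = {1: [(6, 8, 12)], 2: [(6, 8, 11)], 3: [(4, 9, 12)]}  # (c < a+ < a-)

min_sup = 0.5
for name, occ in [("(a+ < c < a-)", occ_a_c_a), ("(c < a+ < a-)", occ_c_a_a)]:
    s = support(occ, num_sequences=3)
    print(name, round(s, 3), "frequent" if s >= min_sup else "not frequent")
```

Note that a sequence counts once no matter how many occurrences it contains, so the two occurrences of (a+ < c < a-) in s1 still give a support of only 1/3.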

3.2. Temporal relations in hybrid domain

Since the hybrid temporal pattern mining model can handle not only purely interval-based events and purely point-based events, but also both kinds of events together, the proposed hybrid model is a general case of sequential pattern mining and temporal pattern mining. From Tables 3–5, we see that there are 13 possible temporal relations between two interval-based events and three relations between two point-based events, but a total of 21 possible relations between two events (point- or interval-based) in hybrid event sequences. Using the proposed hybrid model, we can not only handle the two existing mining problems (Tables 3 and 4), but also discover another five temporal relations (Table 5) that cannot be found in a pure point-based or pure interval-based model.

htp1 = (a+0 = a+1 = b+ = b < a+2 < b- < a-0 < a-1 = a-2)

Fig. 2. Formal expression of a hybrid temporal pattern.

There are 13 possible temporal relations between two intervals, which are listed in Table 3 [3,22]. Using the expressions given in Definitions 3 and 4, the hybrid temporal patterns of these temporal relations are shown in the far right-hand column of Table 3. Each of the 13 temporal relations has an inverse, except for "equal". For example, "X before Y" is different from "Y before X" (or "X after Y"), but they are in an inverse relationship. The corresponding hybrid temporal patterns defined in this paper are also different: they are "X+ < X- < Y+ < Y-" and "Y+ < Y- < X+ < X-". We omit the hybrid temporal patterns of all inverse temporal relations, since they are inferable. There are only three possible temporal relations between two points, "before", "after", and "equal" (Table 4). We name these three temporal relations "p-before", "p-after", and "p-equal" in order to distinguish them from temporal relations among interval-based events. Hybrid temporal pattern mining handles the 16 temporal relations previously handled by temporal pattern mining and sequential pattern mining methods, together with the five additional temporal relations (Table 5) that arise in the hybrid model. In Table 5, the "A h-Rel B" format is used to name the temporal relations between a point-based event A and an interval-based event B.

Table 3
Temporal relations between two interval-based events.

| No. | Temporal relation | Inverse temporal relation | Hybrid temporal pattern |
|---|---|---|---|
| 1, 2 | X before Y | Y after X | X+ < X- < Y+ < Y- |
| 3 | X equal Y | – | X+ = Y+ < X- = Y- |
| 4, 5 | X meets Y | Y met by X | X+ < X- = Y+ < Y- |
| 6, 7 | X overlaps Y | Y overlapped by X | X+ < Y+ < X- < Y- |
| 8, 9 | X during Y | Y contains X | Y+ < X+ < X- < Y- |
| 10, 11 | X starts Y | Y started by X | X+ = Y+ < X- < Y- |
| 12, 13 | X finishes Y | Y finished by X | Y+ < X+ < X- = Y- |

Table 4
Temporal relations between two point-based events.

| No. | Temporal relation | Inverse temporal relation | Hybrid temporal pattern |
|---|---|---|---|
| 14, 15 | X p-before Y | Y p-after X | X < Y |
| 16 | X p-equal Y | – | X = Y |

Table 5
Temporal relations between a point-based event and an interval-based event.

| No. | Temporal relation | Hybrid temporal pattern |
|---|---|---|
| 17 | X h-before Y | X < Y+ < Y- |
| 18 | X h-starts Y | X = Y+ < Y- |
| 19 | X h-during Y | Y+ < X < Y- |
| 20 | X h-finishes Y | Y+ < X = Y- |
| 21 | X h-after Y | Y+ < Y- < X |
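The counts of 13, 3, and 5 relations can be checked by brute force: enumerate every relative ordering of the end points of two events and keep the distinct ones. This is only a verification sketch, not part of HTPM, and it assumes that every interval has a start strictly before its end (zero-length intervals are excluded). The function names are hypothetical.

```python
from itertools import product


def dense_ranks(times):
    """Map concrete times to their relative order, e.g. (5, 9, 5) -> (0, 1, 0)."""
    order = sorted(set(times))
    return tuple(order.index(t) for t in times)


def count_relations(x_is_interval, y_is_interval):
    """Count distinct temporal relations between two events X and Y."""
    nx = 2 if x_is_interval else 1
    ny = 2 if y_is_interval else 1
    n = nx + ny
    seen = set()
    # n distinct time values are enough to realize every ordering of n end points.
    for times in product(range(n), repeat=n):
        x, y = times[:nx], times[nx:]
        if x_is_interval and not x[0] < x[1]:
            continue  # assumption: start strictly before end
        if y_is_interval and not y[0] < y[1]:
            continue
        seen.add(dense_ranks(times))
    return len(seen)


interval_interval = count_relations(True, True)
point_point = count_relations(False, False)
point_interval = count_relations(False, True)
print(interval_interval, point_point, point_interval,
      interval_interval + point_point + point_interval)
```

Each distinct rank tuple corresponds to one row (or one inverse) in Tables 3–5, which is why the three counts add up to the 21 relations stated above.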

4. Mining hybrid temporal patterns

4.1. Algorithm HTPM

We design a new algorithm, HTPM (Hybrid Temporal Pattern Mining), to address the problem of hybrid temporal pattern mining. HTPM discovers patterns of point-based events, interval-based events, or both. The entire HTPM process requires only one database scan. Let Lk be the set of frequent k-events hybrid temporal patterns. The process starts by scanning the hybrid event sequence database D to generate L1. The occurrence time values of each pattern in every sequence of D are recorded for every frequent pattern set (L1, L2, L3, ...); the set of occurrence time values is called the occurrence record (OR). By joining the ORs of patterns with length (k - 1), HTPM generates patterns with length k (k ≥ 2) without another database scan. Before stepping into HTPM, we give some definitions related to this mining method.

Definition 9 (Subpattern). Given two hybrid temporal patterns $\alpha = (N_0 \odot_0 N_1 \odot_1 N_2 \odot_2 \cdots \odot_{(m-1)} N_m)$ and $\beta = (N'_0 \odot'_0 N'_1 \odot'_1 N'_2 \odot'_2 \cdots \odot'_{(n-1)} N'_n)$, where $n > m$, $\alpha$ is called a subpattern of $\beta$ iff we can find m node indexes (the subscripts) $0 \le w_1 < w_2 < \cdots < w_m \le n$ for m event nodes in $\beta$ such that the following conditions are satisfied.

(1) $N_i$ and $N'_{w_i}$ are both poE, both inE+, or both inE- $(0 \le i \le m)$.

(2) $N_i$ and $N'_{w_i}$ are of the same event type.

(3) If $N_x$ and $N_y$ are inE+ and inE- in $\alpha$, respectively, and the occurrence mark of $N_x$ is equal to that of $N_y$, then the occurrence mark of $N'_{w_x}$ must be equal to that of $N'_{w_y}$.

(4) $\odot_i = Small(\odot'_{w_i}, \odot'_{w_{(i+1)}-1})$, for $0 \le i \le m - 1$. The function Small returns "=" if all order relations $\odot'_{w_i}, \odot'_{w_i+1}, \ldots, \odot'_{w_{(i+1)}-1}$ are "="; otherwise it returns "<".

Definition 10 (Prefix (k - 1)-subpattern). Given two hybrid temporal patterns $\alpha$ and $\beta$ (the length of $\beta$ is k), where $\alpha$ is a subpattern of $\beta$, we call $\alpha$ the prefix (k - 1)-subpattern of $\beta$ if $\alpha$ is equal to $\beta$ after deleting the last occurring event from $\beta$. (The last occurring event of $\beta$ is a poE or an inE in $\beta$ with the maximum index among all poEs and inE+s in $\beta$. After deleting an event node $N'_p$ from $\beta = (N'_0 \odot'_0 \cdots N'_{(p-1)} \odot'_{(p-1)} N'_p \odot'_p N'_{(p+1)} \cdots \odot'_{(n-1)} N'_n)$, the order relation between $N'_{(p-1)}$ and $N'_{(p+1)}$ should be replaced by $Small(\odot'_{(p-1)}, \odot'_p)$.)

The prefix (k - 2)-subpattern of $\beta$ can be obtained by deleting the last two occurring events.
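Definition 10 can be sketched in a few lines: locate the last occurring event, drop its node(s), and merge the surrounding order relations with Small. The representation below (a list of (name, kind, mark) nodes plus a list of relations) and all function names are assumptions made for illustration; the third tuple field is used only to tell occurrences apart and is not printed.

```python
def small(relations):
    """Return "=" only if every spanned order relation is "="; otherwise "<"."""
    return "=" if all(r == "=" for r in relations) else "<"


def delete_nodes(nodes, relations, drop):
    """Remove the nodes whose indexes are in `drop`, merging relations with small()."""
    keep = [i for i in range(len(nodes)) if i not in drop]
    new_nodes = [nodes[i] for i in keep]
    # relations[i] links nodes[i] and nodes[i + 1], so relations[i:j] spans i..j.
    new_relations = [small(relations[i:j]) for i, j in zip(keep, keep[1:])]
    return new_nodes, new_relations


def prefix_subpattern(nodes, relations):
    """Delete the last occurring event: the poE or inE+ with the largest index."""
    last = max(i for i, (_, kind, _) in enumerate(nodes) if kind in ("p", "+"))
    name, kind, mark = nodes[last]
    drop = {last}
    if kind == "+":
        # An interval-based event also loses its matching end point.
        drop.add(nodes.index((name, "-", mark)))
    return delete_nodes(nodes, relations, drop)


def show(nodes, relations):
    labels = [name + ("" if kind == "p" else kind) for name, kind, _ in nodes]
    out = [labels[0]]
    for rel, label in zip(relations, labels[1:]):
        out += [rel, label]
    return "(" + " ".join(out) + ")"


# (b+ = c < a+ < a- = b-): the last occurring event is the interval a.
beta1 = ([("b", "+", 0), ("c", "p", 0), ("a", "+", 0), ("a", "-", 0), ("b", "-", 0)],
         ["=", "<", "<", "="])
# (b+ = c < c < b-): the last occurring event is the second point c.
beta2 = ([("b", "+", 0), ("c", "p", 0), ("c", "p", 1), ("b", "-", 0)],
         ["=", "<", "<"])

print(show(*prefix_subpattern(*beta1)))
print(show(*prefix_subpattern(*beta2)))
```

Both patterns reduce to the same prefix, which is the condition GenLk checks before joining two patterns.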

Method: Call HTPM(D, min_sup)
Input: D: Hybrid event sequence database; min_sup: Support threshold given by the user
Output: FPS: The set of all frequent hybrid temporal patterns

Procedure HTPM(D, min_sup) {
    FPS = ∅;
    (L1, OR1) = GenL1(D, min_sup); //Generate 1-event hybrid temporal patterns
    For (k = 2; |L(k - 1)| > 1; k++) {
        (Lk, ORk) = GenLk(L(k - 1), OR(k - 1), min_sup); //Generate k-events hybrid temporal patterns by joining occurrence records of (k - 1)-events patterns
        FPS = FPS ∪ Lk; //The symbol ∪ represents the set union operation.
    }
    Output FPS;
}

In the hybrid event sequence database D in Table 2, we have event set E = {a, b, c}. When min_sup is set to 50%, GenL1() scans database D once and obtains L1 = {(a+ < a-), (b+ < b-), (c)}. Each pattern in L1 is associated with an occurrence record listed in Table 6. HTPM maintains an occurrence record for each pattern in order to generate other patterns in the next step. The occurrence record of $p_i$ in D consists of $occur(p_i, s_1)$, $occur(p_i, s_2)$, and $occur(p_i, s_3)$, where $s_j$ is the sequence with ID = j in D. Table 6 lists 1-event hybrid temporal patterns with occurrence records for the sequence data in Table 2. Some examples of occurrence records for patterns with length ≥ 2 are listed in Figs. 3 and 4.

Table 6
Frequent 1-event hybrid temporal patterns discovered from D (min_sup = 50%).

| L1 | Index | Occurrence |
|---|---|---|
| a+ < a- | s1 | (5,10), (8,12) |
| | s2 | (8,11) |
| | s3 | (4,10), (9,12) |
| b+ < b- | s1 | (6,12) |
| | s2 | (6,11) |
| | s3 | (4,12) |
| c | s1 | (6), (8) |
| | s2 | (6), (8) |
| | s3 | (4) |

Fig. 3. Example of generating L2 from joining L1. The occurrence records of (a+ < a-) and (c) are joined with min_sup = 50%; the pattern supported by only one sequence is deleted.

| Pattern | s1 | s2 | s3 | Status |
|---|---|---|---|---|
| (a+ < a-) | (5, 10), (8, 12) | (8, 11) | (4, 10), (9, 12) | joined |
| (c) | (6), (8) | (6), (8) | (4) | joined |
| (a+ < c < a-) | (5, 6, 10), (5, 8, 10) | ∅ | ∅ | deleted |
| (c < a+ < a-) | (6, 8, 12) | (6, 8, 11) | (4, 9, 12) | kept |
| (a+ = c < a-) | (8, 8, 12) | (8, 8, 11) | (4, 4, 10) | kept |

Fig. 4. Example of generating Lk from joining L(k - 1) (k > 2). The last row is the pattern obtained by the join.

| Pattern | s1 | s2 | s3 |
|---|---|---|---|
| (b+ = c < a+ < a- = b-) | (6, 6, 8, 12, 12) | (6, 6, 8, 11, 11) | (4, 4, 9, 12, 12) |
| (b+ < a+ = c < a- = b-) | (6, 8, 8, 12, 12) | (6, 8, 8, 11, 11) | ∅ |
| (c < a+ = c < a-) | (6, 8, 8, 12) | (6, 8, 8, 11) | ∅ |
| (b+ = c < c < b-) | (6, 6, 8, 12) | (6, 6, 8, 11) | ∅ |
| (b+ = c < a+ = c < a- = b-) | (6, 6, 8, 8, 12, 12) | (6, 6, 8, 8, 11, 11) | ∅ |

After generating L1, GenLk joins the ORs of two (k - 1)-events hybrid temporal patterns to obtain one or more patterns with length k. When generating L2, GenLk joins all pairs of patterns (including self-joins) in L1. For example, we have to join the following pairs from Table 6: ((a+ < a-), (a+ < a-)); ((a+ < a-), (b+ < b-)); ((a+ < a-), (c)); ((b+ < b-), (b+ < b-)); ((b+ < b-), (c)); and ((c), (c)). When generating Lk (k > 2), GenLk joins the ORs of two patterns in L(k - 1) if they have the same prefix (k - 2)-events subpattern. Examples 5 and 6 explain GenLk, and Example 7 explains how to join two ORs. The pseudo code of subroutine GenLk is shown below.

Subroutine: Call GenLk(L(k - 1), OR(k - 1), min_sup)
Input: L(k - 1): (k - 1)-event hybrid temporal pattern set; OR(k - 1): Occurrence records for all (k - 1)-patterns in L(k - 1); min_sup: Support threshold given by the user
Output: Lk: k-event hybrid temporal pattern set; ORk: Occurrence records for all k-patterns in Lk.

Procedure GenLk(L(k - 1), OR(k - 1), min_sup) {
    Lk = ∅; ORk = ∅;
    For each two patterns pa, pb in L(k - 1) {
        If (pa and pb share common prefix (k - 2)-events subpattern) {
            X & ORx = JoinOR(pa, pa.OR, pb, pb.OR, min_sup); //joining ORs of pa and pb to generate k-patterns
            Lk = Lk ∪ X; //The symbol ∪ represents the set union operation.
            ORk = ORk ∪ ORx;
        }
    }
    return Lk & ORk;
}

Example 5 (Generate L2 from L1). In Table 6, if we join (a+ < a-) and (c), we obtain the following patterns: (a+ < c < a-), (c < a+ < a-), and (a+ = c < a-), as shown in Fig. 3. Since min_sup is set to 50%, only (c < a+ < a-) and (a+ = c < a-) are frequent.

Example 6 (Generate Lk from L(k - 1) (k > 2)). In Fig. 4, we have four 3-events hybrid temporal patterns discovered from D in Table 2. GenLk joins only the pair (b+ = c < a+ < a- = b-) and (b+ = c < c < b-), because the first two events in both patterns are (b+ < b-) and (c), and the prefix 2-events subpattern of these patterns is (b+ = c < b-). Therefore, the two patterns share the same prefix 2-events subpattern. GenLk then joins (b+ = c < a+ < a- = b-) and (b+ = c < c < b-) to generate the pattern (b+ = c < a+ = c < a- = b-).

Example 7 (Joining two occurrence records). In Example 5, when joining (a+ < a-) and (c), we first join the first, second, and third tuples of occurrence records for (a+ < a-) and (c) separately. Taking the first tuple as an example, we have to join set {(5,10), (8,12)} with set {(6), (8)}, which results in set {(5,6,10), (5,8,10), (6,8,12), (8,8,12)}. In turn, the first two occurrences generate pattern (a+ < c < a-), and the last two generate patterns (c < a+ < a-) and (a+ = c < a-). This process is illustrated in Fig. 5.
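The join in Examples 5 and 7 can be reproduced with a short sketch for the special case of two different 1-event patterns. It is a simplified stand-in for JoinOR, not the authors' implementation: self-joins, occurrence marks, and the prefix handling needed for k > 2 are left out, and the function names are hypothetical. Occurrence records follow Table 6.

```python
from itertools import product

TYPE_RANK = {"+": 0, "": 1, "-": 2}  # inE+ before poE before inE- (Definition 6)


def nodes_of(name, occ):
    """Expand one occurrence into (time, name, type) event nodes."""
    if len(occ) == 1:  # point-based event
        return [(occ[0], name, "")]
    return [(occ[0], name, "+"), (occ[1], name, "-")]


def join_occurrences(name_a, occ_a, name_b, occ_b):
    """Merge two occurrences into one candidate pattern and its time values."""
    nodes = nodes_of(name_a, occ_a) + nodes_of(name_b, occ_b)
    nodes.sort(key=lambda n: (n[0], n[1], TYPE_RANK[n[2]]))
    parts = [nodes[0][1] + nodes[0][2]]
    for prev, cur in zip(nodes, nodes[1:]):
        parts += ["=" if prev[0] == cur[0] else "<", cur[1] + cur[2]]
    return "(" + " ".join(parts) + ")", tuple(n[0] for n in nodes)


def join_or(name_a, or_a, name_b, or_b):
    """Join two 1-event occurrence records sequence by sequence."""
    result = {}  # pattern -> {sid: [occurrence, ...]}
    for sid in sorted(or_a.keys() & or_b.keys()):
        for occ_a, occ_b in product(or_a[sid], or_b[sid]):
            pattern, times = join_occurrences(name_a, occ_a, name_b, occ_b)
            result.setdefault(pattern, {}).setdefault(sid, []).append(times)
    return result


or_a = {1: [(5, 10), (8, 12)], 2: [(8, 11)], 3: [(4, 10), (9, 12)]}
or_c = {1: [(6,), (8,)], 2: [(6,), (8,)], 3: [(4,)]}

joined = join_or("a", or_a, "c", or_c)
min_sup, num_sequences = 0.5, 3
frequent = {p for p, rec in joined.items() if len(rec) / num_sequences >= min_sup}
for pattern, rec in joined.items():
    print(pattern, rec)
print(sorted(frequent))
```

Only patterns that actually occur in some sequence appear in `joined`, and the number of sequence IDs recorded for each pattern gives its support without any further scan.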

In GenLk, before calling the subroutine JoinOR to join the ORs of patterns pa and pb, we have to check whether pa and pb have a common prefix (k - 2)-subpattern, as defined in Definition 10. When checking for the common prefix, a mark is attached to the event nodes of the common prefix part in patterns pa and pb. The kernel of JoinOR, the subroutine that joins the ORs of two (k - 1)-patterns, is a process similar to sequence alignment, ORAlign. The difference between ORAlign and traditional sequence alignment is that ORAlign must consider three additional things: (1) the event nodes being compared must belong to the common prefix part; (2) the time order in the ORs pa.OR and pb.OR; and (3) the arrangement criteria defined in Definition 6 for patterns pa and pb. The pseudo code for ORAlign() is shown below.

Subroutine: Call ORAlign(pa, ta, pb, tb)
Input: pa, pb: two (k - 1)-patterns; ta: one occurrence record for a certain sid in pa.OR; tb: one occurrence record for the same sid as ta in pb.OR.
Output: pc: a candidate k-pattern joined from pa and pb; tc: the occurrence record for pc, which is joined from ta and tb.

Procedure ORAlign(pa, ta, pb, tb) {
    ia = 0; ib = 0; pc = ∅; tc = ∅;
    While (ia < pa.size && ib < pb.size) {
        Case 1: ta[ia] == tb[ib]
            if (pa[ia] and pb[ib] are both prefix) {
                Append pa[ia] to pc; Append ta[ia] to tc;
                ia++; ib++;
            } elseif (pa[ia]'s node priority is higher than that of pb[ib]) //According to Definition 6
                Append pa[ia] to pc; Append ta[ia] to tc; ia++;
            else
                Append pb[ib] to pc; Append tb[ib] to tc; ib++;
        Case 2: ta[ia] < tb[ib]
            Append pa[ia] to pc; Append ta[ia] to tc; ia++;
        Case 3: ta[ia] > tb[ib]
            Append pb[ib] to pc; Append tb[ib] to tc; ib++;
    }
    if (ia < pa.size)
        Append the surplus part of pa and ta to pc and tc, respectively;
    if (ib < pb.size)
        Append the surplus part of pb and tb to pc and tc, respectively;
    return pc & tc;
}

Fig. 5. Joining two occurrence records. For the first tuple of (a+ < a-) and (c), join((5, 10), (6)) and join((5, 10), (8)) yield (a+ < c < a-), join((8, 12), (6)) yields (c < a+ < a-), and join((8, 12), (8)) yields (a+ = c < a-).
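The alignment above can be traced on the join from Example 6. The following is a minimal sketch, assuming that each event node carries a flag saying whether it belongs to the common prefix and that two prefix nodes meeting at the same time are the same prefix node. The tuple layout and function names are hypothetical, and the order relations of the result are derived from the joined time values.

```python
TYPE_RANK = {"+": 0, "p": 1, "-": 2}  # inE+ before poE before inE- (Definition 6)


def priority(node):
    """Tie-break key for nodes with equal time: name, node type, occurrence mark."""
    name, kind, mark, _ = node
    return (name, TYPE_RANK[kind], mark)


def or_align(pa, ta, pb, tb):
    """Align two patterns whose nodes are (name, kind, mark, is_prefix) tuples."""
    ia = ib = 0
    pc, tc = [], []
    while ia < len(pa) and ib < len(pb):
        if ta[ia] == tb[ib] and pa[ia][3] and pb[ib][3]:
            # Shared prefix node: emit it once and advance both patterns.
            pc.append(pa[ia]); tc.append(ta[ia]); ia += 1; ib += 1
        elif ta[ia] < tb[ib] or (ta[ia] == tb[ib]
                                 and priority(pa[ia]) <= priority(pb[ib])):
            pc.append(pa[ia]); tc.append(ta[ia]); ia += 1
        else:
            pc.append(pb[ib]); tc.append(tb[ib]); ib += 1
    # At most one of the two patterns still has nodes left.
    pc += pa[ia:] + pb[ib:]
    tc += list(ta[ia:]) + list(tb[ib:])
    return pc, tc


def show(pc, tc):
    labels = [n[0] + ("" if n[1] == "p" else n[1]) for n in pc]
    out = [labels[0]]
    for i in range(1, len(pc)):
        out += ["=" if tc[i - 1] == tc[i] else "<", labels[i]]
    return "(" + " ".join(out) + ")"


# pa = (b+ = c < a+ < a- = b-), pb = (b+ = c < c < b-), occurrences in s1 (Fig. 4).
pa = [("b", "+", 0, True), ("c", "p", 0, True), ("a", "+", 0, False),
      ("a", "-", 0, False), ("b", "-", 0, True)]
ta = (6, 6, 8, 12, 12)
pb = [("b", "+", 0, True), ("c", "p", 0, True), ("c", "p", 1, False),
      ("b", "-", 0, True)]
tb = (6, 6, 8, 12)

pc, tc = or_align(pa, ta, pb, tb)
print(show(pc, tc), tc)
```

The sketch aligns a single pair of occurrences; a full JoinOR would repeat this for every pair of occurrences with the same sequence ID and then count the supporting sequences.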

In summary, HTPM first generates L1, and then iteratively executes GenLk to join pairs of patterns with length (k - 1) to generate Lk (k ≥ 2). If the generated Lk has more than one pattern, we increase k by one and then perform GenLk to generate the next set of longer patterns. The entire HTPM process (for mining database D shown in Table 2) is listed in detail in Appendix A. All mining results from Table 2 are shown in Fig. 6.

4.2. The correctness and completeness of HTPM

The time complexity and memory usage of HTPM are discussed in Appendix B. The correctness and completeness of HTPM are proven below:

Lemma 1 (Correctness of HTPM). The hybrid temporal patterns obtained by the HTPM algorithm are frequent.

Proof. The algorithm outputs a pattern only after its support has been examined and found satisfactory. □

Lemma 2 (Completeness of HTPM). The HTPM algorithm can find every frequent hybrid temporal k-pattern.

Proof. Given the set of all events E, L1 is constructed by removing non-frequent events from E. Thus, all frequent 1-patterns can be found. Assume that all frequent k-patterns, Lk, are found. Let htp be a frequent (k + 1)-pattern. The following shows that htp can be found from Lk. Since htp is frequent, all its subpatterns are frequent. Thus, we can find a pair pa and pb (subpatterns of htp) in Lk that share the same prefix (k - 1)-events subpattern. After joining pa and pb, we obtain htp and find that it is frequent. □

4.3. Discussion of HTPM

Similar to sequential mining methods such as GSP, PrefixSpan, and SPADE, HTPM generates patterns based on the anti-monotone property. One of the drawbacks of GSP is its inefficiency, since it needs multiple database scans and generates a great number of candidate patterns. This problem worsens when discovering hybrid temporal patterns, because the relations among events are more complicated and the sequences are much longer in the hybrid problem than in the point-based problem. In HTPM, we employ the embedding store technique to solve the problem of discovering hybrid temporal patterns. This technique reduces the number of database scans to a single scan, thereby speeding up the mining process.

Fig. 6. Frequent hybrid temporal patterns discovered from database D.

The major difference between HTPM and GSP is that HTPM generates patterns by maintaining an OR set for each frequent pattern. By joining the ORs of frequent (k - 1)-event patterns, HTPM generates k-event patterns without any further database scan and avoids unnecessary candidate pattern generation. In addition, in the pattern generation process, HTPM does not use a pruning phase like that used in GSP. There are two reasons for this. First, the pruning phase is not as efficient in mining hybrid temporal patterns as it is in mining sequential patterns. Since a hybrid temporal pattern is almost twice the length of a sequential pattern, pruning candidate patterns with infrequent subpatterns is much more time-consuming in the former than in the latter. For example, it is easy to find the 3-event subpatterns of a sequential pattern with four events ⟨a, b, c, d⟩. The results are ⟨a, b, c⟩, ⟨a, b, d⟩, ⟨a, c, d⟩, and ⟨b, c, d⟩. In contrast, it is not easy to find all 3-event subpatterns of a hybrid temporal pattern with four events (a+ < b+ < c < d+ < b- < a- < d-). Second, generating patterns by joining the ORs of the patterns, instead of joining the patterns themselves, is efficient. In HTPM, (k + 1)-patterns are generated by joining the ORs of two k-patterns. This ensures that only those (k + 1)-patterns that occur in at least one sequence will be generated. In addition, when joining ORs, the support of each (k + 1)-pattern is counted at the same time. The purpose of the pruning phase is to avoid support computations for unnecessary candidate patterns. Since the pattern supports are counted in the OR joining process, there is no need to add a pruning phase to HTPM.

The usage of the OR in HTPM is similar to that of the id-list in SPADE. However, the two serve different purposes. SPADE uses the id-list for support counting, while in HTPM the OR is used not only for support counting but also for pattern generation. Note that SPADE generates patterns based on the lattice structure. The candidate k-patterns are generated by extending the (k - 1)-patterns. In the sequential pattern mining problem, joining two (k - 1)-patterns will generate at most three candidates. For example, joining ⟨b, a, a⟩ and ⟨b, a, f⟩ will generate three candidates: ⟨b, a, (a, f)⟩, ⟨b, a, a, f⟩, and ⟨b, a, f, a⟩. However, in hybrid temporal patterns, we could obtain thirteen candidates by joining two (k - 1)-patterns. For example, joining (a+ < a- < b+ < b-) and (a+ < a- < d+ < d-) will generate 13 3-event candidate patterns, e.g., (a+ < a- < d+ < d- < b+ < b-), (a+ < a- < d+ < b+ = d- < b-), (a+ < a- < d+ < b+ < d- < b-), (a+ < a- < d+ < b+ < b- = d-), (a+ < a- < d+ < b+ < b- < d-), and (a+ < a- < b+ = d+ < d- < b-). To avoid generating this large number of candidates, HTPM generates candidate patterns by joining ORs instead of joining patterns.

The pattern generation process in HTPM differs greatly from the one adopted in PrefixSpan and TPrefixSpan. HTPM generates patterns by joining the ORs of (k - 1)-patterns, while PrefixSpan and TPrefixSpan generate patterns by appending the large 1-patterns of the projected database to (k - 1)-patterns. In PrefixSpan, pattern generation is efficient, because there are only two possibilities for appending a large 1-pattern to a (k - 1)-pattern: equal and smaller. For example, appending event d to pattern ⟨a, b, c⟩ gives only two possible results, ⟨a, b, c, d⟩ and ⟨a, b, (c, d)⟩. However, in TPrefixSpan, the pattern generation process is not so easy. For example, appending event (d+ < d-) to pattern (a+ < b+ < a- < b-), we obtain 13 candidate k-patterns. Appending event (d+ < d-) to pattern (a+ < b+ < c+ < a- < b- < c-), we obtain 25 candidate k-patterns. To improve the pattern generation performance, we do not adopt the PrefixSpan-based method to generate patterns in HTPM. Instead, we generate patterns by joining the ORs of the (k - 1)-patterns. With this approach, HTPM generates only the promising candidate patterns, which actually occur in the sequence database.

5. Experiments

In this section, we discuss a two-part experiment carried out to illustrate the performance of HTPM and to verify the effectiveness of the discovered hybrid temporal patterns. In the first part, we compare the execution time and scalability of HTPM with those of PrefixSpan [31], TPrefixSpan [38], and GSP [34]. In the second part, we apply HTPM to a real case scenario to evaluate the effectiveness of hybrid temporal pattern mining.

5.1. Performance evaluation

To compare the performance of HTPM with existing methods for mining sequential patterns (point-based: PrefixSpan and GSP) and temporal patterns (interval-based: TPrefixSpan), all algorithms were implemented in Java and tested on a Pentium IV 3.0 GHz Windows XP system with 2 GB of main memory and JVM (J2RE 1.4.2) as the Java execution environment.

The performance evaluation experiments were conducted using synthetic data sets, which were generated separately for sequential pattern mining methods and temporal pattern mining methods. Note that since the two kinds of mining methods have different purposes, the data sets they use are also different. In the first performance evaluation experiment, we compared the execution time and scalability of HTPM with those of PrefixSpan and GSP. Then, we compared the execution time and scalability of HTPM with those of TPrefixSpan (which is based on PrefixSpan) in discovering temporal patterns. Since there are no non-constraint-based hybrid temporal pattern mining methods, we could not compare HTPM's performance with other algorithms for hybrid event sequences.


5.1.1. Data generation

The synthetic data sets in the experiments were generated using the synthetic data generator designed by Agrawal and Srikant in 1995 [2]. Since the original data generation program was designed to generate point-based event sequences, the data generator could be used directly for the first part of the performance evaluation experiment; however, for the second part, the data generator had to be modified to generate interval-based event sequences.

Table 7 lists the parameters used in the simulation. All are classical parameters used in previous studies [2,42]; therefore, they can be directly utilized in the first part of the experiment, to discover sequential patterns from point-based event sequences. In the second part, however, which discovers temporal patterns from interval-based event sequences, the data generator needs some modifications. We modified the data generator as follows:

1. Two parameters $|C|$ and $|T|$ are multiplied together. Let CT denote the resulting value. Then, the length of each sequence, i.e., the number of events in a sequence, is determined by drawing a value from a Poisson distribution with mean CT.

2. All events are classified into three categories: long-interval events with average length 12; medium-interval events with average length 8; and short-interval events with average length 4. We set these parameters according to the heuristic rule that a long event is three times longer than a short event, and a short event is only half the length of a medium event. For each event, we first randomly determine its type, and then determine its length by drawing a value from a normal distribution.

The starting time of each event relative to the starting point of its immediately preceding event is determined by randomly selecting a value from {0, 2, 4, 6}. When the value is 0, the current event and the preceding event begin at the same time. When the value is 4, the current event occurs 4 time units after the preceding event. The data generation method adopted for the interval-based event sequence database is the same as that in [38].
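The modified generation steps can be sketched for a single sequence as follows. This is not the generator used in the experiments: the standard deviation of the length distribution, the uniform choice of category and event name, the number of event types, and the minimum length of 1 are all assumptions made here for illustration, and the potentially large sequences of the original generator are ignored.

```python
import math
import random

CATEGORY_MEAN = {"long": 12, "medium": 8, "short": 4}  # average interval lengths
START_OFFSETS = (0, 2, 4, 6)  # offset from the start of the preceding event


def poisson(rng, mean):
    """Draw a Poisson-distributed integer with Knuth's multiplication method."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_sequence(rng, c, t, num_event_types=10, length_sd=1.0):
    """Generate one interval-based event sequence as (name, start, end) triples.

    num_event_types and length_sd are hypothetical parameters of this sketch.
    """
    n_events = poisson(rng, c * t)  # sequence length ~ Poisson(CT)
    events, start = [], 0
    for i in range(n_events):
        if i > 0:
            start += rng.choice(START_OFFSETS)
        category = rng.choice(list(CATEGORY_MEAN))
        # Assumption: lengths are rounded and clipped so every interval is >= 1.
        length = max(1, round(rng.gauss(CATEGORY_MEAN[category], length_sd)))
        name = f"e{rng.randrange(num_event_types)}"
        events.append((name, start, start + length))
    return events


rng = random.Random(0)
sequence = generate_sequence(rng, c=10, t=2.5)
print(len(sequence), sequence[:3])
```

Seeding `random.Random` makes a run repeatable, which is convenient when the same synthetic data set has to be fed to several algorithms.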

5.1.2. Discovering patterns from point-based event sequences

It is interesting to investigate the efficiency of HTPM compared to the existing point-based methods in discovering patterns from sequences of point-based events. GSP, SPADE, and PrefixSpan have already been compared in [32]. The results indicate that SPADE performs better than GSP but not as well as PrefixSpan. Thus, in our comparison we utilized only PrefixSpan, the most efficient sequential pattern mining algorithm, and GSP, the classical algorithm for discovering sequential patterns. In addition, we used the generator proposed by Agrawal and Srikant to generate synthetic data sets. First, we examined the execution times of the above-mentioned algorithms, and then tested their scalabilities.

Some parameters in the execution time experiments were fixed: $|T| = 2.5$, NS = 5000, NI = 25,000, N = 10,000, and $|D| = 200{,}000$. The other three parameters were set as in the four configurations shown in Table 8. The minimum support thresholds in the first two configurations varied from 0.01 to 0.025. In C10-S8-I1.25, however, the minimum support threshold varied from 0.005 to 0.02, because no frequent sequence exists if the minimum support threshold is larger than 0.025 in this configuration. In C20-S4-I1.25, the minimum support threshold varied from 0.02 to 0.035, because with long sequence data, too many patterns are generated with a small minimum support. The results are summarized in Fig. 7.

We see that HTPM is almost as efficient as PrefixSpan for discovering patterns from point-based event sequences. GSP, however, has the worst performance, because it has to scan the database many times to generate frequent patterns from candidate patterns. HTPM is more efficient than GSP because it needs only one database scan. Frequent patterns are generated by joining the ORs of (k - 1)-event hybrid temporal patterns. Since the HTPM and PrefixSpan lines shown in Fig. 7 are so close, we further provide their run time data; see Table 9. Although most of the HTPM run times are longer than those of PrefixSpan, the differences are small. PrefixSpan outperforms HTPM in terms of run time because of the difference in the methods of pattern generation, as discussed in Section 4.3. In HTPM, patterns are generated by joining ORs, which is appropriate for handling the complicated relations of events in hybrid event sequences. In PrefixSpan, on the other hand, pattern extension is adopted to generate patterns. The results show that OR joining (in HTPM) is not as efficient as pattern extension (in PrefixSpan) for pattern generation in sequences of pure point-based events, especially when the sequence length or pattern length is long. Fortunately, the difference in run time between HTPM and PrefixSpan is not too large. HTPM's performance is satisfactory in terms of discovering traditional point-based sequential patterns.

Table 7
Parameters of the synthetic data generator.

| Parameter | Description |
|---|---|
| $\lvert D \rvert$ | Number of customers |
| $\lvert C \rvert$ | Average number of transactions per customer |
| $\lvert T \rvert$ | Average number of items per transaction |
| $\lvert S \rvert$ | Average length of maximal potentially large sequences |
| $\lvert I \rvert$ | Average size of itemsets in maximal potentially large sequences |
| NS | Number of maximal potentially large sequences |
| NI | Number of maximal potentially large itemsets |
| N | Number of items |


Table 8
Configurations of synthetic data sets.

Name            |C|   |S|   |I|
C10-S4-I1.25    10    4     1.25
C10-S4-I2.5     10    4     2.5
C10-S8-I1.25    10    8     1.25
C20-S4-I1.25    20    4     1.25

[Figure 7: four panels, one per configuration (C10-S4-I1.25, C10-S4-I2.5, C10-S8-I1.25, C20-S4-I1.25), each plotting execution time (sec.) against min_sup for HTPM, PrefixSpan, and GSP.]

Fig. 7. Comparison of the execution times of HTPM, PrefixSpan, and GSP.

Table 9
Execution time comparison of HTPM and PrefixSpan (sec.).

(a) C10-S4-I1.25
min_sup       0.01    0.015   0.02    0.025
HTPM          47      36      34      33
PrefixSpan    50      29      24      23

(b) C10-S4-I2.5
min_sup       0.01    0.015   0.02    0.025
HTPM          37      33      33      33
PrefixSpan    35      23      23      23

(c) C10-S8-I1.25
min_sup       0.005   0.01    0.015   0.02
HTPM          566     39      36      36
PrefixSpan    136     25      25      25

(d) C20-S4-I1.25
min_sup       0.02    0.025   0.03    0.035
HTPM          133     102     92      83
PrefixSpan    119     91      80      72
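The size of the gap between the two algorithms in Table 9 can be expressed as a ratio of run times. The following Python sketch is only an illustration of that arithmetic: the numbers are transcribed from Table 9, while the names `TABLE_9` and `runtime_ratios` are hypothetical and do not come from the original experiments.

```python
# Run times (sec.) transcribed from Table 9; inner keys are min_sup values.
TABLE_9 = {
    "C10-S4-I1.25": {
        "HTPM": {0.01: 47, 0.015: 36, 0.02: 34, 0.025: 33},
        "PrefixSpan": {0.01: 50, 0.015: 29, 0.02: 24, 0.025: 23},
    },
    "C10-S4-I2.5": {
        "HTPM": {0.01: 37, 0.015: 33, 0.02: 33, 0.025: 33},
        "PrefixSpan": {0.01: 35, 0.015: 23, 0.02: 23, 0.025: 23},
    },
    "C10-S8-I1.25": {
        "HTPM": {0.005: 566, 0.01: 39, 0.015: 36, 0.02: 36},
        "PrefixSpan": {0.005: 136, 0.01: 25, 0.015: 25, 0.02: 25},
    },
    "C20-S4-I1.25": {
        "HTPM": {0.02: 133, 0.025: 102, 0.03: 92, 0.035: 83},
        "PrefixSpan": {0.02: 119, 0.025: 91, 0.03: 80, 0.035: 72},
    },
}


def runtime_ratios(table):
    """Return {config: {min_sup: HTPM time / PrefixSpan time}}.

    A ratio below 1 means HTPM was faster at that threshold.
    """
    return {
        config: {
            min_sup: times["HTPM"][min_sup] / times["PrefixSpan"][min_sup]
            for min_sup in times["HTPM"]
        }
        for config, times in table.items()
    }


if __name__ == "__main__":
    for config, ratios in runtime_ratios(TABLE_9).items():
        line = ", ".join(f"{ms}: {r:.2f}" for ms, r in ratios.items())
        print(f"{config}: {line}")
```

A ratio is used rather than an absolute difference so that configurations with very different run times can be read on the same scale.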

S.-Y. Wu, Y.-L. Chen / Data & Knowledge Engineering 68 (2009) 1309–1330 1323

The second part of this experiment was an examination of HTPM's scalability on sequences of point-based events. Again, we used PrefixSpan and GSP for comparison. This examination itself had two parts: the first concerned the number of customers |D|, and the second concerned the average length of sequences per customer |C|. In scaling up |D|, we used the C10-S4-I1.25 setting and |D| ranged from 200,000 to 500,000. In scaling up |C|, we set the parameters as follows: |D| = 200,000, |S| = 4, |I| = 1.25, and varied the value of |C| from 10 to 25. The experimental results are shown in Fig. 8. Each of the six lines displayed in each chart in Fig. 8 corresponds to one combination of algorithm (HTPM, PrefixSpan, or GSP) and minimum support threshold. The results on the left indicate that the execution times of all the algorithms increase linearly with |D|. Furthermore, HTPM's scalability is as good as that of PrefixSpan,


[Figure 8: two panels. Left, "Scalability of |D|": execution time (sec.) against |D| from 200,000 to 500,000 for HTPM, PrefixSpan, and GSP at min_sup = 0.01 and 0.02. Right, "Scalability of |C|": execution time (sec.) against |C| from 10 to 25 for HTPM, PrefixSpan, and GSP at min_sup = 0.02 and 0.03.]

Fig. 8. Comparison of the scalabilities of HTPM, PrefixSpan, and GSP for |D| and |C|.

Table 10
Scalability comparison of HTPM and PrefixSpan (sec.).

(a) Scalability of |D|
|D|               200,000   300,000   400,000   500,000
HTPM-0.01         47        66        85        112
PrefixSpan-0.01   50        86        108       135
HTPM-0.02         34        50        64        95
PrefixSpan-0.02   24        46        61        74

(b) Scalability of |C|
|C|               10        15        20        25
HTPM-0.02         34        71        133       269
PrefixSpan-0.02   24        60        119       213
HTPM-0.03         33        62        92        137
PrefixSpan-0.03   28        44        80        114


since the lines for them on the left-hand side of Fig. 8 are almost horizontal. On the right-hand side, we see that the execution times of HTPM and PrefixSpan grow linearly with |C|; their lines fall almost on the horizontal x-axis. GSP is the least efficient, because multiple database scans are needed and too many candidate patterns are generated, especially when min_sup is low, when the number of sequences is large, or when the sequences are long. Again, the run time data are provided to show the scalability comparison of HTPM and PrefixSpan more clearly; see Table 10. The execution times of HTPM are quite close to those of PrefixSpan. In almost all configurations, PrefixSpan outperforms HTPM; the exception is the scalability comparison of |D| with min_sup = 0.01. This phenomenon arises from the difference between the HTPM and PrefixSpan pattern generation methods, as discussed in Section 4.3. The OR joining method makes HTPM more sensitive to sequence length (or pattern length) than PrefixSpan. In contrast, the projection-based method of PrefixSpan is more sensitive to minimum support than HTPM. That is why PrefixSpan does not always outperform HTPM, and vice versa. HTPM's pattern generating performance was worse than that of PrefixSpan because it needed more effort to join ORs, especially when the database size was large or the sequence length was long. However, the scalabilities of HTPM for |D| and |C| are quite satisfactory, as indicated by the flatness of the slopes of these lines in Fig. 8.
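One way to read the growth with |D| in Table 10(a) is to fit a straight line to each row and compare slopes. The sketch below does this with an ordinary least-squares fit in plain Python. It is an illustration only: the run times are transcribed from Table 10(a), and the names `least_squares_slope`, `D_UNITS`, `TABLE_10A`, and `SLOPES` are hypothetical.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# |D| in units of 100,000 customers, i.e. 200,000 to 500,000.
D_UNITS = [2, 3, 4, 5]

# Run times (sec.) transcribed from Table 10(a).
TABLE_10A = {
    "HTPM-0.01": [47, 66, 85, 112],
    "PrefixSpan-0.01": [50, 86, 108, 135],
    "HTPM-0.02": [34, 50, 64, 95],
    "PrefixSpan-0.02": [24, 46, 61, 74],
}

# Extra seconds of run time per additional 100,000 customers.
SLOPES = {
    name: least_squares_slope(D_UNITS, times)
    for name, times in TABLE_10A.items()
}

if __name__ == "__main__":
    for name, slope in SLOPES.items():
        print(f"{name}: {slope:.1f} sec. per 100,000 customers")
```

Under this fit, HTPM-0.01 has the smaller slope of the two min_sup = 0.01 rows, which agrees with the exception noted above; at min_sup = 0.02 the order is reversed.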

5.1.3. Discovering patterns from interval-based event sequences

Besides point-based event sequences, HTPM can also discover frequent patterns from sequences of interval-based events. We compare this process with that of a temporal pattern mining algorithm, TPrefixSpan [38]. The execution times and scalabilities of these two algorithms are compared. Since the input sequences needed for temporal pattern mining differ from those for sequential pattern mining, we cannot directly use the original data generators. Thus, we generated synthetic data sets by modifying Agrawal and Srikant's data generator according to the steps mentioned in Section 5.1.1. Similar to the experiments performed in Section 5.1.2, we first compared the execution times of HTPM and TPrefixSpan, and then their scalabilities. The data generator's parameters (in both parts of the experiment) are the same as those set in Section 5.1.2. The experimental results are shown in Figs. 9 and 10.

From Fig. 9, we see that HTPM performed better than TPrefixSpan, especially when min_sup was very low. Although both methods generated frequent k-patterns from frequent patterns with length (k − 1), they differed when growing patterns. HTPM generates frequent k-patterns directly by joining the occurrence records of L_{k−1}, which store the patterns' occurrence times in each sequence. On the other hand, TPrefixSpan first generates candidate 1-event patterns from the projected database, then combines these 1-event patterns with the prefix (k − 1)-event patterns to generate candidate k-event patterns, and finally scans the projected database to determine the supports of these candidate patterns. Since HTPM does not need to generate a large candidate set in each phase, it performs better than TPrefixSpan.


[Figure 9: four panels, one per configuration (C10-S4-I1.25, C10-S4-I2.5, C10-S8-I1.25, C20-S4-I1.25), each plotting execution time (sec.) against min_sup for HTPM and TPrefixSpan.]

Fig. 9. Comparison of the execution times of HTPM and TPrefixSpan.

[Figure 10: two panels. Left, "Scalability of |D|": execution time (sec.) against |D| from 200,000 to 500,000 for HTPM and TPrefixSpan at min_sup = 0.01 and 0.02. Right, "Scalability of |C|": execution time (sec.) against |C| from 10 to 25 for HTPM and TPrefixSpan at min_sup = 0.02 and 0.03.]

Fig. 10. Comparison of the scalabilities of HTPM and TPrefixSpan for |D| and |C|.


From the left-hand side of Fig. 10, we see that the scalabilities of HTPM and TPrefixSpan for customer number |D| are quite satisfactory, because each line in this figure increases linearly with |D|. More precisely, the scalability of HTPM is better than that of TPrefixSpan in most configuration settings, since the slope of its execution time line is smaller than that of TPrefixSpan, especially when min_sup is low. The experimental results of scaling up sequence length |C| are shown on the right-hand side of Fig. 10. We see that the execution time of HTPM increases linearly with |C|, while that of TPrefixSpan increases exponentially. This result may be due to the difference in growing patterns mentioned above. Since TPrefixSpan has to maintain the pair-wise relations of every interval-based event in the sequences in every projected database, the database projection step takes much longer, especially when patterns are long. In contrast, HTPM generates patterns by maintaining the ORs of the patterns, so its performance is less related to pattern length. This is why the run time performance of TPrefixSpan-0.02 is slightly better than that of HTPM-0.02 in terms of the scalability of |D|. When min_sup is high, patterns are few and short, and TPrefixSpan can perform well. Once min_sup is set lower, patterns become many and long, and the performance of TPrefixSpan becomes inferior to that of HTPM. To summarize, HTPM outperforms TPrefixSpan, especially when min_sup is low.

5.2. Real case analyses

Now that we know HTPM's efficiency is satisfactory, we further investigate its effectiveness on real financial data sets. Since the rise and fall of stock prices are essentially interval-based events, and stock dividends and splits are point-based events, stock price data sequences are hybrid event sequences. For a stock price mining scenario, we compare the mining effectiveness of hybrid temporal patterns, temporal patterns, and sequential patterns using stock price data downloaded from Yahoo-Finance (http://www.finance.yahoo.com/).

Mr. King, an imagined investor, is interested in investigating frequent patterns for the following six stocks: Apple Inc. (AAPL); Adobe Systems Inc. (ADBE); International Business Machines Corp. (IBM); Microsoft Corp. (MSFT); NEC Corp. (NIPNY); and the Nasdaq Composite Index (NASDAQ). Mr. King wants to know when a stock price will continue to increase, when it will continue to decrease, when it will fluctuate rapidly, and when a company will declare a dividend or split. The first three kinds of events are interval-based and the last is point-based. We transformed the raw data for this case according to the stock event types listed in Table 11. There are 23 event types, all listed in Table 12.

We drew four attributes, firm, period, day, and price, from the raw data downloaded from Yahoo-Finance (http://finance.yahoo.com/). Since the original price data for each stock were a long series, we transformed them into four columns to discover hybrid temporal patterns: PID (period ID); EID (event ID); ts (starting time)/tp (occurrence time of a point-based event); and te (ending time), which conforms to HTPM's input data scheme. The stock price data we gathered were divided into two sets: the training set, from January 1, 1990 to December 31, 2005; and the testing set, from January 1, 2006 to January 29, 2007. Both data sets were preprocessed into the aforementioned four columns. Using two different period lengths, month and season, we cut the original data sequence in each data set (training and testing) into two sets of shorter sequences, to obtain four preprocessed data sets: DS^m_train, DS^m_test, DS^s_train, and DS^s_test. Since we needed to compare the effectiveness of hybrid temporal patterns, temporal patterns, and sequential patterns, the preprocessor further transformed the four data sets into twelve sets: hDS^m_train, hDS^m_test, hDS^s_train, hDS^s_test; iDS^m_train, iDS^m_test, iDS^s_train, iDS^s_test; pDS^m_train, pDS^m_test, pDS^s_train, and pDS^s_test. The first four data sets consist of all 23 event types; see Table 12. The middle four data sets consist of 18 interval-based event types, of kinds ↑↑, ↓↓, and ↑↓. The last four data sets consist of only 6 point-based event types (stock dividend or split).
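The transformation from a daily price series into (PID, EID, ts, te) rows can be sketched as follows. This is a minimal illustration, not the preprocessor used in the experiments: it covers only the rising and falling interval events, it reads "for at least five days" as at least five consecutive day-over-day changes in the same direction, and the function names `monotone_runs` and `to_event_rows` are hypothetical. Fluctuation events and dividend or split events are omitted.

```python
from typing import List, Tuple

Row = Tuple[int, int, int, int]  # (PID, EID, ts, te)


def monotone_runs(prices: List[float], min_days: int = 5) -> List[Tuple[str, int, int]]:
    """Find maximal runs of consecutive rises or falls.

    Returns (direction, start_index, end_index) triples, where direction is
    "up" or "down" and the run spans at least `min_days` day-over-day changes
    (an assumed reading of "at least five days").
    """
    runs = []
    start = 0          # index where the current run began
    direction = None   # "up", "down", or None while no run is open
    for i in range(1, len(prices) + 1):
        if i < len(prices):
            if prices[i] > prices[i - 1]:
                step = "up"
            elif prices[i] < prices[i - 1]:
                step = "down"
            else:
                step = None      # a flat day breaks any open run
        else:
            step = "end"         # sentinel that closes the final run
        if step != direction:
            if direction is not None and (i - 1) - start >= min_days:
                runs.append((direction, start, i - 1))
            start = i - 1
            direction = step if step in ("up", "down") else None
    return runs


def to_event_rows(pid: int, prices: List[float], days: List[int],
                  eid_up: int, eid_down: int, min_days: int = 5) -> List[Row]:
    """Turn the runs of one period into (PID, EID, ts, te) rows."""
    rows = []
    for direction, s, e in monotone_runs(prices, min_days):
        eid = eid_up if direction == "up" else eid_down
        rows.append((pid, eid, days[s], days[e]))
    return rows


if __name__ == "__main__":
    # Hypothetical prices for one period; EIDs 5 and 6 are taken from Table 12.
    prices = [1, 2, 3, 4, 5, 6, 5, 4]
    days = list(range(10, 18))
    print(to_event_rows(pid=1, prices=prices, days=days, eid_up=5, eid_down=6))
```

A stricter or looser reading of the five-day rule only changes the `min_days` comparison; the row layout stays the same.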

Let R_Train be the set of patterns discovered from the training set, R_Test be the set of all patterns occurring at least once in the testing set, Support_Test(p) be the support of p in the testing set, and Prefix(p) be the prefix (k − 1)-pattern of p, where k is the length of p. Then, we define the average predicting power (Avg_PP) of R_Train as in Eq. (2). The patterns in R_Train can be grouped into x sets, G_1, G_2, ..., G_x, according to the last event of each pattern. In this case, x is 23, since there are 23 event types in each data set. Avg_PP(R_Train) is computed by averaging the predictive accuracy of each group, Avg_PA(G_i). In Eq. (2), PA(p) is the predictability of a pattern p ∈ R_Test: the higher the value of PA(p), the more likely it is that we can predict the final outcome based on the (k − 1)-prefix of p. Avg_PA(G_i) is computed by averaging PA(p) over each p in G_i. In conclusion, if a set of patterns has a higher Avg_PP than another set, this pattern set provides a higher predicting power.

\[
\begin{aligned}
\mathrm{Avg\_PP}(R_{Train}) &= \frac{\sum_{G_i \in \{G_1,\ldots,G_x\}} \mathrm{Avg\_PA}(G_i)}{x},\\[4pt]
\mathrm{Avg\_PA}(G_i) &=
\begin{cases}
\dfrac{\sum_{p \in G_i} \mathrm{PA}(p)}{|G_i|}, & \text{if } |G_i| > 0,\\[6pt]
0, & \text{if } |G_i| = 0,
\end{cases}\\[4pt]
\mathrm{PA}(p) &=
\begin{cases}
\dfrac{\mathrm{Support}_{Test}(p)}{\mathrm{Support}_{Test}(\mathrm{Prefix}(p))}, & \text{if } p \in R_{Test},\\[6pt]
0, & \text{if } p \notin R_{Test}.
\end{cases}
\end{aligned}
\tag{2}
\]

Table 11
Stock event types in the real case.

Stock event type   Description
↑↑                 The stock price increases for at least five days
↓↓                 The stock price decreases for at least five days
↑↓                 The stock price increases and decreases in turn at least three times
                   Stock dividend or split

Table 12
Twenty-three event types in the real case.

EID   Event type   EID   Event type   EID   Event type
1     NASDAQ↑↑     9     ADBE↑↑       17    MSFT↑↑
2     NASDAQ↓↓     10    ADBE↓↓       18    MSFT↓↓
3     NASDAQ↑↓     11    ADBE↑↓       19    MSFT↑↓
4     –            12    ADBE         20    MSFT
5     AAPL↑↑       13    IBM↑↑        21    NIPNY↑↑
6     AAPL↓↓       14    IBM↓↓        22    NIPNY↓↓
7     AAPL↑↓       15    IBM↑↓        23    NIPNY↑↓
8     AAPL         16    IBM          24    NIPNY
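The computation in Eq. (2) can be sketched in a few lines of Python. The sketch makes a strong simplifying assumption that should be kept in mind: each pattern is represented as a plain tuple of event labels whose prefix is the tuple without its last element, whereas real hybrid temporal patterns also carry endpoint order relations. The function name `avg_pp` and its argument names are hypothetical.

```python
from collections import defaultdict


def avg_pp(train_patterns, support_test, event_types):
    """Average predicting power of a pattern set, following Eq. (2).

    train_patterns: patterns from R_Train, each a tuple of length >= 2
                    (simplified representation, see the assumption above).
    support_test:   dict mapping a pattern tuple to its support in the
                    testing set; a missing key means the pattern is not
                    in R_Test.
    event_types:    all x event types used to form the groups G_1..G_x.
    """
    # Group the training patterns by their last event.
    groups = defaultdict(list)
    for p in train_patterns:
        groups[p[-1]].append(p)

    def pa(p):
        # PA(p) = Support_Test(p) / Support_Test(Prefix(p)), or 0 if p is
        # absent from the testing set.
        sup_p = support_test.get(p, 0)
        sup_prefix = support_test.get(p[:-1], 0)
        if sup_p == 0 or sup_prefix == 0:
            return 0.0
        return sup_p / sup_prefix

    def avg_pa(group):
        # Empty groups contribute 0, as in the second case of Avg_PA.
        return sum(pa(p) for p in group) / len(group) if group else 0.0

    return sum(avg_pa(groups.get(e, [])) for e in event_types) / len(event_types)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the stock experiment.
    train = [("A", "B"), ("C", "B"), ("A", "C")]
    test_support = {("A",): 0.5, ("A", "B"): 0.25, ("C",): 0.4, ("A", "C"): 0.1}
    print(avg_pp(train, test_support, ["A", "B", "C"]))
```

Because the average runs over all x event types, an event type that no pattern predicts lowers Avg_PP even when the remaining groups are accurate.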


Table 13
Predicting powers and pattern counts of hybrid and temporal patterns with period length = month (min_sup = 0.04).

            Hybrid temporal patterns                Temporal patterns
Min_conf    Predicting power (%)   Pattern counts   Predicting power (%)   Pattern counts

(a) L2-patterns' predicting power
0           17.99                  897              14.91                  790
0.2         29.36                  418              23.80                  393
0.4         42.82                  115              33.13                  100

(b) L3-patterns' predicting power
0           10.36                  725              9.33                   618
0.2         36.40                  196              33.67                  168
0.4         51.37                  111              46.14                  87


Another index, confidence, is introduced to remove unimportant patterns; it is defined in Eq. (3). Patterns with confidences less than min_conf are removed. Basically, the higher the value of min_conf, the higher the predictive accuracy and the fewer patterns we will obtain. The predicting power of a pattern set, however, is not necessarily positively related to min_conf and min_sup. If these two thresholds are set too high, some event types will not be predicted. This results in some Avg_PA(G_i) values of 0, and makes Avg_PP low.

\[
\mathrm{confidence}(p) = \frac{\mathrm{support}(p)}{\mathrm{support}(\mathrm{Prefix}(p))}
\tag{3}
\]

We applied the confidence threshold to all six training data sets. By adjusting min_conf for these training sets, we observed how this factor influences the performance of the three kinds of data sets. Since only patterns with a length of at least 2 can be used to predict, we used only those patterns for comparison. The results are shown in Tables 13 and 14 and Fig. 11. Since the point-based events were too few and infrequent (at most twice a year), the patterns discovered from pDS^m_train and pDS^s_train were either too short (L1 only) or too few (only 1 pattern could make a prediction). Thus, Tables 13 and 14 and Fig. 11 show only the comparison results for hybrid temporal patterns and temporal patterns.

Table 14
Predicting powers and pattern counts of hybrid and temporal patterns with period length = season (min_sup = 0.6).

            Hybrid temporal patterns                Temporal patterns
Min_conf    Predicting power (%)   Pattern counts   Predicting power (%)   Pattern counts

(a) L2-patterns' predicting power
0           57.34                  381              54.48                  326
0.7         64.90                  206              61.52                  186

(b) L3-patterns' predicting power
0           51.78                  1700             50.39                  1489
0.7         72.23                  691              71.45                  579
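The confidence filter of Eq. (3) can be sketched as below. As an assumption made only for illustration, patterns are again plain tuples of event labels whose prefix drops the last element, and `filter_by_confidence` is a hypothetical name rather than part of HTPM.

```python
def filter_by_confidence(support_train, min_conf):
    """Keep patterns of length >= 2 whose confidence is at least min_conf.

    support_train: dict mapping a pattern tuple to its support in the
                   training set (simplified pattern representation).
    Returns a dict mapping each kept pattern to its confidence.
    """
    kept = {}
    for pattern, support in support_train.items():
        if len(pattern) < 2:
            continue  # 1-patterns have no prefix and cannot predict
        prefix_support = support_train.get(pattern[:-1])
        if not prefix_support:
            continue  # prefix unknown or zero: confidence is undefined
        confidence = support / prefix_support
        if confidence >= min_conf:
            kept[pattern] = confidence
    return kept


if __name__ == "__main__":
    # Hypothetical supports, not taken from the stock experiment.
    supports = {("A",): 0.5, ("A", "B"): 0.25, ("A", "C"): 0.05}
    print(filter_by_confidence(supports, min_conf=0.2))
```

Raising `min_conf` shrinks the returned set, which mirrors the falling pattern counts in Tables 13 and 14.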

[Figure 11: four panels plotting predicting power (%) against min_conf for hybrid patterns and temporal patterns: period = month, length = 2, min_sup = 0.04; period = month, length = 3, min_sup = 0.04; period = season, length = 2, min_sup = 0.6; period = season, length = 3, min_sup = 0.6.]

Fig. 11. Comparison of the predicting powers of hybrid temporal patterns and temporal patterns.


From Tables 13 and 14 and Fig. 11, we see that the predicting powers of hybrid temporal patterns are better than those of temporal patterns. The improvement, however, does not seem very significant. This may be because there were too few point-based events in this real data set. We believe that hybrid temporal patterns would outperform temporal patterns to an even greater degree if the data set contained more point-based events.

6. Conclusion

The existing sequential pattern mining algorithms are designed for mining either pure point-based or pure interval-based event sequences. In many applications, however, events are hybrids, meaning the sequences consist of both kinds of events. Therefore, we develop a new algorithm, HTPM, for discovering hybrid temporal patterns from hybrid event sequences. To verify the efficiency and effectiveness of this algorithm, we perform experiments not only on synthetic data but also on real data. The experimental results show that the performance of our algorithm is highly satisfactory, and the patterns discovered in real cases show that the predicting power of the hybrid temporal patterns is better than that of previous patterns.

This work can be extended in several ways. First, it has been shown in previous work [44] that the majority of the patterns found by mining methods are redundant, and it is well recognized that the main factor hindering the application of data mining is the huge number of patterns returned by the mining process. Therefore, it might prove beneficial to study how to remove redundant hybrid temporal patterns so that the information could be more compact. To this end, we can keep only the maximal hybrid temporal patterns, only the closed hybrid temporal patterns, or only the top-k hybrid temporal patterns. Second, the problem of mining hybrid temporal patterns can be extended to other types of computational models, such as incremental, parallel, distributed, and data stream models. Finally, a hierarchy of time intervals can be developed by combining a set of smaller intervals (low level points) into a larger interval (a high level point). Using such an extension, multiple-level hybrid patterns can be discovered from hybrid event data.

Appendix A. Running HTPM in D

L1, as shown in Table 6, is obtained by scanning D (Table 2) once. From L1, GenLk joins all pairs of patterns to get L2, as listed in Table 15. From the joined occurrence time values, we know the arrangement of event points and order relations ("<" or "="). From L2, GenLk joins the pairs with the same prefix 1-event subpattern. For example, in L2 of Table 15, only the following pairs should be joined: ((a+0 < a+1 < a-0 < a-1), (a+0 < a+1 < a-0 < a-1)), ((a+0 < a+1 < a-0 < a-1), (a+ = c < a-)), ((a+ = c < a-), (a+ = c < a-)), ((b+ < a+ < a- = b-), (b+ < a+ < a- = b-)), ((b+ < a+ < a- = b-), (b+ = c < b-)), ((b+ < a+ < a- = b-), (b+ < c < b-)), ((b+ = c < b-), (b+ = c < b-)), ((b+ = c < b-), (b+ < c < b-)), ((b+ < c < b-), (b+ < c < b-)), ((c < a+ < a-), (c < a+ < a-)), ((c < a+ < a-), (c < c)), and ((c < c), (c < c)). In each pair, the two patterns share the same prefix 1-event subpattern (a, b, or c). The results from joining the above-mentioned pairs are as follows: (b+ = c < a+ < a- = b-), (b+ < a+ = c < a- = b-), (c < a+ = c < a-), and (b+ = c < c < b-), as shown in Fig. 4. From L3, GenLk joins only the pair (b+ = c < a+ < a- = b-) and (b+ = c < c < b-), and then obtains L4 = {(b+ = c < a+ = c < a- = b-)}, as shown in the bottom table of Fig. 4. Since there is only one pattern in L4, GenLk cannot join any more pairs, and the process stops. All the mining results are shown in Fig. 6.

Table 15
L2 discovered from D.

L2                     Index   Occurrence
(a+0<a+1<a-0<a-1)      s1      (5, 8, 10, 12)
                       s3      (4, 9, 10, 12)
(b+<a+<a-=b-)          s1      (6, 8, 12, 12)
                       s2      (6, 8, 11, 11)
                       s3      (4, 9, 12, 12)
(a+=c<a-)              s1      (8, 8, 12)
                       s2      (8, 8, 11)
                       s3      (4, 4, 10)
(c<a+<a-)              s1      (6, 8, 10)
                       s2      (6, 8, 11)
                       s3      (4, 9, 12)
(b+=c<b-)              s1      (6, 6, 12)
                       s2      (6, 6, 11)
                       s3      (4, 4, 12)
(b+<c<b-)              s1      (6, 8, 12)
                       s2      (6, 8, 11)
(c<c)                  s1      (6, 8)
                       s2      (6, 8)


Appendix B. Time complexity of HTPM

The time complexity of the HTPM algorithm is O(n × s^K), where n denotes the number of sequences, s the length of the longest sequence, and K the length of the longest pattern. The following is an informal description of the time complexity analysis.

Basically, the algorithm contains two major parts. In the first part, the entire database is scanned once to find all 1-event hybrid patterns along with their ORs. In iteration k of the second part, where 1 ≤ k ≤ K, the patterns in L_k are generated by joining the ORs of (k − 1)-patterns in L_{k−1}. Since the algorithm mainly relies on the operation of joining the ORs of (k − 1)-patterns to generate k-patterns, the time needed by the algorithm can be estimated by the total number of occurrence records generated in the entire process.

In the first part, every sequence in the database must be scanned, where the database scanning time is O(n × s). Since in the ORs of L1 we need to record all the places where all frequent events occur, the total number of occurrence records in L1 is at most O(n × s). Therefore, the time needed to find L1 and build its ORs is O(n × s).

In iteration k in the second part, we build L_k by joining the ORs of L_{k−1}. In the joining process for the ORs of two patterns, we only join the ORs in the same sequence. In other words, if we want to join the ORs of two patterns (a+ < a-) and (c), only those ORs of (a+ < a-) and those ORs of (c) which are in the same sequences will be joined. Further, we also notice that when two (k − 1)-patterns are joined, the pattern length will increase by one. This implies that the ORs of k-patterns can be obtained by inserting the ORs of a 1-pattern into the ORs of (k − 1)-patterns.
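The per-sequence restriction on joining can be illustrated with the simplest case mentioned above, joining the ORs of (a+ < a-) with the ORs of (c). The Python sketch below is a simplified illustration and not the GenLk procedure itself: it handles only two 1-event patterns, it does not index repeated events of the same type, and its tie-breaking order for simultaneous endpoints (alphabetical by label) is an assumption. The names `arrangement` and `join_ors` and the toy occurrence data are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def arrangement(points):
    """Render labelled time points as an ordered pattern string.

    points: list of (label, time) pairs, e.g. [("a+", 8), ("a-", 12), ("c", 8)].
    Points with equal times are joined by "=", later times follow after "<".
    """
    ordered = sorted(points, key=lambda lp: (lp[1], lp[0]))
    groups = []
    for _, same_time in groupby(ordered, key=lambda lp: lp[1]):
        groups.append("=".join(label for label, _ in same_time))
    return "<".join(groups)


def join_ors(ors_left, labels_left, ors_right, labels_right):
    """Join occurrence records of two patterns, sequence by sequence.

    ors_*:    dict mapping a sequence id to a list of occurrence-time tuples.
    labels_*: endpoint labels matching the positions in each time tuple.
    Returns {pattern string: {sequence id: [joined time tuples]}}.
    """
    joined = defaultdict(lambda: defaultdict(list))
    # Only sequences that contain both patterns can contribute.
    for seq_id in ors_left.keys() & ors_right.keys():
        for left in ors_left[seq_id]:
            for right in ors_right[seq_id]:
                points = list(zip(labels_left, left)) + list(zip(labels_right, right))
                pattern = arrangement(points)
                joined[pattern][seq_id].append(tuple(sorted(left + right)))
    return joined


if __name__ == "__main__":
    # Hypothetical occurrence records: interval event a and point event c.
    ors_a = {"s1": [(8, 12)], "s2": [(8, 11)], "s3": [(9, 12)]}
    ors_c = {"s1": [(6,), (8,)], "s2": [(8,)]}
    result = join_ors(ors_a, ("a+", "a-"), ors_c, ("c",))
    for pattern, by_seq in result.items():
        print(pattern, dict(by_seq))
```

The number of distinct sequence ids under a pattern gives its support count, so no further database scan is needed to decide which joined patterns are frequent.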

After finding L1, the maximum number of ORs associated with a sequence is O(s), because the number of events in a sequence is no more than s. To generate L2, we have to join the ORs of L1. No matter how many patterns the ORs associated with a sequence are distributed over, the worst case computation occurs when all ORs belong to a single pattern. Therefore, this assumption is adopted in the following analysis. Accordingly, the total number of ORs for a sequence would be O(s^2), and the total number of ORs for this iteration is O(n × s^2). By repeating this reasoning for larger k, along with the above observation that the ORs of the k-patterns can be obtained by inserting the ORs of a 1-pattern into the ORs of the (k − 1)-patterns, we can conclude that in iteration k the total number of ORs for a sequence would be O(s^k) and the total number of ORs is O(n × s^k). Summing over all k, we finally obtain that the total number of ORs for the entire algorithm is O(n × s^K).
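The final summation step can be written out explicitly. The following short derivation is a sketch under the assumption that s ≥ 2; it only restates the per-iteration bounds above as a geometric series.

```latex
\[
\sum_{k=1}^{K} n\, s^{k}
  \;=\; n \cdot \frac{s\,(s^{K}-1)}{s-1}
  \;\le\; n\, s^{K} \cdot \frac{s}{s-1}
  \;\le\; 2\, n\, s^{K}
  \qquad (s \ge 2),
\]
```

so the total is dominated by the last iteration, which gives the stated bound of O(n × s^K).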

References

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference Very Large Data Bases, VLDB, 1994, pp. 487–499.

[2] R. Agrawal, R. Srikant, Mining sequential patterns, in: Eleventh International Conference on Data Engineering, Taipei, Taiwan, IEEE Computer Society Press, 1995, pp. 3–14.

[3] J.F. Allen, Maintaining knowledge about temporal intervals, Communications of the ACM 26 (11) (1983) 832–843.

[4] J.P. Caraca-Valente, I. Lopez-Chavarrias, Discovering similar patterns in time series, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2000, ACM New York, NY, USA, pp. 497–505.

[5] B. Chang, T. Wang, D. Yang, H. Luan, S. Tang, Efficient algorithms for incremental maintenance of closed sequential patterns in large databases, Data & Knowledge Engineering 68 (1) (2009) 68–106.

[6] Y.-L. Chen, Y.-H. Hu, Constraint-based sequential pattern mining: the consideration of recency and compactness, Decision Support Systems 42 (2) (2006) 1203–1215.

[7] Y.-L. Chen, T.C.-K. Huang, A novel knowledge discovering model for mining fuzzy multi-level sequential patterns in sequence databases, Data & Knowledge Engineering 66 (3) (2008) 349–367.

[8] H. Cheng, X. Yan, J. Han, IncSpan: incremental mining of sequential patterns in large database, in: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, ACM Press, 2004, pp. 527–532.

[9] S. Cong, J. Han, D. Padua, Parallel mining of closed sequential patterns, in: Proceedings of the 11th Conference on Knowledge Discovery in Data Mining (ACM SIGKDD), 2005, ACM New York, NY, USA, pp. 562–567.

[10] S. de Amo, D.A. Furtado, First-order temporal pattern mining with regular expression constraints, Data and Knowledge Engineering 62 (3) (2007) 401–420.

[11] S. de Amo, W.P. Junior, A. Giacometti, MILPRIT*: a constraint-based algorithm for mining temporal relational patterns, International Journal of Data Warehousing and Mining 4 (4) (2008) 42–61.

[12] S. de Amo, W.P. Junior, A. Giacometti, T.G. Clemente, Mining temporal relational patterns over databases with hybrid time domains, in: SBBD, 2007.

[13] D.H. Dorr, A.M. Denton, Establishing relationships among patterns in stock market data, Data & Knowledge Engineering 68 (3) (2009) 318–337.

[14] T.P. Exarchos, M.G. Tsipouras, C. Papaloukas, D.I. Fotiadis, A two-stage methodology for sequence classification based on sequential pattern mining and optimization, Data & Knowledge Engineering 66 (3) (2008) 467–487.

[15] M.N. Garofalakis, R. Rastogi, K. Shim, SPIRIT: sequential pattern mining with regular expression constraints, in: Proceedings of the 25th International Conference on Very Large Data Bases, 1999, pp. 223–234.

[16] J. Han, H. Cheng, D. Xin, X. Yan, Frequent pattern mining: current status and future directions, Data Mining and Knowledge Discovery 15 (1) (2007) 55–86.

[17] J. Han, M. Kamber, Mining Sequence Patterns in Transactional Databases, in: Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[18] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.C. Hsu, FreeSpan: frequent pattern-projected sequential pattern mining, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States, ACM Press, 2000, pp. 355–359.

[19] T.P. Hong, K.Y. Lin, S.L. Wang, Mining fuzzy sequential patterns from multiple-item transactions, in: IFSA World Congress and 20th NAFIPS International Conference, Vancouver, BC, Canada, 2001, pp. 1317–1321.

[20] F. Hoppner, Knowledge Discovery from Sequential Data, Ph.D. Thesis, Technical University of Braunschweig, Germany, 2003.

[21] Y.-C. Hu, G.-H. Tzeng, C.-M. Chen, Deriving two-stage learning sequences from knowledge in fuzzy sequential pattern mining, Information Sciences 159 (1–2) (2004) 69–86.

[22] P.-S. Kam, A.W.-C. Fu, Discovering temporal patterns for interval-based events, in: Proceeding of Second International Conference on Data Warehousing and Knowledge Discovery, London, UK, Springer, 2000, pp. 317–326.

[23] M.Y. Lin, S.Y. Lee, Fast discovery of sequential patterns through memory indexing and database partitioning, Journal Information Science and Engineering 21 (1) (2005) 109–128.


[24] N.P. Lin, H.-J. Chen, W.-H. Hao, H.-E. Chueh, C.-I. Chang, Mining negative fuzzy sequential patterns, in: Proceedings of thr Seventh WSEAS InternationalConference Simulation, Modelling and Optimization Beijing, China 2007, pp. 52–57.

[25] D. Lo, S.C. Khoo, C. Liu, Efficient mining of iterative patterns for software specification discovery, in: Proceedings of SIGKDD International Conference onKnowledge Discovery and Data Mining, ACM, New York, NY, USA, 2007, pp. 460–469.

[26] F. Masseglia, P. Poncelet, M. Teisseire, Incremental mining of sequential patterns in large databases, Data and Knowledge Engineering 46 (1) (2003) 21–97.

[27] F. Masseglia, M. Teisseire, P. Poncelet, Sequential pattern mining: a survey on issues and approaches, in: Encyclopedia of Data Warehousing andMining, Information Science Publishing, 2005, pp. 1028–1032.

[28] T. Morzy, M. Wojciechowski, M. Zakrzewicz, Efficient constraint-based sequential pattern mining using dataset filtering techniques, in: Proceedings ofthe Baltic Conference, BalticDB&IS 2002, Table of contents, vol. 1, Institute of Cybernetics at Tallin Technical University, 2002, pp. 213–224.

[29] A. Nanopoulos, M. Zakrzewicz, T. Morzy, Y. Manolopoulos, Efficient storage and querying of sequential patterns in database systems?, Information andSoftware Technology 45 (1) (2003) 23–34

[30] S. Park, W.W. Chu, J. Yoon, C. Hsu, Efficient searches for similar subsequences of different lengths insequence databases, in: Proceedings of the 16thInternational Conference on Data Engineering, 2000, p. 23–32.

[31] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M.C. Hsu, PrefixSpan: mining sequential patterns efficiently by prefix-projected patterngrowth, in: Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, 2001, pp. 215–224.

[32] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, M.C. Hsu, Mining sequential patterns by pattern-growth: the prefixspan approach,IEEE Transactions on Knowledge and Data Engineering 16 (11) (2004) 1424–1440.

[33] J. Pei, J. Han, Constraint-based sequential pattern mining: the pattern-growth methods, Journal of Intelligent Information Systems 28 (2) (2007) 133–160.

[34] R. Srikant, R. Agrawal, Mining sequential patterns: generalizations and performance improvements, in: Proceedings of the Fifth InternationalConference on Extending Database Technology (EDBT), Avignon, France, IBM Research Division, 1996, pp. 3–17.

[35] J. Wang, J. Han, C. Li, Frequent closed sequence mining without candidate maintenance, IEEE Transactions on Knowledge and Data Engineering 19 (8)(2007) 1042–1056.

[36] K. Wang, J. Tan, Incremental discovery of sequential patterns, in: ACM SIGMOD Data Mining Workshop: Research Issues on Data Mining andKnowledge Discovery (SIGMOD96), Montreal, Canada, 1996, pp. 95–102.

[37] M. Wojciechowski, Interactive constraint-based sequential pattern mining, in: Proceedings of the Fifth East European Conference on Advances inDatabases and Information Systems (ADBIS’01), Vilnius, Lithuania, 2001, pp. 169–181.

[38] S.-Y. Wu, Y.-L. Chen, Mining non-ambiguous temporal patterns for interval-based events, IEEE Transactions on Knowledge and Data Engineering 19 (6) (2007) 742–758.

[39] X. Yan, J. Han, R. Afshar, CloSpan: mining closed sequential patterns in large datasets, in: Proceedings of the International Conference SIAM Data Mining, 2003, pp. 166–177.

[40] G. Yang, Computational aspects of mining maximal frequent patterns, Theoretical Computer Science 362 (1) (2006) 63–85.

[41] D. Yuan, K. Lee, H. Cheng, G. Krishna, Z. Li, X. Ma, Y. Zhou, J. Han, CISpan: comprehensive incremental mining algorithms of closed sequential patterns for multi-versional software mining, in: Proceedings of the SIAM International Conference on Data Mining (SDM 2008), 2008, pp. 84–95.

[42] M.J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning 42 (1) (2001) 31–60.

[43] M.J. Zaki, Efficiently mining frequent trees in a forest, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, ACM New York, NY, USA, pp. 71–80.

[44] M.J. Zaki, Mining non-redundant association rules, Data Mining and Knowledge Discovery 9 (3) (2004) 223–248.

[45] M. Zhang, B. Kao, D. Cheung, C.L. Yip, Efficient algorithms for incremental update of frequent sequences, in: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2002), Taipei, Taiwan, Springer Berlin/Heidelberg, 2002, pp. 186–197.

[46] X. Zhu, X. Wu, A.K. Elmagarmid, Z. Feng, L. Wu, Video data mining: semantic indexing and event detection from the association perspective, IEEE Transactions on Knowledge and Data Engineering 17 (5) (2005) 665–677.

Shin-Yi Wu is an engineer at the Industrial Technology Research Institute in Taiwan. She received her Ph.D. degree from the Department of Information Management, National Central University, Taiwan. She received the B.S. degree in Information Management and the M.S. degree in Electrical Engineering from Chung Hua University in Taiwan in 2000 and 2002, respectively. Her current research interests include data mining, sequential pattern mining, temporal pattern mining, clustering, and personalized recommendation.

Yen-Liang Chen is Professor of Information Management at National Central University of Taiwan. He received his Ph.D. degree in computer science from National Tsing Hua University, Hsinchu, Taiwan. His current research interests include data mining, information retrieval, data warehousing and operations research. He has published papers in Data and Knowledge Engineering, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Software Engineering, IEEE Transactions on SMC – part A and part B, Information Systems, Information Sciences, Knowledge and Information Systems, Decision Support Systems, Journal of Information Science, Information and Management, Electronic Commerce Research & Applications, Operations Research, Naval Research Logistics, Transportation Research – part B, European Journal of Operational Research, Computers & Operations Research and many others. He is currently the Editor-in-chief of Journal of Information Management.