pattern discovery using sequence data mining

8/10/2019 Pattern Discovery Using Sequence Data Mining

1/286


2/286

Pradeep Kumar

Indian Institute of Management Lucknow, India

P. Radha KrishnaInfosys Lab, Infosys Limited, India

S. Bapi RajuUniversity of Hyderabad, India

Pattern Discovery UsingSequence Data Mining:

Applications and Studies


3/286


4/286

List of Reviewers

Manish Gupta, University of Illinois at Urbana, USAChandra Sekhar,Indian Institute of Technology Madras,IndiaArnab Bhattacharya,Indian Institute of Technology Kanpur,IndiaPadmaja T Maruthi, University of Hyderabad,IndiaT. Ravindra Babu,Infosys Technologies Ltd, IndiaPratibha Rani,International Institute of Information Technology Hyderabad, IndiaNita Parekh,International Institute of Information Technology Hyderabad, IndiaAnass El-Haddadi,IRIT, FrancePinar Senkul,Middle East Technical University, TurkeyJessica Lin, George Mason University, USAPradeep Kumar,Indian Institute of Management Lucknow, IndiaRaju S. Bapi, University of Hyderabad, IndiaP. Radha Krishna,Infosys Lab, Infosys Limited, India


5/286

Table of Contents

Preface ..................................................................................................................................................vii

Section 1

Current State of Art

Chapter 1

Applications of Pattern Discovery Using Sequential Data Mining ........................................................ 1Manish Gupta, University of Illinois at Urbana-Champaign, USA

Jiawei Han, University of Illinois at Urbana-Champaign, USA

Chapter 2

A Review of Kernel Methods Based Approaches to Classication and Clustering of SequentialPatterns, Part I: Sequences of Continuous Feature Vectors.................................................................. 24

Dileep A. D., Indian Institute of Technology, India

Veena T., Indian Institute of Technology, IndiaC. Chandra Sekhar, Indian Institute of Technology, India

Chapter 3

A Review of Kernel Methods Based Approaches to Classication and Clustering of SequentialPatterns, Part II: Sequences of Discrete Symbols ................................................................................. 51

Veena T., Indian Institute of Technology, India

Dileep A. D., Indian Institute of Technology, India

C. Chandra Sekhar, Indian Institute of Technology, India

Section 2Techniques

Chapter 4

Mining Statistically Signicant Substrings Based on the Chi-Square Measure ................................... 73Sourav Dutta, IBM Research Lab, India

Arnab Bhattacharya, Indian Institute of Technology Kanpur, India


6/286

Chapter 5

Unbalanced Sequential Data Classication Using Extreme Outlier Elimination and SamplingTechniques ............................................................................................................................................ 83

T. Maruthi Padmaja, University of Hyderabad (UoH), IndiaRaju S. Bapi, University of Hyderabad (UoH), India

P. Radha Krishna, Infosys Lab, Infosys Limited, India

Chapter 6

Quantization Based Sequence Generation and Subsequence Pruning for Data MiningApplications .......................................................................................................................................... 94

T. Ravindra Babu, Infosys Limited, India

M. Narasimha Murty, Indian Institute of Science Bangalore, India

S. V. Subrahmanya, Infosys Limited, India

Chapter 7Classication of Biological Sequences ............................................................................................... 111

Pratibha Rani, International Institute of Information Technology Hyderabad, India

Vikram Pudi, International Institute of Information Technology Hyderabad, India

Section 3

Applications

Chapter 8

Approaches for Pattern Discovery Using Sequential Data Mining .................................................... 137Manish Gupta, University of Illinois at Urbana-Champaign, USA

Jiawei Han, University of Illinois at Urbana-Champaign, USA

Chapter 9

Analysis of Kinase Inhibitors and Druggability of Kinase-Targets Using Machine LearningTechniques .......................................................................................................................................... 155

S. Prasanthi, University of Hyderabad, India

S. Durga Bhavani, University of Hyderabad, India

T. Sobha Rani, University of Hyderabad, India

Raju S. Bapi, University of Hyderabad, India

Chapter 10

Identication of Genomic Islands by Pattern Discovery.................................................................... 166Nita Parekh, International Institute of Information Technology Hyderabad, India

Chapter 11

Video Stream Mining for On-Road Trafc Density Analytics ........................................................... 182Rudra Narayan Hota, Frankfurt Institute for Advanced Studies, Germany

Kishore Jonna, Infosys Lab, Infosys Limited, India



7/286

Chapter 12

Discovering Patterns in Order to Detect Weak Signals and Dene New Strategies ........................... 195Anass El Haddadi, University of Toulouse III, France & University of Mohamed V, Morocco

Bernard Dousset, University of Toulouse, FranceIlham Berrada, University of Mohamed V, Morocco

Chapter 13

Discovering Patterns for Architecture Simulation by Using Sequence Mining.................................212Pnar Senkul, Middle East Technical University, Turkey

Nilufer Onder, Michigan Technological University, USA

Soner Onder, Michigan Technological University, USA

Engin Maden, Middle East Technical University, Turkey

Hui Meen Nyew, Michigan Technological University, USA

Chapter 14Sequence Pattern Mining for Web Logs ............................................................................................. 237

Pradeep Kumar, Indian Institute of Management Lucknow, India

Raju S. Bapi, University of Hyderabad, India


Compilation of References............................................................................................................... 244

About the Contributors.................................................................................................................... 264

Index................................................................................................................................................... 270


8/286

vii

Preface

A huge amount of data is collected every day in the form of sequences. These sequential data are valu-able sources of information not only to search for a particular value or event at a specific time, but alsoto analyze the frequency of certain events or sets of events related by particular temporal/sequentialrelationship. For example, DNA sequences encode the genetic makeup of humans and all other species,and protein sequences describe the amino acid composition of proteins and encode the structure andfunction of proteins. Moreover, sequences can be used to capture how individual humans behave throughvarious temporal activity histories such as weblog histories and customer purchase patterns. In generalthere are various methods to extract information and patterns from databases, such as time series ap-proaches, association rule mining, and data mining techniques.

The objective of this book is to provide a concise state-of-the-art in the field of sequence data min-ing along with applications. The book consists of 14 chapters divided into 3 sections. The first sectionprovides review of state-of-art in the field of sequence data mining. Section 2 presents relatively newtechniques for sequence data mining. Finally, in section 3, various application areas of sequence datamining have been explored.

Chapter 1,Approaches for Pattern Discovery Using Sequential Data Mining, by Manish Gupta andJiawei Han of University of Illinois at Urbana-Champaign, IL, USA, discusses different approaches formining of patterns from sequence data. Apriori based methods and the pattern growth methods are theearliest and the most influential methods for sequential pattern mining. There is also a vertical formatbased method which works on a dual representation of the sequence database. Work has also been donefor mining patterns with constraints, mining closed patterns, mining patterns from multi-dimensionaldatabases, mining closed repetitive gapped subsequences, and other forms of sequential pattern mining.Some works also focus on mining incremental patterns and mining from stream data. In this chapter,the authors have presented at least one method of each of these types and discussed advantages anddisadvantages.

Chapter 2, A Review of Kernel Methods Based Approaches to Classification and Clustering ofSequential Patterns, Part I: Sequences of Continuous Feature Vectors, was authored by Dileep A. D.,Veena T., and C. Chandra Sekhar of Department of Computer Science and Engineering, Indian Instituteof Technology Madras, India. They present a brief description of kernel methods for pattern classifica-tion and clustering. They also describe dynamic kernels for sequences of continuous feature vectors.The chapter also presents a review of approaches to sequential pattern classification and clustering usingdynamic kernels.


9/286

viii

Chapter 3 is A Review of Kernel Methods Based Approaches to Classification and Clustering ofSequential Patterns, Part II: Sequences of Discrete Symbolsby Veena T., Dileep A. D., and C. ChandraSekhar of Department of Computer Science and Engineering, Indian Institute of Technology Madras,

India. The authors review methods to design dynamic kernels for sequences of discrete symbols. In theirchapter they have also presented a review of approaches to classification and clustering of sequences ofdiscrete symbols using the dynamic kernel based methods.

Chapter 4 is titled,Mining Statistically Significant Substrings Based on the Chi-Square Measure,contributed by Sourav Dutta of IBM Research India along with Arnab Bhattacharya Dept. of ComputerScience and Engineering, Indian Institute of Technology, Kanpur, India. This chapter highlights the chal-lenge of efficient mining of large string databases in the domains of intrusion detection systems, playerstatistics, texts, proteins, et cetera, and how these issues have emerged as challenges of practical nature.Searching for an unusual pattern within long strings of data is one of the foremost requirements formany diverse applications. The authors first present the current state-of-art in this area and then analyzethe different statistical measures available to meet this end. Next, they argue that the most appropriate

metric is the chi-square measure. Finally, they discuss different approaches and algorithms proposed forretrieving the top-k substrings with the largest chi-square measure. The local-maxima based algorithmsmaintain high quality while outperforming others with respect to the running time.

Chapter 5 is Unbalanced Sequential Data Classification Using Extreme Outlier Elimination andSampling Techniques, by T. Maruthi Padmaja along with Raju S. Bapi from University of Hyderabad,Hyderabad, India and P. Radha Krishna, Infosys Lab, Infosys Technologies Ltd, Hyderabad, India. Thischapter focuses on problem of predicting minority class sequence patterns from the noisy and unbal-anced sequential datasets. To solve this problem, the atuhors proposed a new approach called extremeoutlier elimination and hybrid sampling technique.

Chapter 6 is Quantization Based Sequence Generation and Subsequence Pruning for Data MiningApplicationsby T. Ravindra Babu and S. V. Subrahmanya of E-Comm. Research Lab, Education and

Research, Infosys Technologies Limited, Bangalore, India, along with M. Narasimha Murty, Dept. ofComputer Science and Automation, Indian Institute of Science, Bangalore, India. This chapter has high-lighted the problem of combining data mining algorithms with data compaction used for data compression.Such combined techniques lead to superior performance. Approaches to deal with large data includeworking with a representative sample instead of the entire data. The representatives should preferablybe generated with minimal data scans, methods like random projection, et cetera.

Chapter 7 is Classification of Biological Sequencesby Pratibha Rani and Vikram Pudi of InternationalInstitute of Information Technology, Hyderabad, India, and it discusses the problem of classifying a newlydiscovered sequence like a protein or DNA sequence based on their important features and functions,using the collection of available sequences. In this chapter, the authors study this problem and presenttwo techniques Bayesian classifiers: RBNBC and REBMEC. The algorithms used in these classifiers

incorporate repeated occurrences of subsequences within each sequence. Specifically, RBNBC (RepeatBased Naive Bayes Classifier) uses a novel formulation of Naive Bayes, and the second classifier,REBMEC (Repeat Based Maximum Entropy Classifier) uses a novel framework based on the classicalGeneralized Iterative Scaling (GIS) algorithm.

Chapter 8,Applications of Pattern Discovery Using Sequential Data Mining, by Manish Gupta andJiawei Han of University of Illinois at Urbana-Champaign, IL, USA, presents a comprehensive reviewof applications of sequence data mining algorithms in a variety of domains like healthcare, education,Web usage mining, text mining, bioinformatics, telecommunications, intrusion detection, et cetera.


10/286

ix

Chapter 9,Analysis of Kinase Inhibitors and Druggability of Kinase-Targets Using Machine Learn-ing Techniques, by S. Prashanthi, S. Durga Bhavani, T. Sobha Rani, and Raju S. Bapi of Departmentof Computer & Information Sciences, University of Hyderabad, Hyderabad, India, focuses on human

kinase drug target sequences since kinases are known to be potential drug targets. The authors have alsopresented a preliminary analysis of kinase inhibitors in order to study the problem in the protein-ligandspace in future. The identification of druggable kinases is treated as a classification problem in whichdruggable kinases are taken as positive data set and non-druggable kinases are chosen as negative data set.

Chapter 10,Identification of Genomic Islands by Pattern Discovery, by Nita Parekh of InternationalInstitute of Information Technology, Hyderabad, India addresses a pattern recognition problem at thegenomic level involving identifying horizontally transferred regions, called genomic islands. A horizon-tally transferred event is defined as the movement of genetic material between phylogenetically unrelatedorganisms by mechanisms other than parent to progeny inheritance. Increasing evidence suggests theimportance of horizontal transfer events in the evolution of bacteria, influencing traits such as antibioticresistance, symbiosis and fitness, virulence, and adaptation in general. Considerable effort is being made

in their identification and analysis, and in this chapter, a brief summary of various approaches used inthe identification and validation of horizontally acquired regions is discussed.

Chapter 11, Video Stream Mining for On-Road Traffic Density Analytics, by Rudra Narayan Hota ofFrankfurt Institute for Advanced Studies, Frankfurt, Germany along with Kishore Jonna and P. RadhaKrishna, Infosys Lab, Infosys Technologies Limited, India, addresses the problem of estimating computervision based traffic density using video stream mining. The authors present an efficient approach fortraffic density estimation using texture analysis along with Support Vector Machine (SVM) classifier, anddescribe analyzing traffic density for on-road traffic congestion control with better flow management.

Chapter 12,Discovering Patterns in Order to Detect Weak Signals and Define New Strategies, byAnass El Haddadi of Universit de Toulouse, IRIT UMR France Bernard Dousset, Ilham Berrada ofEnsias, AL BIRONI team, Mohamed V University Souissi, Rabat, Morocco presents four methods

for discovering patterns in the competitive intelligence process: correspondence analysis, multiplecorrespondence analysis, evolutionary graph, and multi-term method. Competitive intelligenceactivities rely on collecting and analyzing data in order to discover patterns from data using sequencedata mining. The discovered patterns are used to help decision-makers considering innovation and de-fining business strategy.

Chapter 13,Discovering Patterns for Architecture Simulation by Using Sequence Mining,by PnarSenkul (Middle East Technical University, Computer Engineering Dept., Ankara, Turkey) along withNilufer Onder (Michigan Technological University, Computer Science Dept., Michigan, USA), SonerOnder (Michigan Technological University, Computer Science Dept., Michigan, USA), Engin Maden(Middle East Technical University, Computer Engineering Dept., Ankara, Turkey) and Hui Meen Nyew(Michigan Technological University, Computer Science Dept., Michigan, USA), discusses the problem

of designing and building high performance systems that make effective use of resources such as spaceand power. The design process typically involves a detailed simulation of the proposed architecture fol-lowed by corrections and improvements based on the simulation results. Both simulator developmentand result analysis are very challenging tasks due to the inherent complexity of the underlying systems.They present a tool called Episode Mining Tool (EMT), which includes three temporal sequence miningalgorithms, a preprocessor, and a visual analyzer.

Chapter 14 is called Sequence Pattern Mining for Web Logsby Pradeep Kumar, Indian Institute ofManagement, Lucknow, India, Raju S. Bapi, University of Hyderabad, India and P. Radha Krishna,


11/286

x

Infosys Lab, Infosys Technologies Limited, India. In their work, the authors utilize a variation to theAprioriALL Algorithm, which is commonly used for the sequence pattern mining. The proposed varia-tion adds up the measure Interest during every step of candidate generation to reduce the number of

candidates thus resulting in reduced time and space cost.This book can be useful to academic researchers and graduate students interested in data mining

in general and in sequence data mining in particular, and to scientists and engineers working in fieldswhere sequence data mining is involved, such as bioinformatics, genomics, Web services, security, andfinancial data analysis.

Sequence data mining is still a fairly young research field. Much more remains to be discovered inthis exciting research domain in the aspects related to general concepts, techniques, and applications.Our fond wish is that this collection sparks fervent activity in sequence data mining, and we hope thisis not the last word!

Pradeep KumarIndian Institute of Management Lucknow, India

P. Radha Krishna

Infosys Lab, Infosys Limited, India

S. Bapi Raju

University of Hyderabad, India


12/286

Section 1

Current State of Art


13/286


14/286

1

Copyright 2012, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Chapter 1

DOI: 10.4018/978-1-61350-056-9.ch001

HEALTHCARE

Patterns in healthcare domain include the commonpatterns in paths followed by patients in hospitals,patterns observed in symptoms of a particulardisease, patterns in daily activity and health data.Works related to these applications are discussedin this sub-section.

Patterns in patient paths: The purpose of the

French Diagnosis Related Groups informationsystem is to describe hospital activity by focusingon hospital stays. (Nicolas, Herengt & Albuisson,2004) propose usage of sequential pattern miningfor patient path analysis across multiple healthcareinstitutions. The objective is to discover, to classifyand to visualize frequent patterns among patientpath. They view a patient path as a sequence of

Manish Gupta

University of Illinois at Urbana-Champaign, USA

Jiawei HanUniversity of Illinois at Urbana-Champaign, USA

Applications of PatternDiscovery Using Sequential

Data Mining

ABSTRACT

Sequential pattern mining methods have been found to be applicable in a large number of domains.Sequential data is omnipresent. Sequential pattern mining methods have been used to analyze this data

and identify patterns. Such patterns have been used to implement efcient systems that can recommend

based on previously observed patterns, help in making predictions, improve usability of systems, de-

tect events, and in general help in making strategic product decisions. In this chapter, we discuss the

applications of sequential data mining in a variety of domains like healthcare, education, Web usage

mining, text mining, bioinformatics, telecommunications, intrusion detection, et cetera. We conclude

with a summary of the work.


15/286

2

Applications of Pattern Discovery Using Sequential Data Mining

sets. Each set in the sequence is a hospitaliza-tion instance. Each element in a hospitalizationcan be any symbolic data gathered by the PMSI

(medical data source). They used the SLPMinersystem (Seno & Karypis, 2002) for mining thepatient path database in order to find frequentsequential patterns among the patient path. Theytested the model on the 2002 year of PMSI data atthe Nancy University Hospital and also proposean interactive tool to perform inter-institutionalpatient path analysis.

Patterns in dyspepsia symptoms: Considera domain expert, who is an epidemiologist andis interested in finding relationships between

symptoms of dyspepsia within and across timepoints. This can be done by first mining patternsfrom symptom data and then using patterns todefine association rules. Rules could look likeANOREX2=0 VOMIT2=0 NAUSEA3=0 AN-OREX3=0 VOMIT3=0 DYSPH2=0 whereeach symptom is represented as N=V(time=N and value=V). ANOREX (anorexia),VOMIT (vomiting), DYSPH (dysphagia) andNAUSEA (nausea) are the different symptoms.However, a better way of handling this is to de-fine subgroups as a set of symptoms at a singletime point. (Lau, Ong, Mahidadia, Hoffmann,Westbrook, & Zrimec, 2003) solve the problemof identifying symptom patterns by implement-ing a framework for constraint based associationrule mining across subgroups. Their framework,Apriori with Subgroup and Constraint (ASC), isbuilt on top of the existing Apriori framework.They have identified four different types of phase-wise constraints for subgroups: constraint acrosssubgroups, constraint on subgroup, constraint onpattern content and constraint on rule. A constraintacross subgroups specifies the order of subgroupsin which they are to be mined. A constraint onsubgroup describes the intra-subgroup criteriaof the association rules. It describes a minimumsupport for subgroups and a set of constraints foreach subgroup. A constraint on pattern contentoutlines the inter-subgroup criteria on association

rules. It describes the criteria on the relationshipsbetween subgroups. A constraint on rule outlinesthe composition of an association rule; it describes

the attributes that form the antecedents and theconsequents, and calculates the confidence of anassociation rule. It also specifies the minimumsupport for a rule and prunes away item-sets that donot meet this support at the end of each subgroup-merging step. A typical user constraint can looklike [1,2,3][1, a=A1&n


16/286

3


new test cases, they identify behaviors that do notconform to normal behavior and report them aspredicted anomalous health problems.

EDUCATION

In the education domain, work has been doneto extract patterns from source code and studentteamwork data.

Patterns in source code: A coding pattern isa frequent sequence of method calls and controlstatements to implement a particular behavior.Coding patterns include copy-and-pasted code,

crosscutting concerns (parts of a program whichrely on or must affect many other parts of thesystem) and implementation idioms. Dupli-cated code fragments and crosscutting concernsthat spread across modules are problematic insoftware maintenance. (Ishio, Date, Miyake, &Inoue, 2008) propose a sequential pattern min-ing approach to capture coding patterns in Javaprograms. They define a set of rules to translateJava source code into a sequence database forpattern mining, and apply PrefixSpan algorithmto the sequence database. They define constraintsfor mining source code patterns. A constraint forcontrol statements could be: If a pattern includesa LOOP/IF element, the pattern must include itscorresponding element generated from the samecontrol statement. They classify sub-patterns intopattern groups. As a case study, they applied theirtool to six open-source programs and manuallyinvestigated the resultant patterns.

They identify about 17 pattern groups whichthey classify into 5 categories:

1. A boolean method to insert an additionalaction: , , ,

2. A boolean method to change the behaviorof multiple methods: ,, ,

3. A pair of set-up and clean-up: , , ,

4. Exception Handling: Every instance is in-cluded in a try-catch statement.

5. Other patterns.

They have made this technique available asa tool: Fung(http://sel.ist.osaka-u.ac.jp/~ishio/fung/)

Patterns in student team-work data: (Kay,Maisonneuve, Yacef, & Zaane, 2006) describedata mining of student group interaction data toidentify significant sequences of activity. The goal

is to build tools that can flag interaction sequencesindicative of problems, so that they can be usedto assist student teams in early recognition ofproblems. They also want tools that can identifypatterns that are markers of success so that thesemight indicate improvements during the learningprocess. They obtain their data using TRAC whichis an open source tool designed for use in softwaredevelopment projects. Students collaborate bysharing tasks via the TRAC system. These tasksare managed by a Ticket system; source codewriting tasks are managed by a version controlsystem called SVN; students communicate bymeans of collaborative web page writing calledWiki. Data consist of events where each eventis represented as Event = {EventType, Resour-ceId, Author, Time} where: EventType is one ofT (for Ticket), S (for SVN), W (for Wiki). Onesuch sequence is generated for each of the groupof students.

The original sequence obtained for each groupwas 285 to 1287 long. These event sequenceswere then broken down into several sequencesof events using a per session approach or a perresource approach. In breakdown per session ap-proach, date and the resourceId are omitted anda sequence is of form: (iXj) which captures thenumber of i consecutive times a medium X wasused by j different authors, e.g., . In breakdown per resource ap-


17/286

4


proach, sequence is of form which capturesthe number of i different events of type X, thenumber j of authors, and the number of days over

which t the resource was modified, e.g., . In a follow-up paper (Perera, Kay, Yacef, &Koprinska, 2007), they have a third approach,breakdown by task where every sequence is ofthe form (i,X,A) which captures the number ofconsecutive events (i) occurring on a particularTRAC medium (X), and the role of the author (A).

Patterns observed in group sessions: Bettergroups had many alternations of SVN and Wikievents, and SVN and Ticket events whereasweaker groups had almost none. The best group

also had the highest proportion of author ses-sions containing many consecutive ticket events(matching their high use of ticketing) and SVNevents (suggesting they committed their work tothe group repository more often).

A more detailed analysis of these patternsrevealed that the best group used the Ticketmore than the Wiki, whereas the weakest groupdisplayed the opposite pattern. The data sug-gested group leaders in good groups were muchless involved in technical work, suggesting workwas being delegated properly and the leader wasleading rather than simply doing all the work. Incontrast, the leaders of the poorer groups eitherseemed to use the Wiki (a less focused medium)more than the tickets, or be involved in too muchtechnical work.

Patterns observed in task sequences: The twobest groups had the greatest percentage supportfor the pattern (1,t,L)(1,t,b), which were mostlikely tickets initiated by the leader and acceptedby another team member. The fact this occurredmore often than (1,t,L)(2,t,b), suggests that thebetter groups were distinguished by tasks beingperformed on the Wiki or SVN files before theticket was closed by the second member. Notably,the weakest group had higher support for this latterpattern than the former. The best group was one ofthe only two to display the patterns (1,t,b)(1,s,b)and (1,s,b)(1,t,b) the first likely being a ticket

being accepted by a team member and then SVNwork relating to that task being completed and thesecond likely being work being done followed

by the ticket being closed. The close coupling oftask-related SVN and Wiki activity and Ticketevents for this group was also shown by relativelyhigh support for the patterns (1,t,b)(1,t,b)(1,t,b),(1,t,b)(1,s,b)(1,t,b) and (1,t,b)(1,w,b)(1,t,b). Thepoorest group displayed the highest support forthe last pattern, but no support for the former,again indicating their lack of SVN use in tasks.

Patterns observed in resource sequences: Thebest group had very high support for patternswhere the leader interacted with group members

on tickets, such as (L,1,t)(b,1,t)(L,1,t). The poorestgroup in contrast lacked these interaction patterns,and had more tickets which were created by theTracker rather than the Leader, suggestive ofweaker leadership. The best group displayed thehighest support for patterns such as (b,3,t) and(b,4,t), suggestive of group members making atleast one update on tickets before closing them.In contrast, the weaker groups showed supportmainly for the pattern (b,2,t), most likely indicativeof group members accepting and closing ticketswith no update events in between.

Web Usage Mining

The complexity of tasks such as Web site design,Web server design, and of simply navigatingthrough a Web site has been increasing continu-ously. An important input to these design tasksis the analysis of how a Web site is being used.Usage analysis includes straightforward statistics,such as page access frequency, as well as moresophisticated forms of analysis, such as findingthe common traversal paths through a Web site.Web Usage Mining is the application of patternmining techniques to usage logs of large Webdata repositories in order to produce results thatcan be used in the design tasks mentioned above.However, there are several preprocessing tasks that


18/286


19/286

6


the German SchulWeb (a database of German-language school magazines), demonstrate the ap-propriateness of WUM in discovering navigation

patterns and show how those discoveries can helpin assessing and improving the quality of the sitedesign i.e. conformance of the web sites structureto the intuition of each group of visitors accessingthe site. The intuition of the visitors is indirectlyreflected in their navigation behavior, as repre-sented in their browsing patterns. By comparingthe typical patterns with the site usage expectedby the site designer, one can examine the qualityof the site and give concrete suggestions for itsimprovement. For instance, repeated refinements

of a query may indicate a search environment thatis not intuitive for some users. Also, long listsof results may signal that sufficiently selectivesearch options are lacking, or that they are notunderstood by everyone.

A session is a directed list of page accessesperformed by a user during her/his visit in a site.Pages of a session are mapped onto elementsof a sequence, whereby each element is a paircomprised of the page and a positive integer. Thisinteger is the occurrence of the page in the session,taking the fact into account that a user may visit thesame page more than once during a single session.Further, they also define generalized sequenceswhich are sequences with length constraints ongaps. These constraints are expressed in a mininglanguage MINT.

The patterns that they observe are as follows.Searches reaching a school entry are a dominantsub-pattern. State lists of schools are the mostpopular lists. Schools are rarely reached in shortsearches.

Pattern Discovery forWeb Personalization

Pattern discovery from usage data can also be usedfor Web personalization. (Mobasher, Dai, Luo, &Nakagawa, 2002) find that more restrictive pat-terns, such as contiguous sequential patterns (e.g.,

frequent navigational paths) are more suitable forpredictive tasks, such as Web pre-fetching, whichinvolve predicting which item is accessed next by

a user), while less constrained patterns, such asfrequent item-sets or general sequential patternsare more effective alternatives in the context ofWeb personalization and recommender systems.

Web usage preprocessing ultimately resultsin a set of n page-views, P = {p

1, p

2... p

n}, and a

set of m user transactions, T = {t1, t

2... t

m}. Each

transaction t is defined as an l-length sequence ofordered pairs: t = where w(p

ti) is the weight associated with

page-view pti. Contiguous sequential patterns

(CSPs -- patterns in which the items appearingin the sequence must be adjacent with respectto the underlying ordering) are used to capturefrequent navigational paths among user trails.General sequential patterns are used to representmore general navigational patterns within the site.

To build a recommendation algorithm usingsequential patterns, the authors focus on frequentsequences of size |w| + 1 whose prefix contains anactive user session w. The candidate page-viewsto be recommended are the last items in all suchsequences. The recommendation values are basedon the confidence of the patterns. A simple triestructure is used to store both the sequential andcontiguous sequential patterns discovered duringthe pattern discovery phase. The recommendationalgorithm is extended to generate all kth orderrecommendations as follows. First, the recom-mendation engine uses the largest possible activesession window as an input for recommendationengine. If the engine cannot generate any recom-mendations, the size of active session window isiteratively decreased until a recommendation isgenerated or the window size becomes 0.

The CSP model can do better in terms of pre-cision, but the coverage levels, in general, maybe too low when the goal is to generate as manygood recommendations as possible. On the otherhand, when dealing with applications such asWeb pre-fetching in which the primary goal is to


20/286

7


predict the users immediate next actions (ratherthan providing a broader set of recommendations),the CSP model provides the best choice. This is

particularly true in sites with many dynamicallygenerated pages, where often a contiguous navi-gational path represents a semantically meaningfulsequence of user actions each depending on theprevious actions.

TEXT MINING

Pattern mining has been used for text databases todiscover trends, for text categorization, for docu-

ment classification and authorship identification.We discuss these works below.

Trends in Text Databases

(Lent, Agrawal, & Srikant, 1997) describe asystem for identifying trends in text documentscollected over a period of time. Trends can beused, for example, to discover that a companyis shifting interests from one domain to another.Their system mines these trends and also providesa method to visualize them.

The unit of text is a word and a phrase is a listof words. Associated with each phrase is a historyof the frequency of occurrence of the phrase, ob-tained by partitioning the documents based upontheir timestamps. The frequency of occurrence ina particular time period is the number of docu-ments that contain the phrase. A trend is a specificsubsequence of the history of a phrase that satisfiesthe users query over the histories. For example,the user may specify a shape query like a spikequery to find those phrases whose frequency ofoccurrence increased and then decreased. In thistrend analysis, sequential pattern mining is usedfor phrase identification.

A transaction ID is assigned to each word ofevery document treating the words as items in thedata mining algorithms. This transformed data isthen mined for dominant words and phrases, and

the results saved. The users query is translatedinto a shape query and this query is then executedover the mined data yielding the desired trends.

The results of the mining are a set of phrases thatoccur frequently in the underlying documents andthat match a query supplied by the user. Thus, thesystem has three major steps: Identifying frequentphrases using sequential patterns mining, generat-ing histories of phrases and finding phrases thatsatisfy a specified trend.

1-phrase is a list of elements where each ele-ment is a phrase. k-phrase is an iterated list ofphrases with k levels of nesting. is a 1-phrase, which can mean that

IBM and data mining should occur in the sameparagraph, with data mining being contiguouswords in the paragraph.

A word in a text field is mapped to an item ina data-sequence or sequential pattern and a phraseto a sequential pattern that has just one item ineach element. Each element of a data sequencein the sequential pattern problem has some as-sociated timestamp relative to the other elementsin the sequence thereby defining an ordering ofthe elements of a sequence. Sequential patternalgorithms can now be applied to the transactionID labeled words to identify simple phrases fromthe document collection.

User may be interested in phrases that arecontained in individual sentences only. Alterna-tively, the words comprising a phrase may comefrom sequential sentences so that a phrase spansa paragraph. This generalization can be accom-modated by the use of distance constraints thatspecify a minimum and/or maximum gap betweenadjacent words of a phrase. For example, the firstvariation described above would be constrainedby specifying a minimum gap of one word and amaximum gap of one sentence. The second varia-tion would have a minimum gap of one sentenceand a maximum gap of one paragraph. For thislatter example, one could further generalize thenotion from a single word from each sentenceto a set of words from each sentence by using a


21/286

8


sliding transaction time window within sentences.The generalizations made in the GSP algorithmfor mining sequential patterns allow a one-to-one

mapping of the minimum gap, maximum gap,and transaction window to the parameters of thealgorithm.

Basic mapping of phrases to sequential patternsis extended by providing a hierarchical mappingover sentences, paragraphs, or even sections of atext document. This extended mapping helps intaking advantage of the structure of a documentto obtain a richer set of phrases. Where a docu-ment has completely separate sections, phrasesthat span multiple sections can also be mined,

thereby discovering a new set of relationships.This enhancement of the GSP algorithm can beimplemented by changing the Apriori-like candi-date generation algorithm, to consider both phrasesand words as individual elements when generatingcandidate k-phrases. The manner in which thesecandidates are counted would similarly change.

Patterns for Text Categorization

(Jaillet, Laurent, & Teisseire, 2006) propose us-age of sequential patterns in the SPaC method(Sequential Patterns for Classification) for textcategorization. Text categorization is the task ofassigning a boolean value to each pair (document,category) where the value is true if the documentbelongs to the particular category. SPaC methodconsists of two steps. In the first step, sequentialpatterns are built from texts. In the second step,sequential patterns are used to classify texts.

The text consists of a set of sentences. Eachsentence is associated with a timestamp (its posi-tion in the text). Finally the set of words containedin a sentence corresponds to the set of items pur-chased by the client in the market basket analysisframework. This representation is coupled with astemming step and a stop-list. Sequential patternsare extracted using a different support applied foreach category C

i. The support of a frequent pattern

is the number of texts containing the sequence of

words. E.g., the sequential pattern < (data) (infor-mation) (machine)> means that some texts containwords data then information then machine in

three different sentences. Once sequential patternshave been extracted for each category, the goal isto derive a categorizer from the obtained patterns.This is done by computing, for each category, theconfidence of each associated sequential pattern.To solve this problem, a rule R is generated in thefollowing way:

R: C

i; confidence(R)=(#texts from C

i

matching )/(#texts matching ).

Rules are sorted depending on their confidencelevel and the size of the associated sequence.When considering a new text to be classified,a simple categorization policy is applied: the Krules having the best confidence level and beingsupported are applied. The text is then assignedto the class mainly obtained within the K rules.

Patterns for XML DocumentClassification

(Garboni, Masseglia, & Trousse, 2005) presenta supervised classification technique for XMLdocuments which is based on structure only. EachXML document is viewed as an ordered labeledtree, represented by its tags only. After a cleaningstep, each predefined cluster is characterized interms of frequent structural subsequences. Thenthe XML documents are classified based on themined patterns of each cluster.

Documents are characterized using frequentsub-trees which are common to at least x% (theminimum support) documents of the collection.The system is provided a set of training docu-ments each of which is associated with a category.Frequently occurring tags common to all clustersare removed. In order to transform an XML docu-ment to a sequence, the nodes of the XML treeare mapped into identifiers. Then each identifieris associated with its depth in the tree. Finally


22/286

9


a depth-first exploration of the tree gives thecorresponding sequence. An example sequentialpattern looks like . Once the whole set ofsequences (corresponding to the XML documentsof a collection) is obtained, a traditional sequentialpattern extraction algorithm is used to extractthe frequent sequences. Those sequences, oncemapped back into trees, will give the frequentsub-trees embedded in the collection.

They tested several measures in order to decidewhich class each test document belongs to. The twobest measures are based on the longest common

subsequence. The first one computes the averagematching between the test document and the setof sequential patterns and the second measure is amodified measure, which incorporates the actuallength of the pattern compared to the maximumlength of a sequential pattern in the cluster.

Patterns to Identify Authorsof Documents

(Tsuboi, 2002) aims at identifying the authorsof mailing list messages using a machine learn-ing technique (Support Vector Machines). Inaddition, the classifier trained on the mailinglist data is applied to identify the author of Webdocuments in order to investigate performance inauthorship identification for more heterogeneousdocuments. Experimental results show betteridentification performance when features of notonly conventional word N-gram information butalso of frequent sequential patterns extracted bya data mining technique (PrefixSpan) are used.

They applied PrefixSpan to extract sequentialword patterns from each sentence and used themas authors style markers in documents. Thesequential word patterns are sequential patternswhere item and sequence correspond to word andsentence, respectively.

Sequential pattern is where w

i

is a word and l is the length of pattern. * is any

sequence of words including empty sequence.These sequential word patterns were introducedfor authorship identification based on the fol-

lowing assumption. Because people usuallygenerate words from the beginning to the end ofa sentence, how one orders words in a sentencecan be an indicator of authors writing style. Asword order in Japanese (they study a Japanesecorpus) is relatively free, rigid word segmentsand non-contiguous word sequences may be aparticularly important indicator of the writingstyle of authors.

While N-grams (consecutive word sequences)fail to account for non-contiguous patterns, se-

quential pattern mining methods can do so quitenaturally.

BIOINFORMATICS

Pattern mining is useful in the bioinformaticsdomain for predicting rules for organization ofcertain elements in genes, for protein function pre-diction, for gene expression analysis, for proteinfold recognition and for motif discovery in DNAsequences. We study these applications below.

Pattern Mining for Bio-Sequences

Bio-sequences typically have a small alphabet,a long length, and patterns containing gaps (i.e.,dont care) of arbitrary size. A long sequence(especially, with a small alphabet) often containslong patterns. Mining frequent patterns in suchsequences faces a different type of explosionthan in transaction sequences primarily moti-vated in market-basket analysis. (Wang, Xu, &Yu, 2004) study how this explosion affects theclassic sequential pattern mining, and present ascalable two-phase algorithm to deal with thisnew explosion.

Biosequence patterns have the form of X1

*...* Xnspanning over a long region, where each

Xiis a short region of consecutive items, called


23/286

10


a segment, and * denotes a variable length gapcorresponding to a region not conserved in theevolution. The presence of * implies that pattern

matching is more permissible and involves thewhole range in a sequence. The support of a patternis the percentage of the sequences in the databasethat contain the pattern. Given a minimum segmentlength min_len and a minimum support min_sup,a pattern X

1*...* X

nis frequent if |X

i|>=min_len

for 1


24/286

11


combination of elements is important. In this case,pattern support is the number of genes containingthe pattern in their promoter region. Minimum

length of the patterns may vary with the species.They use Apriori algorithm to perform mining.

Each element typically has a length of 1020base pairs. Therefore, two elements sometimesoverlap one another. In this study, any two ele-ments overlapping each other are not consideredto be ordered elements, because they use elementsdefined by computational prediction. Most ofthese overlapping sites may have no biologicalmeaning; they may simply be false-positive hitsduring computational prediction of elements.

The decision of how to treat such overlappingelements is reflected in the count stage if apattern consisting of element A followed by andoverlapping with B should not be considered as, we can exclude genes containing suchelements when counting the support of .This is an interesting tweak in counting support,specific to this problem.

Patterns for Predicting ProteinSequence Function

(Wang, Shang, & Li, 2008) present a novel methodof protein sequence function prediction based onsequential pattern mining. First, known functionsequence dataset is mined to get frequent patterns.Then, a classifier is built using the patterns gen-erated to predict function of protein sequences.They propose the usage of joined frequent pat-terns based and joined closed frequent patternsbased sequential pattern mining algorithms formining this data. First, the joined frequent patternsegments are generated. Then, longer frequentpatters can be obtained by combining the abovesegments. They generate closed patterns only.The purpose of producing closed patterns is to usethem to construct a classifier for protein functionprediction. So using non-redundant patterns canimprove the accuracy of classification.

Patterns for Analysis ofGene Expression Data

(Icev, 2003) introduces a sequential pattern miningbased technique for the analysis of gene expres-sion. Gene expression is the effective productionof the protein that a gene encodes. They focus onthe characterization of the expression patterns ofgenes based on their promoter regions. The pro-moter region of a gene contains short sequencescalled motifs to which gene regulatory proteinsmay bind, thereby controlling when and in whichcell types the gene is expressed. Their approachaddresses two important aspects of gene expres-

sion analysis: (1) Binding of proteins at more thanone motif is usually required, and several differenttypes of proteins may need to bind several differ-ent types of motifs in order to confer transcrip-tional specificity. (2) Since proteins controllingtranscription may need to interact physically, theorder and spacing in which motifs occur can affectexpression. They use association rules to addressthe combinatorial aspect. The association ruleshave the ability to involve multiple motifs andto predict expression in multiple cell types. Toaddress the second aspect, association rules areenhanced with information about the distancesamong the motifs, or items that are present inthe rule. Rules of interest are those whose set ofmotifs deviates properly, i.e. set of motifs whosepair-wise distances are highly conserved in thepromoter regions where these motifs occur.

They define the cvd of a pair of motifs withrespect to a collection (or item-set) I of motifs asthe ratio between the standard deviation and themean of the distances between the motifs in thosepromoter regions that contain all the motifs in I.

Given a dataset of instances D, a minimumsupport min_sup, a minimum confidence min_conf, and a maximum coefficient of variation ofdistances (max-cvd), they find all distance-basedassociation rules from D whose support and confi-dence are >= the min_sup and min_conf thresholdsand such that the cvds of all the pairs of items


25/286

12


in the rule are


26/286

13


The score of a protein with respect to a foldis calculated based on the number of sequentialpatterns of this fold contained in the protein. The

higher the number of patterns of a fold containedin a protein, the higher the score of the proteinfor this fold.

A classifier uses the extracted sequential pat-terns to classify proteins in the appropriate foldcategory. For training and evaluating the proposedmethod they used the protein sequences fromthe Protein Data Bank and the annotation of theSCOP database. The method exhibited an overallaccuracy of 25% (random would be 2.8%) in aclassification problem with 36 candidate catego-

ries. The classification performance reaches up to56% when the five most probable protein foldsare considered.

Patterns for Protein Family Detection

In another work on protein family detection (pro-tein classification), (Ferreira & Azevedo, 2005)use the number and average length of the relevantsubsequences shared with each of the proteinfamilies, as features to train a Bayes classifier.Priors for the classes are set using the number ofpatterns and average length of the patterns in thecorresponding class.

They Identify Two Types of Patterns

Rigid Gap Patterns (only contain gaps with afixed length) and Flexible Gap Patterns (allow avariable number of gaps between symbols of thesequence). Frequent patterns are mined with theconstraint of minimum length. Apart from this,they also support item constraints (restricts set ofother symbols that can occur in the pattern), gapconstraints (minGap and maxGap), duration orwindow constraints which defines the maximumdistance (window) between the first and the lastevent of the sequence patterns.

Protein sequences of the same family typicallyshare common subsequences, also called motifs.

These subsequences are possibly implied in astructural or biological function of the family andhave been preserved through the protein evolution.

Thus, if a sequence shares patterns with othersequences it is expected that the sequences arebiologically related. Considering the two typesof patterns, rigid gap patterns reveal better con-served regions of similarity. On the other hand,flexible gap patterns have a greater probabilityof occur by chance, having a smaller biologicalsignificance. Since the protein alphabet is small,many small patterns that express trivial localsimilarity may arise. Therefore, longer patternsare expected to express greater confidence in the

sequences similarity.

Patterns in DNA Sequences

Large collections of genomic information havebeen accumulated in recent years, and embeddedin them is potentially significant knowledge forexploitation in medicine and in the pharmaceuticalindustry. (Guan, Liu, & Bell, 2004) detect stringsin DNA sequences which appear frequently, eitherwithin a given sequence (e.g., for a particularpatient) or across sequences (e.g., from differentpatients sharing a particular medical diagnosis).Motifs are strings that occur very frequently.Having discovered such motifs, they show how tomine association rules by an existing rough-setsbased technique.

TELECOMMUNICATIONS

Pattern mining can be used in the field of tele-communications for mining of group patternsfrom mobile user movement data, for customerbehavior prediction, for predicting future locationof a mobile user for location based services andfor mining patterns useful for mobile commerce.We discuss these works briefly in this sub-section.


27/286

14


Patterns in Mobile UserMovement Data

(Wang, Lim, & Hwang, 2006) present a new ap-proach to derive groupings of mobile users basedon their movement data. User movement data arecollected by logging location data emitted frommobile devices tracking users. This data is ofthe form D = (D

1, D

2... D

M), where D

iis a time

series of tuples (t, (x, y, z)) denoting the x, y andz coordinates of user u

iat time t. A set of consecu-

tive time points [ta, t

b] is called a valid segment

of G (where G is a set of users) if all the pair ofusers are within dist max_dis for time [t

a,t

b], at

least one pair of users has distance greater thanmax_dis before time t

a, at least one pair of users

has distance greater than max_dis after time tband

tb-t

a+1 >=min_dur. Given a set of users G, thresh-

olds max_dis and min_dur, these form a grouppattern, denoted by P = , ifG has a valid segment. Thus, a group pattern is agroup of users that are within a distance thresholdfrom one another for at least a minimum duration.

In a movement database, a group pattern mayhave multiple valid segments. The combinedlength of these valid segments is called the weight-count of the pattern. Thus the significance of thepattern is measured by comparing its weight-countwith the overall time duration.

Since weight represents the proportion of thetime points a group of users stay close together,the larger the weight is, the more significant (orinteresting) the group pattern is. Furthermore, ifthe weight of a group pattern exceeds a thresholdmin_wei, it is called a valid group pattern, andthe corresponding group of users a valid group.

To mine group patterns, they first propose twoalgorithms, namely AGP (based on Apriori) andVG-growth (based on FP-growth). They show thatwhen both the number of users and logging dura-tion are large, AGP and VG-growth are inefficientfor the mining group patterns of size two. There-fore, they propose a framework that summarizesuser movement data before group pattern mining.

In the second series of experiments, they showthat the methods using location summarizationreduce the mining overheads for group patterns

of size two significantly.

Patterns for CustomerBehavior Prediction

Predicting the behavior of customers is challeng-ing, but important for service oriented businesses.Data mining techniques are used to make suchpredictions, typically using only recent static data.(Eichinger, Nauck, & Klawonn) propose the usageof sequence mining with decision tree analysis for

this task. The combined classifier is applied to realcustomer data and produces promising results.

They Use Two SequenceMining Parameters

maxGap, the maximum number of allowed ex-tra events in between a sequence and maxSkip,the maximum number of events at the end of asequence before the occurrence of the event tobe predicted.

They use an Apriori algorithm to detect fre-quent patterns from a Sequence tree and hashtable based data structure. This avoids multipledatabase scans, which are otherwise necessaryafter every generation of candidate sequences inApriori based algorithms.

The frequent sequences are combined withdecision tree based classification to predict cus-tomer behavior.

Patterns for Future LocationPrediction of Mobile Users

Future location prediction of mobile users canprovide location-based services (LBSs) with ex-tended resources, mainly time, to improve systemreliability which in turn increases the users confi-dence and the demand for LBSs. (Vu, Ryu, & Park,2009) propose a movement rule-based Location


28/286

15


Prediction method (RLP), to guess the users futurelocation for LBSs. They define moving sequencesand frequent patterns in trajectory data. Further,

they find out all frequent spatiotemporal move-ment patterns using an algorithm based on GSPalgorithm. The candidate generating mechanismof the technique is based on that of GSP algorithmwith an additional temporal join operation anda different method for pruning candidates. Inaddition, they employ the clustering method tocontrol the dense regions of the patterns. Withthe frequent movement patterns obtained fromthe preceding subsection, the movement rules aregenerated easily.

Patterns for Mobile Commerce

To better reflect the customer usage patterns inthe mobile commerce environment, (Yun & Chen,2007) propose an innovative mining model, calledmining mobile sequential patterns, which takesboth the moving patterns and purchase patternsof customers into consideration. How to strike acompromise among the use of various knowledgeto solve the mining on mobile sequential patterns,is a challenging issue. They devise three algorithmsfor determining the frequent sequential patternsfrom the mobile transaction sequences.

INTRUSION DETECTION

Sequential pattern mining has been used for in-trusion detection to study patterns of misuse innetwork attack data and thereby detect sequentialintrusion behaviors and for discovering multistageattack strategies.

Patterns in Network Attack Data

(Wuu, Hung, & Chen, 2007) have implementedan intrusion pattern discovery module in Snortnetwork intrusion detection system which appliesdata mining technique to extract single intrusion

patterns and sequential intrusion patterns from acollection of attack packets, and then converts thepatterns to Snort detection rules for on-line intru-

sion detection. Patterns are extracted both frompacket headers and the packet payload. A typicalpattern is of the form A packet with DA port as139, DgmLen field in header set to 48 and withcontent as 11 11. Intrusion behavior detectionengine creates an alert when a series of incom-ing packets match the signatures representingsequential intrusion scenarios.

Patterns for Discovering Multi-Stage Attack Strategies

In monitoring anomalous network activities,intrusion detection systems tend to generate alarge amount of alerts, which greatly increase theworkload of post-detection analysis and decision-making. A system to detect the ongoing attacksand predict the upcoming next step of a multistageattack in alert streams by using known attackpatterns can effectively solve this problem. Thecomplete, correct and up to date pattern rule ofvarious network attack activities plays an impor-tant role in such a system. An approach based onsequential pattern mining technique to discovermultistage attack activity patterns is efficient toreduce the labor to construct pattern rules. Butin a dynamic network environment where novelattack strategies appear continuously, the novelapproach proposed by (Li, Zhang, Li, & Wang,2007) to use incremental mining algorithm showsbetter capability to detect recently appeared attack.They remove the unexpected results from miningby computing probabilistic score between suc-cessive steps in a multistage attack pattern. Theyuse GSP to discover multistage attack behaviorpatterns. All the alerts stored in database can beviewed as a global sequence of alerts sorted byascending DetectTime timestamp. Sequences ofalerts describe the behavior and actions of attack-ers. Multistage attack strategies can be found byanalyzing this alert sequence. A sequential pattern


29/286

16


is a collection of alerts that occur relatively closeto each other in a given order frequently. Oncesuch patterns are known, the rules can be produced

for describing or predicting the behavior of thesequence of network attack.

OTHER APPLICATIONS

Apart from the different domains mentionedabove, sequential pattern mining has been founduseful in a variety of other domains. We brieflymention works in some of such areas in this sub-section. Besides the works mentioned below, there

are some applications that may need to classifysequence data, such as based on sequence patterns.An overview on research in sequence classificationcan be found in (Xing, Pei & Keogh).

Patterns in Earth Science Data

The earth science data consists of time series mea-surements for various Earth science and climatevariables (e.g. soil moisture, temperature, andprecipitation), along with additional data fromexisting ecosystem models (e.g. Net PrimaryProduction). The ecological patterns of interestinclude associations, clusters, predictive models,and trends. (Potter, Klooster, Torregrosa, Tan,Steinbach, & Kumar) discuss some of the chal-lenges involved in preprocessing and analyzingthe data, and also consider techniques for handlingsome of the spatio-temporal issues. Earth Sciencedata has strong seasonal components that needto be removed prior to pattern analysis, as Earthscientists are primarily interested in patternsthat represent deviations from normal seasonalvariation such as anomalous climate events (e.g.,El Nino) or trends (e.g., global warming). Theyde-seasonalize the data and then compute varietyof spatio-temporal patterns. Rules learned fromthe patterns look like (WP-Hi) (Solar-Hi) (NINO34-Lo) (Temp-Hi) (NPP-Lo) where

WP, Solar etc are different earth science parameterswith values Hi (High) or Lo (Low).

Patterns for ComputerSystems Management

Predictive algorithms play a crucial role in sys-tems management by alerting the user to potentialfailures. (Vilalta, Apte, Hellerstein, Ma, & Weiss,2002) focus on three case studies dealing with theprediction of failures in computer systems: (1)long-term prediction of performance variables(e.g., disk utilization), (2) short-term predictionof abnormal behavior (e.g., threshold violations),

and (3) short-term prediction of system events(e.g., router failure). Empirical results showthat predictive algorithms based on mining ofsequential patterns can be successfully employedin the estimation of performance variables and theprediction of critical events.

Patterns to Detect Plan Failures

(Zaki, Lesh, & Mitsunori, 1999) present an al-gorithm to extract patterns of events that predictfailures in databases of plan executions: Plan-Mine. Analyzing execution traces is appropriatefor planning domains that contain uncertainty,such as incomplete knowledge of the world oractions with probabilistic effects. They extractcauses of plan failures and feed the discoveredpatterns back into the planner. They label eachplan as good or bad depending on whetherit achieved its goal or it failed to do so. The goalis to find interesting sequences that have a highconfidence of predicting plan failure. They useSPADE to mine such patterns.

TRIPS is an integrated system in which aperson collaborates with a computer to develop ahigh quality plan to evacuate people from a smallisland. During the process of building the plan,the system simulates the plan repeatedly basedon a probabilistic model of the domain, includ-


30/286

17


ing predicted weather patterns and their effect onvehicle performance.

The system returns an estimate of the plans

success. Additionally, TRIPS invokes PlanMineon the execution traces produced by simulation,in order to analyze why the plan failed when itdid. The system runs PlanMine on the executiontraces of the given plan to pinpoint defects in theplan that most often lead to plan failure. It thenapplies qualitative reasoning and plan adaptationalgorithms to modify the plan to correct the defectsdetected by PlanMine.

Patterns in Automotive

Warranty Data

When a product fails within a certain time period,the warranty is a manufacturers assurance to abuyer that the product will be repaired withouta cost to the customer. In a service environmentwhere dealers are more likely to replace than torepair, the cost of component failure during thewarranty period can easily equal three to ten timesthe suppliers unit price. Consequently, companiesinvest significant amounts of time and resourcesto monitor, document, and analyze product war-ranty data. (Buddhakulsomsiri & Zakarian, 2009)present a sequential pattern mining algorithm thatallows product and quality engineers to extracthidden knowledge from a large automotive war-ranty database. The algorithm uses the elementaryset concept and database manipulation techniquesto search for patterns or relationships amongoccurrences of warranty claims over time. Thesequential patterns are represented in a form ofIFTHEN association rules, where the IF portionof the rule includes quality/warranty problems,represented as labor codes, that occurred in anearlier time, and the THEN portion includeslabor codes that occurred at a later time. Once aset of unique sequential patterns is generated, thealgorithm applies a set of thresholds to evaluatethe significance of the rules and the rules thatpass these thresholds are reported in the solution.

Significant patterns provide knowledge of one ormore product failures that lead to future productfault(s). The effectiveness of the algorithm is il-

lustrated with the warranty data mining applicationfrom the automotive industry.

Patterns in Alarm Data

Increasingly powerful fault management systemsare required to ensure robustness and quality ofservice in todays networks. In this context, eventcorrelation is of prime importance to extractmeaningful information from the wealth of alarmdata generated by the network. Existing sequen-

tial data mining techniques address the task ofidentifying possible correlations in sequences ofalarms. The output sequence sets, however, maycontain sequences which are not plausible fromthe point of view of network topology constraints.(Devitt, Duffin, & Moloney, 2005) presents theTopographical Proximity (TP) approach whichexploits topographical information embedded inalarm data in order to address this lack of plausibil-ity in mined sequences. Their approach is basedon an Apriori approach and introduces a novelcriterion for sequence selection which evaluatessequence plausibility and coherence in the contextof network topology. Connections are inferred atrun-time between pairs of alarm generating nodesin the data and a Topographical Proximity (TP)measure is assigned based on the strength of theinferred connection. The TP measure is used toreject or promote candidate sequences on the basisof their plausibility, i.e. the strength of their con-nection, thereby reducing the candidate sequenceset and optimizing the space and time constraintsof the data mining process.

Patterns for PersonalizedRecommendation System

(Romero, Ventura, Delgado, & Bra, 2007) describea personalized recommender system that uses webmining techniques for recommending a student


31/286

18


which (next) links to visit within an adaptableeducational hypermedia system. They present aspecific mining tool and a recommender engine

that helps the teacher to carry out the whole webmining process. The overall process of Web per-sonalization based on Web usage mining generallyconsists of three phases: data preparation, patterndiscovery and recommendation. The first twophases are performed off-line and the last phaseon-line. To make recommendations to a student,the system first, classifies the new students in oneof the groups of students (clusters). Then, it onlyuses the sequential patterns of the correspond-ing group to personalize the recommendations

based on other similar students and his currentnavigation. Grouping of students is done usingk-means. They use GSP to get frequent sequencesfor each of the clusters. They mine rules of theform readmeinstall, welcomeinstall which areintuitively quite common patterns for websites.

Patterns in AtmosphericAerosol Data

EDAM (Exploratory Data Analysis and Man-agement) is a joint project between researchersin Atmospheric Chemistry and Computer Sci-ence at Carleton College and the University ofWisconsin-Madison that aims to develop datamining techniques for advancing the state of theart in analyzing atmospheric aerosol datasets.

The traditional approach for particle measure-ment, which is the collection of bulk samples ofparticulates on filters, is not adequate for studyingparticle dynamics and real-time correlations. Thishas led to the development of a new generationof real-time instruments that provide continuousor semi-continuous streams of data about certainaerosol properties. However, these instrumentshave added a significant level of complexity to at-mospheric aerosol data, and dramatically increasedthe amounts of data to be collected, managed, andanalyzed. (Ramakrishnan, et al., 2005) experiment

with a dataset consisting of samples from aerosoltime-of-flight mass spectrometer (ATOFMS).

A mass spectrum is a plot of signal intensity

(often normalized to the largest peak in the spec-trum) versus the mass-to-charge (m/z) ratio ofthe detected ions. Thus, the presence of a peakindicates the presence of one or more ions con-taining the m/z value indicated, within the ioncloud generated upon the interaction betweenthe particle and the laser beam. In many cases,the ATOFMS generates elemental ions. Thus, thepresence of certain peaks indicates that elementssuch as Na+ (m/z = +23) or Fe+ (m/z = +56) orO- (m/z = -16) ions are present. In other cases,

cluster ions are formed, and thus the m/z observedrepresents that of a sum of the atomic weights ofvarious elements.

For many kinds of analysis, what is significantin each particles mass spectrum is the composi-tion of the particle, i.e., the ions identified by thepeak labels (and, ideally, their proportions in theparticle, and our confidence in having correctlyidentified them). While this representation isless detailed than the labeled spectrum itself, itallows us to think of the ATOFMS data stream asa time-series of observations, one per observedparticle, where each observation is a set of ions(possibly labeled with some additional details).This is precisely the market-basket abstractionused in e-commerce: a time-series of customertransactions, each recording the items purchasedby a customer on a single visit to a store. Thisanalogy opens the door to applying a wide rangeof association rule and sequential pattern algo-rithms to the analysis of mass spectrometry data.Once these patterns are mined, they can be used toextrapolate to periods where filter-based sampleswere not collected.

Patterns in Individuals Time Diaries

Identifying patterns of activities within indi-viduals time diaries and studying similarities anddeviations between individuals in a population


32/286

19


is of interest in time use research. So far, activ-ity patterns in a population have mostly beenstudied either by visual inspection, searching for

occurrences of specific activity sequences andstudying their distribution in the population, orstatistical methods such as time series analysisin order to analyze daily behavior. (Vrotsou, El-legrd, & Cooper) describe a new approach forextracting activity patterns from time diaries thatuses sequential data mining techniques. Theyhave implemented an algorithm that searches thetime diaries and automatically extracts all activ-ity patterns meeting user-defined criteria of whatconstitutes a valid pattern of interest. Amongst the

many criteria which can be applied are: a timewindow containing the pattern, and minimumand maximum number of people that perform thepattern. The extracted activity patterns can thenbe interactively filtered, visualized and analyzedto reveal interesting insights using the VISUAL-TimePAcTS application. To demonstrate the valueof this approach they consider and discuss sequen-tial activity patterns at a population level, from asingle day perspective, with focus on the activitypaid work and some activities surrounding it.

An activity pattern in this paper is defined as asequence of activities performed by an individualwhich by itself or together with other activities,aims at accomplishing a more general goal/proj-ect. When analyzing a single day of diary data,activity patterns identified in a single individual(referred to as an individual activity pattern) areunlikely to be significant but those found amongsta group or population (a collective activity pat-tern) are of greater interest. Seven categories ofactivities that they consider are: care for oneself,care for others, household care, recreation/reflec-tion, travel, prepare/procure food, work/school.{cook dinner; eat dinner; wash dishes} isa typical pattern. They also incorporate a varietyof constraints like min and max pattern duration,min and max gap between activities, min andmax number of occurrences of the pattern andmin and max number of people (or a percentage

of the population) that should be performing thepattern. The sequential mining algorithm thatthey have used for the activity pattern extraction

is an AprioriAll algorithm which is adapted tothe time diary data.

Two stage classification using patterns: (Ex-archos, Tsipouras, Papaloukas, & Fotiadis, 2008)present a methodology for sequence classification,which employs sequential pattern mining andoptimization, in a two-stage process. In the firststage, a sequence classification model is defined,based on a set of sequential patterns and two setsof weights are introduced, one for the patterns andone for classes. In the second stage, an optimiza-

tion technique is employed to estimate the weightvalues and achieve optimal classification accuracy.Extensive evaluation of the methodology is car-ried out, by varying the number of sequences, thenumber of patterns and the number of classes andit is compared with similar sequence classifica-tion approaches.

CONCLUSION

We presented selected applications of the se-quential pattern mining methods in the fields ofhealthcare, education, web usage mining, textmining, bioinformatics, telecommunications,intrusion detection, etc. We envision that thepower of sequential mining methods has not yetbeen fully exploited. We hope to see many morestrong applications of these methods in a varietyof domains in the years to come.

REFERENCES

Berendt, B. A. (2000). Analysis of navigationbehaviour in web sites integrating multiple infor-mation systems. The VLDB Journal, 9(1), 5675.doi:10.1007/s007780050083


33/286

20


Buchner, A. G., & Mulvenna, M. D. (1998). Dis-covering Internet marketing intelligence throughonline analytical web usage mining. SIGMOD Re-cord, 27(4), 5461. doi:10.1145/306101.306124

Buddhakulsomsiri, J., & Zakarian, A. (2009). Se-quential pattern mining algorithm for automotivewarranty data.Journal of Computers and Indus-trial Engineering, 57(1), 137147. doi:10.1016/j.cie.2008.11.006

Chen, Y.-L., & Huang, T. C.-K. (2008). A novelknowledge discovering model for mining fuzzymulti-level sequential patterns in sequence data-bases. Data & Knowledge Engineering, 66(3),349367. doi:10.1016/j.datak.2008.04.005

Cooley, R., Mobasher, B., & Srivastava, J. (1999).Data preparation for mining World Wide Webbrowsing patterns. Knowledge and InformationSystems, 1(1), 532.

Devitt, A., Duffin, J., & Moloney, R. (2005).Topographical proximity for mining networkalarm data.MineNet 05: Proceedings of the 2005ACM SIGCOMM workshop on Mining network

data(pp. 179-184). Philadelphia, PA: ACM.

Eichinger, F., Nauck, D. D., & Klawonn, F. (n.d.).Sequence mining for customer behaviour predic-

tions in telecommunications.

Exarchos, T. P., Papaloukas, C., Lampros, C., &Fotiadis, D. I. (2008). Mining sequential patternsfor protein fold recognition.Journal of Biomedi-cal Informatics, 41(1), 165179. doi:10.1016/j.jbi.2007.05.004

Exarchos, T. P., Tsipouras, M. G., Papaloukas, C.,

& Fotiadis, D. I. (2008). A two-stage methodologyfor sequence classification based on sequentialpattern mining and optimization.Data & Knowl-edge Engineering,66(3), 467487. doi:10.1016/j.datak.2008.05.007

Ferreira, P. G., & Azevedo, P. J. (2005). Proteinsequence classification through relevant sequencemining and bayes classifiers.Proc. 12th Portu-guese Conference on Artificial Intelligence (EPIA)(pp. 236-247). Springer-Verlag.

Garboni, C., Masseglia, F., & Trousse, B. (2005).Sequential pattern mining for structure-based

XML document classification. Workshop of theINitiative for the Evaluation of XML Retrieval.

Guan, J. W., Liu, D., & Bell, D. A. (2004). Dis-covering motifs in DNA sequences. Fundam.Inform., 59(2-3), 119134.

Icev, A. (2003).Distance-enhanced associationrules for gene expression.BIOKDD03, in con-junction with ACM SIGKDD.

Ishio, T., Date, H., Miyake, T., & Inoue, K. (2008).Mining coding patterns to detect crosscutting con-cerns in Java programs. WCRE 08: Proceedingsof the 2008 15th Working Conference on Reverse

Engineering (pp. 123-132). Washington, DC:IEEE Computer Society.

Jaillet, S., Laurent, A., & Teisseire, M. (2006).

Sequential patterns for text categorization.Intel-ligent Data Analysis, 10(3), 199214.

Kay, J., Maisonneuve, N., Yacef, K., & Zaane,O. (2006).Mining patterns of events in studentsteamwork data. In Educational Data MiningWorkshop, held in conjunction with IntelligentTutoring Systems (ITS), (pp. 45-52).

Kum, H.-C., Chang, J. H., & Wang, W. (2006).Sequential Pattern Mining in Multi-Databases viaMultiple Alignment. Data Mining and Knowl-

edge Discovery, 12(2-3), 151180. doi:10.1007/s10618-005-0017-3

Kum, H.-C., Chang, J. H., & Wang, W. (2007).Benchmarking the effectiveness of sequentialpattern mining methods. Data & KnowledgeEngineering, 60(1), 3050. doi:10.1016/j.datak.2006.01.004


34/286

21


Kuo, R. J., Chao, C. M., & Liu, C. Y. (2009). In-tegration of K-means algorithm and AprioriSomealgorithm for fuzzy sequential pattern mining.Ap-plied Soft Computing, 9(1), 8593. doi:10.1016/j.asoc.2008.03.010

Lau, A., Ong, S. S., Mahidadia, A., Hoffmann,A., Westbrook, J., & Zrimec, T. (2003). Miningpatterns of dyspepsia symptoms across time pointsusing constraint association rules. PAKDD03:Proceedings of the 7th Pacific-Asia conference on

Advances in knowledge discovery and data mining(pp. 124-135). Seoul, Korea: Springer-Verlag.

Laur, P.-A., Symphor, J.-E., Nock, R., & Pon-celet, P. (2007). Statistical supports for miningsequential patterns and improving the incrementalupdate process on data streams.Intelligent DataAnalysis, 11(1), 2947.

Lent, B., Agrawal, R., & Srikant, R. (1997). Dis-covering trends in text databases.Proc. 3rd Int.Conf. Knowledge Discovery and Data Mining,

KDD(pp. 227-230). AAAI Press.

Li, Z., Zhang, A., Li, D., & Wang, L. (2007). Dis-covering novel multistage attack strategies.ADMA07: Proceedings of the 3rd international confer-

ence on Advanced Data Mining and Applications(pp. 45-56). Harbin, China: Springer-Verlag.

Lin, N. P., Chen, H.-J., Hao, W.-H., Chueh, H.-E.,& Chang, C.-I. (2008). Mining strong positive andnegative sequential patterns. W.Trans. on Comp.,7(3), 119124.

Mannila, H., Toivonen, H., & Verkamo, I. (1997).Discovery of frequent episodes in event sequences.

Data Mining and Knowledge Discovery, 1(3),259289. doi:10.1023/A:1009748302351

Masseglia, F., Poncelet, P., & Teisseire, M. (2003).Incremental mining of sequential patterns in largedatabases.Data & Knowledge Engineering,46(1),97121. doi:10.1016/S0169-023X(02)00209-4

Masseglia, F., Poncelet, P., & Teisseire, M. (2009).Efficient mining of sequential patterns with timeconstraints: Reducing the combinations.ExpertSystems with Applications, 36(2), 26772690.doi:10.1016/j.eswa.2008.01.021

Mendes, L. F., Ding, B., & Han, J. (2008). Streamsequential pattern mining with precise errorbounds.Proc. 2008 Int. Conf. on Data Mining(ICDM08),Italy, Dec. 2008.

Mobasher, B., Dai, H., Luo, T., & Nakagawa, M.(2002). Using sequential and non-sequential pat-terns in predictive Web usage mining tasks.ICDM02: Proceedings of the 2002 IEEE International

Conference on Data Mining(pp. 669-672). Wash-ington, DC: IEEE Computer Society.

Nicolas, J. A., Herengt, G., & Albuisson, E. (2004).Sequential pattern mining and classification ofpatient path. MEDINFO 2004: Proceedings OfThe 11th World Congress On Medical Informatics.

Parthasarathy, S., Zaki, M., Ogihara, M., &Dwarkadas, S. (1999). Incremental and interactivesequence mining. In Proc. of the 8th Int. Conf.on Information and Knowledge Management

(CIKM99).

Perera, D., Kay, J., Yacef, K., & Koprinska, I.(2007).Mining learners traces from an onlinecollaboration tool. Proceedings of Educational

Data Mining workshop(pp. 6069). CA, USA:Marina del Rey.

Pinto, H., Han, J., Pei, J., Wang, K., Chen, Q., &Dayal, U. (2001). Multi-dimensional sequentialpattern mining. CIKM 01: Proceedings of the

Tenth International Conference on Informationand Knowledge Management(pp. 81-88). NewYork, NY: ACM.

Potter, C., Klooster, S., Torregrosa, A., Tan, P.-N., Steinbach, M., & Kumar, V. (n.d.).Findingspatio-temporal patterns in earth science data.


35/286

22


Ramakrishnan, R., Schauer, J. J., Chen, L., Huang,Z., Shafer, M. M., & Gross, D. S. (2005). TheEDAM project: Mining atmospheric aerosol da-

tasets: Research articles.International Journal ofIntelligent Systems, 20(7), 759787. doi:10.1002/int.20094

Romero, C., Ventura, S., Delgado, J. A., & Bra,P. D. (2007).Personalized links recommendationbased on data mining un adaptive educational

hypermedia systems. Creating New LearningExperiences on a Global Scale. Second EuropeanConference on Technology Enhanced Learning,EC-TEL 2007 (pp. 293-305). Crete, Greece:

Springer.Seno, M., & Karypis, G. (2002). SLPMiner: Analgorithm for finding frequent sequential patternsusing length-decreasing support constraint. InProceedings of the 2nd IEEE International Con-

ference on Data Mining (ICDM), (pp. 418-425).

Srikant, R., & Agrawal, R. (1996)...Advances inDatabase Technology EDBT, 96, 317.

Terai, G., & Takagi, T. (2004). Predicting ruleson organization of cis-regulatory elements, tak-ing the order of elements into account. Bioin-formatics (Oxford, England), 20(7), 11191128.doi:10.1093/bioinformatics/bth049

Tsuboi, Y. (2002).Authorship identification forheterogeneous documents.

Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., &Weiss, S. M. (2002). Predictive algorithms in themanagement of computer systems.IBM SystemsJournal,41(3), 461474. doi:10.1147/sj.413.0461

Vrotsou, K., Ellegrd, K., & Cooper, M. (n.d.).Exploring time diaries using semi-automated

activity pattern extraction.

Vu, T. H., Ryu, K. H., & Park, N. (2009). Amethod for predicting