Discovering Temporal Change Patterns in the Presence of Taxonomies

Download Discovering Temporal Change Patterns in the Presence of Taxonomies

Post on 04-Dec-2016

212 views

Category:

Documents

0 download

TRANSCRIPT

  • Discovering Temporal Change Patternsin the Presence of Taxonomies

    Luca Cagliero

    AbstractFrequent itemset mining is a widely exploratory technique that focuses on discovering recurrent correlations among data.

    The steadfast evolution of markets and business environments prompts the need of data mining algorithms to discover significant

    correlation changes in order to reactively suit product and service provision to customer needs. Change mining, in the context of

    frequent itemsets, focuses on detecting and reporting significant changes in the set of mined itemsets from one time period to another.

    The discovery of frequent generalized itemsets, i.e., itemsets that 1) frequently occur in the source data, and 2) provide a high-level

    abstraction of the mined knowledge, issues new challenges in the analysis of itemsets that become rare, and thus are no longer

    extracted, from a certain point. This paper proposes a novel kind of dynamic pattern, namely the HIstory GENeralized Pattern (HIGEN),

    that represents the evolution of an itemset in consecutive time periods, by reporting the information about its frequent generalizations

    characterized by minimal redundancy (i.e., minimum level of abstraction) in case it becomes infrequent in a certain time period. To

    address HIGEN mining, it proposes HIGEN MINER, an algorithm that focuses on avoiding itemset mining followed by postprocessing by

    exploiting a support-driven itemset generalization approach. To focus the attention on the minimally redundant frequent

    generalizations and thus reduce the amount of the generated patterns, the discovery of a smart subset of HIGENs, namely the NON-

    REDUNDANT HIGENs, is addressed as well. Experiments performed on both real and synthetic datasets show the efficiency and the

    effectiveness of the proposed approach as well as its usefulness in a real application context.

    Index TermsData mining, mining methods and algorithms

    1 INTRODUCTION

    THE problem of discovering relevant data recurrencesand their most significant temporal trends is becomingan increasingly appealing research topic. The application offrequent itemset mining and association rule extractionalgorithms [1] to discover valuable correlations among datahas been thoroughly investigated in a number of differentapplication contexts (e.g., market basket analysis [1],medical image processing [4]). In the last years, the steadygrowth of business-oriented applications tailored to theextracted knowledge prompted the need of analyzing theevolution of the discovered patterns. Since, in manybusiness environments, companies are expected to reac-tively suit product and service provision to customer needs,the investigation of the most notable changes between theset of frequent itemsets or association rules mined fromdifferent time periods has become an appealing researchtopic [2], [9], [12], [19], [23], [27].

    Frequent itemset mining activity is constrained by aminimum support threshold to discover patterns whoseobserved frequency in the source data (i.e., the support) isequal to or exceeds a given threshold [1]. However, theenforcement of low support thresholds may entail generat-ing a very large amount of patterns which may becomehard to look into. On the other hand, higher supportthreshold enforcement may also prune relevant but not

    enough frequent recurrences. To overcome these issues,generalized itemset extraction could be exploited. General-ized itemsets, which have been first introduced in [3] in thecontext of market basket analysis, are itemsets that providea high level abstraction of the mined knowledge. Byexploiting a taxonomy (i.e., a is-a hierarchy) over dataitems, items are aggregated into higher level (generalized)ones. Tables 1a and 1b report two example datasets,associated, respectively, with two consecutive months ofyear 2009 (i.e., January and February). Each recordcorresponds to a product sale. In particular, for each salethe product description, the date, the time, and the locationin which it took place are reported. For the sake ofsimplicity, in this preliminary analysis I am interested inmining generalized itemsets involving just the location orthe product description. Fig. 1 shows a simple hierarchydefined on the location attribute. Table 1 reports the set ofall possible frequent generalized and not generalizeditemsets, and their corresponding absolute support values(i.e., observed frequencies), mined, by means of a tradi-tional approach [3], from the example datasets D1 and D2,by exploiting the taxonomy reported in Fig. 1 and byenforcing an absolute minimum support threshold equal to2. Readers can notice that some of the itemsets satisfy thesupport threshold in both D1 and D2, while the others arefrequent in only one out of two months.

    This paper focuses on change mining in the context offrequent itemsets by exploiting generalized itemsets torepresent patterns that become rare with respect to thesupport threshold, and thus are no longer extracted, at acertain point. Previous approaches allow both keeping trackof the evolution of the most significant pattern qualityindexes (e.g., [2], [7], [8]) and discovering their mostfundamental changes (e.g., [5], [12], [19]). Consider, for

    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013 541

    . The author is with the Politecnico di Torino Dipartimento di Automatica eInformatica, Corso Duca degli Abruzzi 24, Turin 10129, Italy.E-mail: luca.cagliero@polito.it.

    Manuscript received 2 May 2011; revised 2 Nov. 2011; accepted 7 Nov. 2011;published online 10 Nov. 2011.For information on obtaining reprints of this article, please send e-mail to:tkde@computer.org, and reference IEEECS Log Number TKDE-2011-05-0240.Digital Object Identifier no. 10.1109/TKDE.2011.233.

    1041-4347/13/$31.00 2013 IEEE Published by the IEEE Computer Society

  • instance, the itemsets {T-shirt, Turin} and {Jacket, Paris}. Theformer itemset is infrequent in D1 (i.e., its support value inD1 is lower than the support threshold) and becomesfrequent inD2 due to a support value increase from Januaryto February, while the latter shows an opposite trend. Albeitthese pieces of information might be of interest for businessdecision making, the discovery of higher level correlations,in the form of generalized itemsets, issues new challenges inthe investigation of pattern temporal trends. Item setgeneralization may allow preventing infrequent knowledgediscarding. At the same time, experts may look into theinformation provided by the sequence of generalizations orspecializations of the same pattern in consecutive timeperiods. For instance, the fact that the generalized itemset{T-shirt, Italy} (frequent in both January and February 2009)has a specialization (i.e., {T-shirt, Turin}) that is infrequent inJanuary while becomes frequent in the next month may bedeemed relevant for decision making and, thus, might bereported as it highlights a temporal correlation amongspatial recurrences regarding product T-shirt. Similarly, theinformation that, although {Jacket, Paris} becomes infrequentwith respect to the minimum support threshold movingfrom January to February, its generalization {Jacket, France}remains frequent in both months could be relevant foranalyst decision making as well. However, to the best of myknowledge, this kind of information cannot be neitherdirectly mined nor concisely represented by means of anypreviously proposed approach.

    This paper proposes a novel kind of dynamic pattern,namely theHIstory GENeralized Pattern (HIGENs). AHIGEN

    compactly represents the minimum sequence of general-izations needed to keep knowledge provided by a notgeneralized itemset frequent, with respect to the minimumsupport threshold, in each time period. If an itemset isfrequent in each time period, the corresponding HIGEN justreports its support variations. Otherwise, when the itemsetbecomes infrequent at a certain point, the HIGEN reports theminimumnumber of generalizations (and the correspondinggeneralized itemsets) needed tomake its covered knowledgefrequent at a higher level of abstraction. I argue that afrequent generalization at minimum abstraction level repre-sents the rare knowledge with minimal redundancy. In casean infrequent not generalized itemset has multiple general-izations belonging to the same minimal aggregation level,many HIGENs associated with the same itemset aregenerated. To focus the attention of the analysts on thefrequent generalizations of a rare itemset covering thesame knowledge with a minimal amount of redundancyand, thus, reduce the number of the generated patterns, amore selective type of HIGEN, i.e., the NONREDUNDANTHIGEN, is proposed as well. NONREDUNDANT HIGENs areHIGENs that include, for each time period, the frequentgeneralizations of the reference itemset of minimal general-ization level characterized by minimal support.

    In Table 2 the set of HIGENs, mined from the exampledataset by enforcing an absolute minimum support thresh-old equal to 2, is reported. The reference itemset is writtenin boldface. The relationship % means that the right-handside itemset is a generalization of the left-hand side one,&implies a specialization relationship, whilee> implies that noabstraction level change occurs. For instance, {Jacket, Paris}% {Jacket, France} is a HIGEN stating that, from January (D1)

    542 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013

    TABLE 1Running Example Datasets

    Fig. 1. Generalization hierarchy for the location attribute in exampledatasets D1 and D2.

    TABLE 2Extracted Patterns

  • to February (D2) 2009, the not generalized itemset {Jacket,Paris} becomes infrequent with respect to the minimumsupport threshold while its upper level generalization (seeFig. 1) remains frequent. All the HIGENs reported inTable 2b refer to a different not generalized referenceitemset. In case the extraction would generate more HIGENsper itemset, the generation of the NONREDUNDANT HI-GENs generates just the HIGENs covering the minimalamount of redundancy, i.e., the ones including general-izations with minimal support.

    To address HIGEN mining, the HIstory GENeralizedPattern MINER (HIGEN MINER) has been proposed. Insteadof generating the HIGENs by means of a postprocessingstep that follows the generalized itemset mining processperformed on data collected at each time interval, theHIGEN MINER algorithm directly addresses the HIGENmining by extending a support-driven generalizationapproach, first proposed in [6], to a dynamic context. Theproposed algorithm avoids both redundant knowledgeextraction followed by postpruning and multiple taxonomyevaluations over the same pattern mined from differenttime intervals. A slightly modified version of the HIGENMINER algorithm is also proposed to address NONREDUN-DANT HIGEN mining.

    The proposed approach may be profitably exploited in anumber of different application domains (e.g., context-aware domain, network traffic analysis, social networkanalysis). Experiments, reported in Section 6, show 1) theeffectiveness of the proposed approach in supporting expertdecision making through the analysis of context datacoming from a context-aware mobile environment and2) the efficiency and the scalability of the proposed HIGENMINER algorithm on synthetic datasets.

    This paper is organized as follows. Section 2 discussesand compares state-of-the-art approaches concerning bothchange mining and generalized itemset mining with theproposed approach. Section 3 introduces preliminarynotions, while Section 4 formally states the HIGEN miningproblem. Section 5 describes the HIGEN MINER algorithm.Section 6 evaluates the efficiency and the effectiveness ofthe proposed approach. Finally, Section 7 draws conclu-sions and presents related future work.

    2 RELATED WORK

    Frequent itemset mining has been first proposed in [1], in thecontext of market basket analysis, as the first step of theassociation rule extraction process. The problem of discover-ing relevant changes in the history of itemsets andassociation rules has been already addressed by a numberof research papers (e.g., [2], [5], [7], [8], [12], [19]). Active datamining [2] was the first attempt to represent and query thehistory pattern of the discovered association rule qualityindexes. It first mines rules, from datasets collected indifferent time periods, by adding rules and their relatedquality indexes (e.g., support and confidence [1]) to acommon rule base. Next, the analyst could specify a historypattern in a trigger which is fired when such a pattern trendis exhibiting. Albeit history patterns allow tracking mostnotable pattern index changes, they neither keep theinformation covered by rare patterns nor show links among

    patterns at different abstraction levels. More recently, othertime-related data mining frameworks tailored to monitorand detect changes in rule support and confidence have beenproposed [7], [8], [25]. In [8], patterns are evaluated andpruned based on both subjective and objective interesting-ness measures. The aim of [7] is to monitor the minedpatterns and reduce the effort spent in data mining. Toachieve this goal, new patterns are observed as soon as theyemerge, and old patterns are removed from the rule baseas soon as they become extinct. To further reduce thecomputational cost, at one time period a subset of rules isselected and monitored, while data changes that occur insubsequent periods are measured by their impact on therules being monitored. Similarly, [25] addresses itemsetchange mining from time-varying data streams. However,the approaches presented in [7], [8], and [25] do not addressrare knowledge representation by means of generalizeditemsets. Differently, Liu et al. [20] deal with rule changemining by discovering two main types of rules: 1) the stablerules, i.e., rules that do not change a great deal over time and,thus, are more reliable and could be trusted, and 2) the trendrules, i.e., rules that indicate some underlying systematictrends of potential interest. Similarly, this paper categorizesthe HIGENs based on their temporal trend. However, theirevolution has been classified in terms of itemset general-izations or specializations that occur over time. The problemof discovering the subset of most relevant changes inassociation rules has been addressed by 1) evaluating rulechanges that occur between two time periods by means ofchi-square test [19], 2) searching for border descriptions ofemerging patterns extracted from a pair of datasets [12], and3) applying a fuzzy approach to rule change evaluation [5].Unlike [5], [12], [19], the proposed approach allows bothcomparing more than two time periods at a time andrepresenting patterns that become infrequent in a certaintime period by means of one of their generalizationscharacterized by minimal redundancy.

    A parallel issue concerns the detection of most notablechanges in multidimensional data measures. Imieliski et al.[18] first introduce the concept of cubegrade and comparesthe multidimensional data cube cells with their gradientcells, namely their ancestors, descendants, and siblings.Dong et al. [11] extend the work proposed in [18] bypushing antimonotone constraints into the mining of highlysimilar data cube cell pairs with hugely different measurevalues. The HIGEN mining problem is somehow related tobut different from that addressed in [11] and [18] as itconcerns the discovery, from relational data, of temporalchange patterns composed of frequent generalized itemsetsat minimal abstraction level.

    A parallel effort has been devoted to proposing moreefficient and effective algorithms to address generalizedfrequent itemset mining. The problem of frequent general-ized itemset mining has been first introduced in [3] in thecontext of market basket analysis as an extension of thetraditional frequent itemset extraction task [1]. It generatesitemsets by considering, for each item, all its parents in ataxonomy (i.e., a is-a hierarchy defined on data items).Hence, all combinations of candidate frequent itemsets aregenerated by exhaustively evaluating the taxonomy. Toprevent redundant knowledge extraction, several optimiza-tion strategies have been proposed (e.g., [6], [15], [16], [17]).

    CAGLIERO: DISCOVERING TEMPORAL CHANGE PATTERNS IN THE PRESENCE OF TAXONOMIES 543

  • Among them, Baralis et al. [6] proposed a support-drivenitemset generalization approach, i.e., a frequent generalizeditemset is extracted only if it has at least an infrequentdescendant. This paper extends the generalization proce-dure proposed in [6] to a dynamic context to addressHIGEN mining from a sequence of time-related datasetswithout the need of a postprocessing step. Unlike [6], itdoes not extract all frequent generalizations of an infre-quent pattern, but rather generates only the ones character-ized by minimal redundancy.

    3 PRELIMINARY DEFINITIONS

    This section introduces main notions concerning dynamicgeneralized itemset mining from structured data.

    A structured dataset is a collection of records, whereeach record is a set of pairs attribute; value (e.g., (date,5:05 p.m.)), called items, which identify specific datafeatures (e.g., the time) and their values in the correspond-ing domains (e.g., 5:05 p.m.). Dynamic data miningconsiders datasets associated with different time periods.Thus, I introduce the notion of timestamped structureddataset. Its formal definition follows.

    Definition 3.1 (Timestamped structured dataset). Let T ft1; t2; . . . ; tng be a set of data features and f1;2; . . . ;ng its corresponding domains. ti may be either a categoricalor a numeric discrete data feature. Let W wstart; wend be atime interval. A timestamped structured dataset D is acollection of records, where 1) each record r is a set of pairsti; valuei where ti 2 T and valuei 2 i, 2) each ti 2 T , alsocalled attribute, may occur at most once in any record, and3) each record has a time stamp rts 2W .

    In the case of datasets with continuous attributes, the valuerange should be discretized into intervals, and the intervalsmapped into consecutive positive integers. Consider againthe example datasets D1 and D2 reported in Tables 1a and1b. They are examples of timestamped structured datasetscomposed of four attributes (i.e., date, time, location, andproduct description), among which the date attribute isselected as time stamp. Dataset D1 includes product sales(i.e., records) whose time stamp ranges from 2009-01-01 to2009-01-31 (see Table 1a), while D2 includes product saleswhose time stamp ranges from 2009-02-01 to 2009-02-28 (seeTable 1b).

    Generalized itemset mining exploits a set of general-ization hierarchies, i.e., a taxonomy, built over data items togeneralize at higher levels of abstraction items representedin a timestamped structured dataset. A taxonomy is a is-ahierarchy built over data item values composed of aggrega-tion hierarchies defined on each data attribute.

    Definition 3.2 (Taxonomy). Let T ft1; t2; . . . ; tng be a set ofattributes and f1;2; . . . ;ng its corresponding do-mains. Let fGH1; . . . ; GHmg be a set of generalizationhierarchies defined on T . Each GHi represents a predefinedhierarchy of aggregations over values in i. GHi leaves are allthe values in i. Each nonleaf node in GHi is an aggregationof all its children. The root node (denoted as ?) aggregates allvalues for attribute ti. A taxonomy , is a forest ofgeneralization hierarchies. contains one generalizationhierarchy GHi 2 for each attribute ti in T .

    Taxonomies may be either analyst-provided or inferred

    by means of ad hoc algorithms (e.g., [10], [14]). Fig. 2 reports

    three examples of generalization hierarchies constructed

    over the attributes of the example timestamped datasets

    reported in Tables 1a and 1b, exception for the time stamp

    attribute (i.e., the date) which, by construction, is not

    considered. According to Definition 3.2, the set of all the

    generalization hierarchies in Fig. 2 is a taxonomy.A pair (ti; aggregation valuei), where aggregation valuei

    is a nonleaf node belonging to the generalization hierarchy

    associated with the ith attribute ti, is a generalized item. In

    the following I denote as leavesaggregation valuei ithe set of leaf nodes descendants of value aggregation valueiin an arbitrary generalization hierarchy GHi associated with

    the ith data attribute. For instance, in the generalization

    hierarchy GHlocation shown in Fig. 2c, (location, Italy) is an

    example of generalized item whose set of leaves includes

    items (location, Rome) and (location, Turin). Most relevant

    quality indexes that characterize a generalized item in terms

    of, respectively, a timestamped dataset (cf. Definition 3.1)

    and a taxonomy (cf. Definition 3.2) are its support and

    generalization level. The support of ti; aggregation valueiis defined as the sum of the observed frequencies of the

    corresponding leaves leavesaggregation valuei in thesource dataset. For instance, the support of the generalized

    item (location, Italy) in D1 is25 as it covers 2 out of 5 records

    in D1 (see Table 1a). The generalization level classifies items

    based on their generalization grade in the corresponding

    taxonomy. The item level Lti; aggregation valuei isdefined as the height of the subtree nested in GHi and

    rooted in aggregation valuei plus 1, i.e., it is the length of the

    longest downward path in GHi to a leaf from that node plus

    1. Consider again the generalization hierarchy GHlocation on

    the product sale location shown in Fig. 2b. The level of the

    generalized item (location, Italy) is 2 provided that the

    544 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013

    Fig. 2. Taxonomy over data items in D1 and D2.

  • longest path on GHlocation from Italy to a leaf node haslength 1.

    In the context of relational data, a (not generalized)k-itemset, i.e., an itemset of length k, is a set of k items, eachone belonging to a distinct data attribute. Generalizeditemsets represent high-level recurrences hidden in theanalyzed data. A generalized itemset of length k (i.e., ageneralized k-itemset) is a set of generalized or notgeneralized items including at least one generalized item.A more formal definition follows.

    Definition 3.3 (Generalized itemset). Let T ft1; t2; . . . ;tng be a set of attributes belonging to a timestamped structureddataset and I the enumeration of all the items in thecorresponding dataset. Let be a taxonomy over T , and Ethe set of generalized items derived by all the generalizationhierarchies in . A generalized itemset Y is a subset of I S Eincluding at least one generalized item in E. Each attributeti 2 T may occur at most once in Y .

    For instance, f(product, T-shirt), (location, Italy)g is a general-ized two-itemset, while f(product, T-shirt), (product, Jacket)gdoes not.

    To characterize generalized and not generalized itemsets,I extend previously discussed notions of item support andlevel to itemsets. To define the (generalized) itemsetsupport, I first introduce the concept of (generalized)itemset matching and coverage set. A (generalized) itemsetmatches a given record if all its items either 1) belong to therecord, or 2) are ancestors of items that belong to the record.

    Definition 3.4 ((Generalized) itemset matching). Let D be atimestamped structured dataset and a taxonomy built on D.A (generalized) itemsetX matches an arbitrary record r 2 D ifand only if for all (possibly generalized) items x 2 X

    1. x 2 I (i.e., x is an item) and x 2 r, or2. x 2 E (i.e., x is a generalized item) and 9 i 2

    leavesx such that i 2 r.The coverage set covX;D of a (generalized) itemsetX is

    given by the set of records in D matched by X [22].

    Definition 3.5 ((Generalized) itemset support). Let D be atimestamped structured dataset and a taxonomy on D. Thesupport of a generalized or not generalized itemset X is givenby

    jcovX;DjjDj :

    Definition 3.6 ((Generalized) Itemset Level). Let X ft1; value1; . . . ; tk; valuekg be a (generalized) itemset oflength k. Its level LX is the maximum item generalizationlevel by considering items in X, i.e.,

    LX max1jkfLtj; valuejg:

    The generalization level of an itemset is affected by thehighest generalized item level. Consider again the general-ized itemset f(product, T-shirt), (location, Italy)g. Its supportis 25 in both datasetsD1 andD2 (See Tables 1a and 1b). Since,

    according to the taxonomy reported in Fig. 2,Lproduct; T Shirt 1 and Llocation; Italy 2 itsgeneralization level is 2.

    A generalized itemset may be an ancestor or adescendant of another one, according to a given taxonomy,as stated by the following definition.

    Definition 3.7 ((Generalized) itemset ancestor and des-cendant). Let X;Y I be two different (generalized)k-itemsets and be a taxonomy. X is an ancestor of Y on, denoted as X 2 AncY ;, if and only if for any (general-ized) item yi 2 Y exists a (generalized) item xi 2 X such thatxi is an ancestor of yi in GHi. If X is an ancestor of Y , then Yis a descendant of X, denoted as Y 2 DescX;.

    Consider again the generalized itemset I f(product,T-shirt), (location, Italy)g. Since, according to the taxonomyreported in Fig. 2, Italy is an ancestor of Turin; f(product,T-shirt), (location, Turin)g is an example of descendant of I.

    The redundancy of an ancestor with respect to one ofits descendants is defined in terms of their respectivecoverage sets.

    Definition 3.8 (Ancestor redundancy). Let D be a structureddataset and a taxonomy. Let X;Y I be two (generalized)k-itemsets such that X 2 AncY ;. The redundancyredX;Y of X with respect to Y is given by jcovX;DjjcovY ;Dj .

    Theorem 3.1. The ancestors of a reference (generalized) itemsetwith minimal redundancy are the ones with minimal support.

    Proof. Suppose that X1, X2 2 AncY ; and redX1; Y >redX2; Y . Since jcovX1; Dj > jcovX2; Dj it followsthat supX1; D > supX2; D. tu

    4 PROBLEM STATEMENT

    This section formalizes the concepts of HIstory GENeralizedPattern and Nonredundant HIstory GENeralized Pattern(NONREDUNDANT HIGEN) as well as formally states themining problem addressed by the paper.

    The frequent (generalized) itemset mining problem [3]focuses on discovering, from the analyzed data, generalizedand not generalized itemsets (cf. Definition 3.3) whosesupport value is equal to or exceeds a minimum supportthreshold. Given an ordered sequence of timestampedstructured datasets (cf. Definition 3.1) relative to differenttime periods, a taxonomy (cf. Definition 3.2), and aminimum support threshold, dynamic change mining [2],in the context of frequent itemsets, investigates the changesand the evolution of the extracted itemsets, in terms of theirmain quality indexes (e.g., the support), from one timeperiod to another. This paper extends the dynamic changemining problem, in the context of frequent itemsets, byexploiting frequent generalized itemsets to representknowledge associated with infrequent patterns. In thefollowing, I introduce the concepts of HIGEN and NON-REDUNDANT HIGEN.

    4.1 The HIGEN

    A HIGEN, associated with both a not generalized itemset itand an ordered sequence of timestamped datasetsD fD1; D2; . . . ; Dng, is a ordered sequence of generalizeditemsets g1; g2; . . . ; gn that represents the evolution of theknowledge covered by it in D. Each gi is a frequent

    CAGLIERO: DISCOVERING TEMPORAL CHANGE PATTERNS IN THE PRESENCE OF TAXONOMIES 545

  • (generalized) itemset extracted fromDi. gimay be either 1) it,in case it is frequent in Di with respect to the minimumsupport threshold, or 2) one of the frequent generalizationsof it in Di characterized by minimum generalization levelotherwise. A more formal definition of HIGEN follows.

    Definition 4.1 (HIGEN). Let D fD1; D2; . . . ; Dng be anordered sequence of timestamped structured datasets and bea taxonomy built over data items in Di 2 D8 i. Let it be a notgeneralized itemset and min sup be a minimum supportthreshold. A HIGEN HGit, associated with it, is an orderedsequence of itemsets or generalized itemsets g1; g2; . . . ; gnsuch that

    . if supit;Di min sup then gi it,

    . else gi git, where git is an ancestor of it, with respectto , frequent in Di and characterized by a minimumgeneralization level among the set of frequent ancestorsof it, i.e., git 2 Ancit; ^ supgit;Di min supand 6 9 git2 2 Ancit; such that Lgit; > Lgit2 ; and supgit2; Di min sup.

    A HIGEN HGit may be represented as a sequence

    g1 R1 g2 R2 . . . gn1 Rn1 gn

    where Ri is a relationship holding between (generalized)itemsets gi and gi1 and may be represented as % if gi isa descendant of gi1;& if gi is an ancestor of gi1 or e> ifLgi; Lgi1;.

    Table 2b reports the set of HIGENs mined from thetimestamped datasets D1 and D2 (See Tables 1a and 1b) byenforcing an absolute minimum support threshold equal to2 and by exploiting the taxonomy reported in Fig. 2. Forinstance, {(location, Paris)} % {(location, France)} is a HIGENassociated with itemset {(location, Paris)} given that {(loca-tion, Paris)} is a descendant of {(location, France)} and it isfrequent in D1 but infrequent in D2. Consider now the timeattribute in D1 and D2 (see Tables 1a and 1b) and thecorresponding generalization hierarchy (see Fig. 2a). f(time,from 5 p.m. to 6 p.m.)g & f(time, 5 p.m.)g is a HIGENassociated with itemset f(time, 5 p.m.)g while f(time,p.m.)g & f(time, 5 p.m.)g does not as it exists an ancestorof f(time, 5 p.m.)g, i.e., f(time, from 5 p.m. to 6 p.m.)g, that isfrequent in D1 and such that L[f(time, from 5 p.m. to6 p.m.)g;] < Lf(time, p.m.)g;].4.2 The NONREDUNDANT HIGen

    Since a not generalized itemset it may have severalancestors characterized by the same generalization level, ittrivially follows that an itemset may have many associatedHIGENs. However, analysts may prefer to look into a moreconcise set of temporal change patterns instead of the wholeHIGEN set. Among the possible generalizations of aninfrequent itemset belonging to the minimal abstractionlevel, the ones with minimal support are the generalizationsthat cover its knowledge with the minimal amount ofredundancy (cf. Theorem 3.1).

    To reduce the number of considered patterns, the notionof NONREDUNDANT HIGEN is introduced. Among the setof HIGENs relative to the same generalized itemset it,the NONREDUNDANT HIGENs are the HIGENs that includegeneralized itemsets with minimal redundancy, i.e., the

    ones with minimal support, in case the reference itemsetbecomes infrequent in a certain time period.

    Definition 4.2 (NONREDUNDANT HIGEN). Let D be anordered sequence of timestamped structured datasets and a

    taxonomy built over data items in Di 2 D 8 i. A NONRE-DUNDANT HIGEN SHGit, associated with it, is a HIGEN

    composed of an ordered sequence of itemsets or generalized

    itemsets g1; g2; . . . ; gn such that for each generalized itemset giin SHGit its redundancy with respect to it is minimal, i.e., it

    does not exist any frequent ancestor gi of it such thatsupgi ; Di < supgi;Di.

    From the above definition, it trivially follows that thenumber of NONREDUNDANT HIGENs associated with eachnot generalized itemset it is lower than the number of the

    corresponding HIGENs. Consider, for instance, an item(time, 5:35 p.m.) and two of its possible generalizations (time,from 5 p.m. to 6 p.m.) and (time, from 5:30 p.m. to 6:30 p.m.)belonging to the same generalization level. Suppose that(time, 5:35 p.m.) % (time, from 5 p.m. to 6 p.m.) and (time,5:35 p.m.) % (time, from 5:30 p.m. to 6:30 p.m.) are bothHIGENs, according to Definition 4.1. Among them, theanalyst may prefer the one that covers the same knowledgesupported by the item (time, 5:35 p.m.) with a minimalamount of redundancy. If, for instance, the support of (time,from 5 p.m. to 6 p.m.) is strictly lower than the support of

    (time, from 5:30 p.m. to 6:30 p.m.) in the second time periodthe former HIGEN is selected as NON-REDUNDANT HIGEN.An empirical evaluation, performed on synthetic datasets,of the average number of frequent generalizations, withminimum generalization level and support value, of eachinfrequent itemset is reported in Section 6.3.

    4.3 Problem Statement

    Given an ordered set of timestamped structured datasets, ataxonomy , and a minimum support threshold min sup,this paper addresses the problem of mining all HIGENs,

    according to Definitions 4.1.To efficiently accomplish the HIGEN extraction task, in

    the next section 1 present the HIstory GENeralized PatternMINER. The main HIGEN MINER algorithm modificationsneeded to address the NONREDUNDANT HIGEN extraction(cf. Definition 4.2) are discussed as well.

    5 The HIGEN MINER ALGORITHM

    The HIstory GENeralized Pattern MINER algorithm ad-dresses the extraction of the HIGENs, according toDefinition 4.1.

    HIGEN mining may be addressed by means of apostprocessing step after performing the traditional general-

    ized itemset mining step [3], constrained by the minimumsupport threshold and driven by the input taxonomy, fromeach timestamped dataset. However, this approach maybecome computationally expensive, especially at lowersupport thresholds, as it requires 1) generating all thepossible item combinations by exhaustively evaluating the

    taxonomy, 2) performing multiple taxonomy evaluationsover the same pattern mined several times from different

    546 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013

  • time periods, and 3) selecting HIGENs by means of a,possibly time-consuming, postprocessing step.

    To address the above issues, I propose a more efficientalgorithm, called HIGEN MINER. It introduces the followingexpedients: 1) to avoid generating all the possible combina-tions, it adopts, similarly to [6], an Apriori-based support-driven generalized itemset mining approach, in which thegeneralization procedure is triggered on infrequent itemsetsonly. Unlike [6], the generalization process does notgenerate all possible ancestors of an infrequent itemset atany abstraction level, but it stops at the generalization levelin which at least a frequent ancestor occurs, 2) to preventmultiple taxonomy evaluations over the same pattern, thegeneralization process of each itemset is postponed after itssupport evaluation in all timestamped datasets and itera-tively applied on infrequent generalized itemsets ofincreasing generalization level, and 3) to reduce theextraction time, HIGEN generation is performed on thefly, without the need of an ad hoc postprocessing step.Furthermore, a slightly modified version of the HIGENMINER algorithm is proposed to address NONREDUNDANTHIGEN extraction. A description of the main algorithmmodifications is given in Section 5.2.

    As a drawback, the HIGEN MINER algorithm automati-cally selects the subset of generalized itemsets of interest atthe cost of a higher number of dataset scans with respect toa traditional Apriori-based miner. Consider a timestampeddatasets of n attributes and a taxonomy of height Hmax, theHIGEN MINER algorithm requires up to Hmax n datasetscans, while a traditional Apriori-like miner requires up ton dataset scans. However, traditional mining approachesstill require a postprocessing step to select the HIGENs ofinterest (cf. Definition 4.1). The experimental evaluation,reported in Section 6, shows that the HIGEN MINERalgorithm yields good performance, in terms of bothpattern pruning selectivity, with respect to previousgeneralized itemset mining approaches (i.e., [3], [6]), andexecution time. In Section 5.1, a pseudocode of the HIGENMINER is reported and thoroughly described while, inSection 5.2, the selection and categorization of the dis-covered HIGENs is addressed.

    5.1 The HIGEN MINER Algorithm Pseudocode

    Algorithm 1 reports the pseudocode of the HIGEN MINER.The HIGEN MINER algorithm iteratively extracts frequentgeneralized itemsets of increasing length from each time-stamped dataset by following an Apriori-based level-wiseapproach and directly includes them into the HIGEN set. Atan arbitrary iteration k, HIGEN MINER performs thefollowing three steps: 1) k-itemset generation from eachtimestamped dataset in D (line 3), 2) support counting andgeneralization of infrequent (generalized) k-itemsets ofincreasing generalization level (lines 6-37), and 3) genera-tion of candidate itemsets of length k 1 by joining k-itemsets and infrequent candidate pruning (line 39). SinceHIGEN MINER discovers HIGENs from relational datasets itadopts the well-known strategy to consider the structure ofthe input datasets to avoid generating combinations that donot comply with the relational data format. After beinggenerated, frequent itemsets of length k are added to thecorresponding HIGENs in the HG set (line 9), while

    infrequent ones are generalized by means of the taxonomyevaluation procedure (line 17). Given an infrequent itemsetc of level l and a taxonomy , the taxonomy evaluationprocedure generates a set of generalized itemsets of levell 1 by applying, on each item tj; valuej of c, thecorresponding generalization hierarchy GHj 2 (see Defi-nition 3.2). All the itemsets obtained by replacing one ormore items in c with their generalized versions of level l 1are generated and included into the Gen set (line 21).Finally, generalized itemset supports are computed byperforming a dataset scan (line 26). Frequent general-izations of an infrequent candidate c, characterized by levell 1, are first added to the corresponding HIGEN set andthen removed from the Gen set when their lower levelinfrequent descendants in each time period have been fullycovered (lines 27-32). In such a way, their further general-izations at higher abstraction levels are prevented. Noticethat the taxonomy evaluation over an arbitrary candidate oflength k is postponed when the support of all candidates oflength k and generalization level l in each timestampeddataset is available. This approach allows triggering thegeneralization on distinct candidates of level l that areinfrequent in at least one timestamped dataset in D. Thesequence of support values of an itemset that is infrequentin a given time period is store and reported provided that1) it has at least a frequent generalization in the same timeperiod, and 2) it is frequent in at least one of the remainingtime periods. The generalization procedure stops, at acertain level, when the Gen set is empty, i.e., when eitherthe taxonomy evaluation procedure does not generate anynew generalization or all the considered generalizations arefrequent in each time period and, thus, have been pruned(line 30) to prevent further knowledge aggregations.

    Algorithm 1. HIGEN MINER: HIstory GENeralized Pattern

    MINER

    Input: ordered set of timestamped structured datasets

    D fD1; D2; . . . ; Dng, minimum support thresholdmin_sup, taxonomy Output: set of HIGENs HG

    1: k 1 // Candidate length2: HG ;3: Ck set of distinct k-itemsets in D4: repeat

    5: for all c 2 Ck do6: scan Di and count the support sup(c,Di)

    8 Di 2 D7: end for

    8: Lik fitemsets c 2 Ckj sup(c,Di) min_sup forsome Di 2 D}

    9: HG update_HIGEN_set(Lik;HG)10: l 1 // Candidate generalization level11: Gen ; // generalized itemset container12: repeat

    13: for all c in Ck of level l do14: Dinfc fDi 2 D jsupc;Di < min_supg

    // datasets for which c is infrequent

    15: if Dinfc 6 ; then16: gencset of new generalizations of

    itemset c of level l 1

    CAGLIERO: DISCOVERING TEMPORAL CHANGE PATTERNS IN THE PRESENCE OF TAXONOMIES 547

  • 17: genctaxonomy_evaluation(; c)18: for all gen 2 genc do19: gen.Desc c // Generalized itemset

    descendant

    20: end for

    21: Gen Gen [ genc22: end if

    23: end for

    24: if Gen 6 ; then25: for all gen 2 Gen do26: scan Di and count the support of

    gen 8 Di 2 Dinfgen:Desc27: for all genj sup(gen,Di) min_sup for some

    Di 2 Dinfgen:Desc do28: HG update_HIGEN_set(gen;HG)29: if sup(gen,Di) min_sup for all

    Di 2 Dinfgen:Desc then30: remove gen from Gen

    31: end if

    32: end for

    33: end for

    34: Ck Ck [Gen35: end if

    36: l l + 137: until Gen ;38: k k 139: Ck1 candidate_generation(

    Si C

    ik)

    40: until Ck ;41: return HG

    The HIGEN MINER algorithm ends the mining loopwhen the set of candidate itemsets is empty (line 40).

    5.2 HIGEN Categorization and Selection

    Domain experts are commonly in charge of looking into thediscovered temporal change patterns to highlight mostnotable trends. To ease the domain expert validation task, apreliminary analysis of the set of extracted HIGENs may1) categorize them based on their time-related trends, or2) prune them to select a subset of interest, i.e., theNONREDUNDANT HIGENs.

    HIGEN categorization. To better highlight HIGEN sig-nificance, HIGENs are categorized, based on their time-related trend, in: 1) Stable HIGENs, i.e., HIGENs that includegeneralized itemsets belonging to the same generalizationlevel, 2) monotonous HIGENs, i.e., HIGENs that include asequence of generalized itemsets whose generalization levelshows a monotonous trend, and 3) oscillatory HIGENs, i.e.,HIGENs that include a sequence of generalized itemsetswhose generalization level shows a variable and nonmo-notonous trend. Since, according to Definition 3.6, ageneralized itemset of level l may have several general-izations of level l 1 and taxonomies may have unbalanceddata item distributions, stable HIGENs may be furtherpartitioned in: 1) Strongly stable HIGENs, i.e., stable HIGENs,in which items, appearing in its generalized itemsets andbelonging to same data attribute, are characterized by thesame generalization level, and 2) Weakly stable HIGENs, i.e.,stable HIGENs in which items, appearing in its generalizeditemsets and belonging to the same attribute, may becharacterized by different generalization levels. Examples

    of HIGENs, extracted from a real-life dataset, belonging toeach category are reported in Section 6.

    NONREDUNDANT HIGEN selection. To reduce theamount of generated patterns, analysts may focus onthe set of NONREDUNDANT HIGENs (cf. Definition 4.2).Since they represent the subset of HIGENs whose general-izations cover the same knowledge covered by the referenceinfrequent itemset with minimal redundancy, they maybe deemed relevant by domain experts. Notice that theextraction of the NONREDUNDANT HIGENs may be accom-plished by slightly modifying the HIGEN MINER algorithm(see Algorithm 1). A sketch of the main algorithm modifica-tions follows. At each algorithm iteration, once the list ofGen set is populated, the set of infrequent descendants ofpatterns in Gen is visited. A nested loop iterates on Gen inorder of increasing support and selects its generalizationswith minimal support. Finally, the set HG of discoveredtemporal change patterns is updated accordingly.

    6 EXPERIMENTAL RESULTS

    I evaluated the HIstory GENeralized Pattern MINER algo-rithm by means of a large set of experiments addressing thefollowing issues:

    1. the usefulness and the characteristics of the HIGENsmined from a real-life dataset (Section 6.1),

    2. the pruning selectivity, in terms of the number ofgeneralized itemsets extracted by HIGEN MINERwith respect to previous generalized itemset miningalgorithms, i.e., [3], [6] (Section 6.2),

    3. the performance, in terms of the number of extractedtemporal patterns, of the HIGEN MINER algorithm(Section 6.3), and

    4. the scalability, in terms of execution time, of theproposed approach (Section 6.4).

    All the experiments were performed on a 3.2 GHz PentiumIV system with 2 GB RAM, running Ubuntu 8.04. TheHIGEN MINER algorithm was implemented in the Pythonprogramming language [21].

    6.1 Domain Expert Validation

    To validate the significance of the proposed approach, acampaign of experiments has been conducted on a real-lifedataset coming from a context-aware mobile system.The extracted HIGENs have been first processed, based onthe procedure described in Section 5.2. Next, they have beenanalyzed by a domain expert to assess their usefulness inthe context of mobile user and service profiling.

    This Section is organized as follows. Section 6.1.1 providesa brief description of the adopted dataset. Section 6.1.2reports a selection of the discovered HIGENs while Section6.1.3 describes the characteristics of the discovered HIGENsbased on the proposed categorization (see Section 5.2).

    6.1.1 Real Dataset

    Areal dataset, calledmDesktop, is providedby a research hub,namely the Telecom Italia Lab. It collects contextualinformation about user application requests submitted,throughmobile devices, over the time period of threemonths(i.e., from August to October). From the whole context datacollection, I generated three different timestamped datasets,

    548 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013

  • each one corresponding to a 1-month time period. Regardingprivacy concerns related to real context data usage, pleasenotice that 1) experimental data were collected fromvoluntary users that provide their whole informed consenton personal data treatment for this research project, and2) real user names were hidden throughout the paper topreserve identities. A more detailed description of theadopted data collection follows.

    mDesktop dataset. The Telecom Italia mobile desktopapplication provides a wide range of services to usersthrough mobile devices. Some of them provide recommen-dations to users on restaurants, museums, movies, and otherentertainment activities. For instance, they allow end users torequest for a recommendation (GET REC service), enter ascore (VOTE service), or update a score for an entertainmentcenter (UPDATE service). Other services allow users toupload files, photos, or videos and share themwith the otherusers. The dataset includes 1,197 user requests concerningfile sharing and uploading, 5,814 records concerning userrecommendations, and 4,487 records regarding other kindsof services (e.g., weather forecasts, SMS, and call requests).To perform HIGEN mining, a taxonomy including thefollowing generalization hierarchies has been defined.

    . time stamp ! hour ! time slot (two-hour timeslots)! time slot (six-hour time slots)! day period(AM/PM).

    . service ! service category.

    . latitude:longitude ! city ! country.

    . phone number ! call type (PERSONAL/BUSI-NESS).

    In particular, service characterization is performed byexploiting the generalization hierarchy over the Serviceattribute domain reported in Fig. 3.

    I also considered 1) different time periods, and 2) dif-ferent generalization hierarchies for the time stamp attribute.In particular, I also considered 1-week time periods and4-hour and 8-hour time slots. Some remarkable examples ofHIGENs mined by using these configurations are reportedin Section 6.1.2.

    6.1.2 Discovered Pattern Analysis

    In the following, a selection of the discovered HIGENs isreported. Furthermore, possible scenarios of usage for theselected patterns suggested by the expert are discussed. Thereported HIGENs are classified based on the categorizationpresented in Section 5.2. Note that all the reported HIGENs,selected by the expert as most notable patterns, are alsoNONREDUNDANT HIGENs. Indeed, the pruning of the notNONREDUNDANT HIGENs did not relevantly affect theeffectiveness of the knowledge discovery process.

    Examples of stable. HIGENs Stable HIGENs are HIGENswhose generalized itemsets have the generalization level incommon. Table 3 reports two examples of extractedstrongly stable HIGENs, namely the HIGENs 1 and 2, andone example of weakly stable HIGEN, namely the HIGEN 3,generated by enforcing a minimum support thresholdmin sup 0:001%.

    Both the reported strongly stable HIGENs concern theusage time of service Twitter, which provides mobile accessto a famous online community. Since in HIGENs 1 and 2 all

    the items belonging to attribute Service are characterized bygeneralization level 1, while the ones belonging to attributeTime belong, respectively, to levels 2 and 3, the consideredHIGENs are categorized as strongly stable HIGENs. Thefirst strongly stable HIGEN states that requests for Twitterare frequently submitted between 9 p.m. and 10 p.m.within each considered month. The second one states asimilar recurrence at a higher temporal abstraction level,i.e., between 7 p.m. and 13 p.m. Their joint extractionimplies that service Twitter is frequently requested from9 p.m. to 10 p.m. within each month, but exists at least onehourly time slot between 7 p.m. and 13 p.m., different from[9 p.m., 10 p.m.], such that is infrequent within eachconsidered month. I also analyzed the recurrences inservice Twitter usage mined from shorter time periods.Even by considering weekly time periods, analogoustrends are shown. Differently, the third stable HIGENconcerns the usage time of the whole application serviceby a user called John. Although, according to Definition 3.6,all its generalized itemsets have the same generalizationlevel (i.e., 3), it is categorized as weakly stable HIGENprovided that the level of the items belonging to attributeUser changes from month to month. This means that theinformation stored in the generalized itemsets changes itsabstraction level, with respect to the input taxonomy, for acertain extent. However, since the overall itemset general-ization level is affected by the mostly generalized attributevalues (i.e., values associated with attribute Time), whoselevel does not vary over time, it still retains a certain degreeof stability.

    The information provided by stable HIGENs may bedeemed relevant for service provisioning. For instance, theanalyst may profitably exploit this knowledge for suitingbandwidth shaping to the actual user needs. In particular, hemay either allocate dedicated bandwidth to frequentlyrequested services at specific time slots or use free

    CAGLIERO: DISCOVERING TEMPORAL CHANGE PATTERNS IN THE PRESENCE OF TAXONOMIES 549

    Fig. 3. Generalization hierarchy over the Service attribute.

  • bandwidth at less congested time slots for network monitor-ing purposes.

    Examples of monotonous. HIGENs Monotonous HIGEN sare HIGENs whose sequence of generalized itemsets ischaracterized by generalization levels with monotonoustrend due to a steadfast and significant decrease/increase ofthe itemset support values, with respect to the supportthreshold, in consecutive time periods. Consider themonotonous HIGENs reported in Table 3, which have beengenerated by enforcing a minimum support thresholdmin sup 0:001%.

    Consider service Home W belonging to categoryHome Services, which is targeted to provide facilities tousers who are at home. The first monotonous HIGEN (i.e.,HIGEN 4) shows a recurrence in user John mDesktopapplication service usage. Requests for service Home Ware frequently submitted in August, while become infre-quent in both September (sup 0:0001%) and October(sup 0:0008%). Nevertheless, the corresponding categoryHome Services remains frequent in all the consideredmonths. Readers can notice that weak monotonicity holdsas the corresponding itemset generalization level increasesfrom August to September while remains constant fromSeptember to October. It is worth mentioning that, byconsidering weekly time periods separately within eachmonth, the reference itemset {Service;Home W, (User,John)} generates a strongly stable HIGEN over August, whileit generates oscillatory HIGENs in the next months. Differ-ently, themonotonousHIGEN5 highlights an opposite trend.It regards service Trains, which belongs to service categoryTravels and provides information on rail transports. User

    John appears less interested in service Trains in August(sup 0:0002%), possibly due to holiday vacations, while theservice increases its attractiveness in the following months.

    The above information is deemed relevant by domainexpert for both user and service profiling. The analyst could1) shape service bandwidth depending on the time period,2) personalize user John service promotions, and 3) planlong-term promotions to counteract service usage decreaseat certain times of the year. The extracted information maybe also used for cross-selling purposes, i.e., by suggestingothers services belonging to category Home Services (e.g.,serviceWeather) to either increase user interest in the servicecategory or counteract user migration from occasional useservices (i.e., classes of services, such as Travels, on whichusers focus their interest for short time periods) tocommonly used services (e.g., service Home W ).

    Examples of oscillatory. HIGENs Oscillatory HIGENs areHIGENs whose sequence of generalized itemsets is char-acterized by generalization levels with variable and non-monotonous trend. Consider, for instance, the oscillatoryHIGENs, reported in Table 3, mined by enforcing aminimum support threshold equal to 0.01 percent.

    The first oscillatory HIGEN (i.e., HIGEN 6) concernsservice category Travels, which provides to users informa-tion about travels, and its service Maps, which allowsbrowsing of geographical maps. User Terry seems inter-ested in services belonging to category Travels in eachconsidered month. Nevertheless, in September, he focusedhis interest in service Maps, possibly due to work travels.Differently, the second oscillatory HIGEN (i.e., HIGEN 7)provides information about the usage time of service MMS.

    550 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013

    TABLE 3mDesktop Dataset

    Examples of HIGENs. min sup 0:001%.

  • In August and October, the service MMS is frequentlyrequested between 13 p.m. and 14 p.m., while in Septemberuser interest in service MMS in the same time slot decreasesbut still holds between 13 p.m. and 15 p.m. Notice that, byusing a different 2-hour time stamp (between 12 p.m. and14 p.m.) the temporal correlation does not hold anymore. Bylooking into the discovered oscillatory HIGENs, analystsmay, for instance, investigate the provision of occasionaluse services (e.g., service Maps) and perform the followingactions: 1) suggest complementary services (e.g., serviceFlightstats), 2) plan promotions targeted to specific userprofiles, and 3) discover mostly used service parameters toautomatically suggest (e.g., frequently requested GPScoordinates for service Maps).

    6.1.3 Characteristics of the Discovered HIGENs

    The expert also analyzes the frequency of stable, mono-tonous, and oscillatory HIGENs inherent in a certain servicecategory. To address this issue, he adopts the followingthree-step approach: 1) HIGEN mining from mDesktop bymeans of the HIGEN MINER algorithm, by enforcing aminimum support threshold min sup 0:001%, 2) selectionof the HIGENs whose generalized itemsets include itemsbelonging to the Service attribute (i.e., service description),and 3) HIGEN categorization based on the service categoriesreported in Fig. 3. For instance, the monotonous HIGENfService; SMSg % fService; Communicationg would beassociated with category Communication. Table 4 reportsthe distribution of the HIGENs within the service categories.

    Categories that include commonly used services(e.g., category Communication) are mainly represented bystable HIGENs (e.g., 52 percent of the HIGENs belonging toCommunication are stable against 22 percent of mono-tonous HIGENs and 26 percent of oscillatory HIGENs),while categories that group occasional use services aremostly covered by monotonous or oscillatory HIGENs. Asignificant percentage of strongly stable HIGENs charac-terizes service categories with a great regularity in theircontext of use (e.g., category Home Services). For theseservices, analysts may design long-term resource andpromotion plans. Service categories with a significantpercentage of weakly stable or monotonous HIGENs(e.g., Entertainment and Friends) are usually suitable formedium-term plans, as they show periodical (e.g., seasonal)variations of their context of usage. Finally, oscillatoryHIGENs are often characterized by nondeterministic beha-viors. For instance, notice that service category Travelsincludes a significant percentage of oscillatory HIGENs as

    its usage changes significantly and unexpectedly frommonth to month.

    6.2 HIGEN MINER Pruning Selectivity

    I evaluated the pruning selectivity of the HIGEN MINERgeneralization procedure on synthetic datasets. To this aim,I compared the number of frequent itemset and generalizeditemsets mined from each generated timestamped datasetwith that extracted by the following generalized frequentitemset mining algorithms: 1) a traditional algorithm,i.e., Cumulate [3], which performs an exhaustive taxonomyevaluation by generating all possible frequent combinationsof generalized and not generalized itemsets, and 2) arecently proposed support-driven approach to itemsetgeneralization, i.e., GENIO[6], which generates a generalizeditemset only if it has at least an infrequent descendant (cf.Definition 3.7). The set of experiments was performed onsynthetic datasets generated by means of the TPC-Hgenerator [26], whose main configuration settings follow.

    6.2.1 Synthetic Datasets

    The TPC-H data generator [26] consists of a suite ofbusiness-oriented ad hoc database queries. By varying thescale factor parameter, files with different size could begenerated. Generalized itemsets have been mined from adataset generated from the lineitem table by setting a scalefactor equal to 0.075 (i.e., around 450,000 records).Hierarchies on line item categorical attributes have beengenerated by using the part, nation, and region tables. Togenerate generalization hierarchies over the continuousattributes, I applied several data equidepth discretizationsteps of finer granularities. The finest discretized values areconsidered as data item values and, thus, become thetaxonomy leaves, while the coarser discretization proce-dures are exploited to aggregate the corresponding lowerlevel values. More specifically, pairs of consecutivediscretized values are aggregated in higher level items.To partition the whole dataset in three distinct time-relateddata collections I queried the source data by enforcingdifferent constraints on the shipping date value (attributeShipDate). More specifically, I partitioned line itemsshipped in the three following time periods: [1992-01-01,1994-02-31], [1994-03-01, 1996-05-31] [1996-06-01, 1998-12-01]. For the sake of brevity, I will denote the correspondingdatasets as data-1, data-2, and data-3 in the rest of thissection.

    6.2.2 Performance Comparison

    Since the enforcement of the minimum support thresholdduring the itemset mining step significantly affects thenumber of extracted itemsets, I performed different miningsessions, for all combinations of algorithms and datasets, byvarying the minimum support threshold value. In Figs. 4, 5,and 6, I plotted the number of itemsets mined, respectively,from data-1, data-2, and data-3. To test the HIGEN MINERalgorithm, I considered the generated datasets in order ofincreasing shipment date interval. To better highlight thepruning selectivity on the cardinality of the mined general-ized itemsets, I distinguished between generalized itemsetsand not.

    CAGLIERO: DISCOVERING TEMPORAL CHANGE PATTERNS IN THE PRESENCE OF TAXONOMIES 551

    TABLE 4mDesktop Dataset

    HIGEN distribution. min sup 0:001%.

  • For all the tested algorithms and datasets and for most ofthe settings of minimum support value, the percentage ofextracted frequent itemsets including at least one general-ized item is significant (i.e., higher than 60 percent). Forboth Cumulate and GENIO, the number of mined (general-ized) itemsets significantly increases when lower minimumsupport values are enforced (e.g., 1 percent). Thus, theanalysis of the temporal evolution of the extracted itemsetsmay require a computationally expensive postpruning step.In GENIO, infrequent items are aggregated during theextraction process and all frequent ancestors of an infre-quent itemset are extracted. However, most of the extractedgeneralization seem redundant as they represent the sameinfrequent knowledge at increasing abstraction levels. Theless redundant generalizations could be selected as the lessredundant frequent representatives. By following thisapproach, the HIGEN MINER algorithm selects just thefrequent ancestors with minimal generalization level(i.e., with minimal redundancy). The pruning selectivity,in terms of the number of extracted generalized itemsets,achieved by HIGEN MINER within each time periodappears more evident when lower support thresholds

    (e.g., 2 percent) are enforced (see Figs. 4c, 5c, and 6c). Infact, when high support thresholds (e.g., 7 percent) areenforced, most of the frequent generalizations have alreadya high generalization level and, thus, the pruning effective-ness is less evident. Oppositely, when lower supportthresholds are enforced (e.g., 2 percent), the extraction ofa significant amount of redundant generalized itemsets,generated by both Cumulate and GENIO, is prevented. Thepruning rate on the number of mined higher level (general-ized) itemsets is between 5 and 15 percent with respect toGENIO for any support threshold value.

    6.3 HIGEN MINER Performance Analysis

    I analyzed the performance of the HIGEN MINER algorithm,in terms of the number extracted HIGENs and NONREDUN-DANT HIGENs, by analyzing the impact of the followingfactors: 1) minimum support threshold, 2) number of timeperiods, and 3) taxonomy characteristics. A set of experi-ments was performed on synthetic datasets generated bymeans of the TPC-H generator [26]. The baseline configura-tion used for data generation is similar to that described inSection 6.1.1.

    552 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013

    Fig. 5. Generalized and not generalized itemsets extracted from data-2.

    Fig. 6. Generalized and not generalized itemsets extracted from data-3.

    Fig. 4. Generalized and not generalized itemsets extracted from data-1.

  • 6.3.1 Impact of the support threshold

    I analyzed, on synthetic datasets, the impact of theminimum support threshold on the number of discoveredHIGENs and NON-REDUNDANT HIGENs. In Fig. 7a, Ireport the number of HIGENs mined from data-1, data-2,and data-3 by varying the minimum support threshold. Totest the HIGEN MINER algorithm, datasets are consideredin order of increasing shipment date interval. To betterhighlight the pruning selectivity yielded by miningNONREDUNDANT HIGENs, I distinguished between NON-REDUNDANT HIGENs and not.

    As expected, the number of generated HIGENs increasesmore than linearly when the minimum support thresholddecreases. The selection of the NONREDUNDANT HIGENssignificantly reduces the number of generated temporalchange patterns. In fact, the number of possible general-izations with minimal generalization level of each infre-quent itemset significantly affects the cardinality of the setof mined HIGENs. Let min sup be a minimum supportthreshold. Let n be the number of considered time periods(i.e., the number of timestamped datasets) and let jfpigjbe the average number of not generalized patterns pi thatare frequent, with respect to min sup, in an arbitrary timeperiod at any abstraction level. Let avg level sharing be theaverage number of frequent generalizations, with minimumgeneralization level, of an infrequent pattern. An estimationof the number of HIGENs (jHiGenj), mined by enforcing asupport threshold min sup, is

    jHiGenj avg level sharing n 1 jfpigj: 1By setting n 3 and by approximating the fpig set

    cardinality in Formula 1 as the greatest common divisor ofthe number of generalized itemsets extracted from data-1,data-2, and data-3, for each support threshold (see Figs. 4c,5c, and 6c), I can compute the approximated value achievedby avg level sharing, for any support threshold, startingfrom the results reported in Fig. 7a. The average numberavg level sharing of frequent generalizations, with minimalgeneralization level, of each infrequent itemset turns out torange from 2 to 4 for any support threshold. The achievedresults are close to the expectations. By selecting theNONREDUNDANT HIGENs, the reduction, in terms of thenumber of HIGENs, is greater than 35 percent for anysupport threshold. The obtained results depend on theconsidered data item distribution.

    6.3.2 Impact of the Number of Time Periods

    I analyzed the impact of the number of considered timeperiods on the cardinality of the extracted patterns. To thisaim, I uniformly partitioned the ShipDate attribute domainof the lineitem table in an increasing number of intervals.Fig. 7b reports the number of mined HIGENs and NON-REDUNDANT HIGENs. The cardinality of the discoveredHIGENs grows when the number of time periods increasesdue to both the presence of new reference itemsets that arefrequent, at any abstraction level, in at least one time periodand the generation of multiple combinations of leastgeneralized itemsets. As expected, the increase is lessrelevant when considering just NONREDUNDANT HIGENs.

    CAGLIERO: DISCOVERING TEMPORAL CHANGE PATTERNS IN THE PRESENCE OF TAXONOMIES 553

    Fig. 7. HIGEN MINER: number of mined HIGENs and NON-REDUNDANT HIGENs. Default parameters: min sup 1 percent, Taxonomy height 3,Time periods 3.

  • 6.3.3 Impact of the Taxonomy Characteristics

    I also analyzed, on synthetic datasets, the impact of themain taxonomy characteristics on the number of extractedHIGENs and NON-REDUNDANT HIGENs. To evaluate theimpact of the taxonomy height, I generated ad hoctaxonomies of increasing height by varying the number ofdiscretization steps adopted for aggregating continuousdata attribute values. To also evaluate the impact ofdifferent multiple-level data item distributions, Figs. 7cand 7d report the number of extracted temporal patterns,when varying the taxonomy height, by applying tworepresentative discretization procedures, i.e., an equidepthprocedure [24] and an entropy-based one [13], on numericdata attributes. A similar trend is shown by both the testeditem distributions. By using a taxonomy composed of onlytwo levels many generalizations are still infrequent in sometime periods and, thus, the corresponding HIGENs are notextracted. When the taxonomy height increases from 2 to 4 asignificant number of reference infrequent itemsets be-comes frequent at a higher generalization level and, thus,the number of extracted combinations relevantly grows.Further taxonomy height increments do not produce anysignificant increase of the HIGEN cardinality. Since the datadistribution generated by the entropy-based discretizationis sparser than that generated by the equidepth procedure,it produces a smaller number of temporal correlations.

    6.4 HIGEN MINER Scalability

    I also analyzed, on synthetic datasets, the scalability of theHIGEN MINER algorithm, in terms of the execution time. Toanalyze the scalability with the number of dataset records, Iexploited the TPC-H data generator [26] and I varied thescale factor to generate datasets of increasing size, rangingfrom 150,000 to 900,000 records. To test the HIGEN MINER, Iexploited the same configurations for the data generatoralready described in Section 6.2.1. Fig. 8 plots the time spentin HIGEN mining for different support values. Thecomputational complexity significantly increases for lowerminimum support threshold values (e.g., 1 percent) due tothe higher number of extracted itemsets. However, theexecution time scales roughly linearly with the number ofdataset records for any support threshold.

    I also analyzed on synthetic datasets the scalability of theHIGEN MINER with both the taxonomy height and thenumber of time periods. Results, not reported here due to thelack of space, show that HIGEN MINER averagely scalesmore than linearly with both taxonomy height and numberof time periods due to the nonlinear growth of the number of

    considered (potentially relevant) combinations. However,the HIGEN MINER execution time is still acceptable evenwhen coping with quite complex taxonomies evaluated overa number of time periods (e.g., around 20,000 seconds withmin sup 1 percent, time periods 6, taxonomyheight 5,and number of records 300;000).

    7 CONCLUSIONS AND FUTURE WORK

    This paper addresses the problem of change mining in thecontext of frequent itemsets. To represent the evolution ofitemsets in different time periods without discardingrelevant but rare knowledge due to minimum supportthreshold enforcement, it proposes to extract generalizeditemsets characterized by minimal redundancy (i.e., mini-mum abstraction level) in case one itemset becomesinfrequent in a certain time period. To this aim, two novelkinds of dynamic patterns, namely the HIGENs and theNONREDUNDANT HIGENs, have been introduced. Theusefulness of the proposed approach to support user andservice profiling in a mobile context-aware environment hasbeen validated by a domain expert.

    Future works will address: 1) the automatic inferenceand use of multiple taxonomies from data coming from realapplication contexts (e.g., social network data), and 2) thepushing of more complex HIGEN quality constraints intothe mining process.

    REFERENCES[1] R. Agrawal, T. Imielinski, and A. Swami, Mining Association

    Rules between Sets of Items in Large Databases, ACM SIGMODRecord, vol. 22, pp. 207-216, 1993.

    [2] R. Agrawal and G. Psaila, Active Data Mining, Proc. First IntlConf. Knowledge Discovery and Data Mining, pp. 3-8, 1995.

    [3] R. Agrawal and R. Srikant, Mining Generalized AssociationRules, Proc. 21th Intl Conf. Very Large Data Bases (VLDB 95),pp. 407-419, 1995.

    [4] M.L. Antonie, O.R. Zaiane, and A. Coman, Application of DataMining Techniques for Medical Image Classification, Proc. SecondIntl Workshop Multimedia Data Mining (MDM/KDD 01), 2001.

    [5] W.-H. Au and K.C.C. Chan, Mining Changes in AssociationRules: A Fuzzy Approach, Fuzzy Sets Systems, vol. 149, pp. 87-104, Jan. 2005.

    [6] E. Baralis, L. Cagliero, T. Cerquitelli, V. DElia, and P. Garza,Support Driven Opportunistic Aggregation for GeneralizedItemset Extraction, Proc. IEEE Fifth Intl Conf. IntelligentSystems (IS 10), 2010.

    [7] S. Baron, M. Spiliopoulou, and O. Gnther, Efficient Monitoring ofPatterns in Data Mining Environments, Advances in Databases andInformation Systems, L. Kalinichenko, R. Manthey, B. Thalheim,and U. Wloka, eds., vol. 2798, pp. 253-265, Springer, 2003.

    [8] M. Bottcher, D. Nauck, D. Ruta, and M. Spott, Towards aFramework for Change Detection in Datasets, Research andDevelopment in Intelligent Systems XXIII, M. Bramer, F. Coenen,and A. Tuson, eds., pp. 115-128, Springer, 2007.

    [9] D.W.-L. Cheung, J. Han, V. Ng, and C.Y. Wong, Maintenance ofDiscovered Association Rules in Large Databases: An IncrementalUpdating Technique, Proc. 12th Intl Conf. Data Eng. (ICDE 96),pp. 106-114, 1996.

    [10] C. Clifton, R. Cooley, and J. Rennie, Topcat: Data Mining forTopic Identification in a Text Corpus, Proc. Third EuropeanConf. Principles and Practice of Knowledge Discovery in Databases,2002.

    [11] G. Dong, J. Han, J.M.W. Lam, J. Pei, K. Wang, and W. Zou,Mining Constrained Gradients in Large Databases, IEEE Trans.Knowledge and Data Eng., vol. 16, no. 8, pp. 922-938, Aug. 2004.

    [12] G. Dong and J. Li, Mining Border Descriptions of EmergingPatterns from Dataset Pairs, Knowledge and Information Systems,vol. 8, pp. 178-202, Aug. 2005.

    554 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013

    Fig. 8. Scalability of the HIGEN MINER algorithm. Time periods 3.Taxonomy height 3.

  • [13] U.M. Fayyad and K.B. Irani, Multi-Interval Discretization ofContinuous-Valued Attributes for Classification Learning, Proc.13th Intl Joint Conf. Artificial Intelligence, pp. 1022-1029, 1993.

    [14] S.C. Gates, W. Teiken, and K.-S.F. Cheng, Taxonomies by theNumbers: Building High-Performance Taxonomies, Proc. 14thACM Intl Conf. Information and Knowledge Management (CIKM 05),pp. 568-577, 2005.

    [15] P. Giannikopoulos, I. Varlamis, and M. Eirinaki, Mining FrequentGeneralized Patterns for Web Personalization in the Presence ofTaxonomies, Intl J. Data Warehousing and Mining, vol. 6, no. 1,pp. 58-76, 2010.

    [16] J. Han and Y. Fu, Mining Multiple-Level Association Rules inLarge Databases, IEEE Trans. Knowledge and Data Eng., vol. 11,no. 7, pp. 798-805, Sept. 1999.

    [17] J. Hipp, A. Myka, R. Wirth, and U. Guntzer, A New Algorithmfor Faster Mining of Generalized Association Rules, Proc. SecondEuropean Symp. Principles of Data Mining and Knowledge Discovery(PKDD 98), pp. 74-82, 1998.

    [18] T. Imieliski, L. Khachiyan, and A. Abdulghani, Cubegrades:Generalizing Association Rules, Data Mining and KnowledgeDiscovery, vol. 6, pp. 219-257, 2002, doi: 10.1023/A:1015417610840.

    [19] B. Liu, W. Hsu, and Y. Ma, Discovering the set of FundamentalRule Changes, Proc. Seventh ACM SIGKDD Intl Conf. KnowledgeDiscovery and Data Mining, pp. 335-340, 2001.

    [20] B. Liu, Y. Ma, and R. Lee, Analyzing the Interestingness ofAssociation Rules from the Temporal Dimension, Proc. IEEE IntlConf. Data Mining (ICDM), pp. 377-384, 2001.

    [21] Python, Python Website, 2009.[22] L.D. Raedt, Constraint-Based Pattern Set Mining, Proc. SIAM

    Intl Conf. Data Mining, pp. 237-248, 2007.[23] B. Shen, M. Yao, Z. Wu, and Y. Gao, Mining Dynamic

    Association Rules with Comments, Knowledge and InformationSystems, vol. 23, pp. 73-98, Apr. 2010.

    [24] P.-N. Tan, V. Kumar, and J. Srivastava, Selecting the RightInterestingness Measure for Association Patterns, Proc. ACMSIGKDD Intl Conf. Knowledge Discovery and Data Mining (KDD 02),pp. 32-41, July 2002.

    [25] Y. Tao and M.T. Ozsu, Mining Frequent Itemsets in Time-Varying Data Streams, Proc. 18th ACM Conf. Information andKnowledge Management (CIKM 09), pp. 1521-1524, 2009.

    [26] TPC-H, The TPC Benchmark H. Transaction Processing Perfor-mance Council, http://www.tpc.org/tpch/default.asp, 2009.

    [27] K. Verma and O.P. Vyas, Efficient Calendar Based TemporalAssociation Rule, ACM SIGMOD Record, vol. 34, pp. 63-70, Sept.2005.

    Luca Cagliero received the masters degree incomputer and communication networks fromPolitecnico di Torino. He has been workingtoward the PhD degree in the Dipartimento diAutomatica e Informatica at the Politecnico diTorino since January 2008. His current researchinterests include data mining and databasesystems. He has worked on user and serviceprofiling in context-aware systems, ontologyconstruction, and social network analysis by

    using itemset and association rule mining algorithms.

    . For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

    CAGLIERO: DISCOVERING TEMPORAL CHANGE PATTERNS IN THE PRESENCE OF TAXONOMIES 555

    /ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 36 /GrayImageMinResolutionPolicy /Warning /DownsampleGrayImages true /GrayImageDownsampleType /Bicubic /GrayImageResolution 300 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 2.00333 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages false /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 36 /MonoImageMinResolutionPolicy /Warning /DownsampleMonoImages true /MonoImageDownsampleType /Bicubic /MonoImageResolution 600 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.00167 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects false /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile (None) /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName (http://www.color.org) /PDFXTrapped /False

    /CreateJDFFile false /Description >>> setdistillerparams> setpagedevice

Recommended

View more >