Discovering Temporal Change Patterns in the Presence of Taxonomies

Download Discovering Temporal Change Patterns in the Presence of Taxonomies

Post on 04-Dec-2016




0 download

Embed Size (px)


<ul><li><p>Discovering Temporal Change Patternsin the Presence of Taxonomies</p><p>Luca Cagliero</p><p>AbstractFrequent itemset mining is a widely exploratory technique that focuses on discovering recurrent correlations among data.</p><p>The steadfast evolution of markets and business environments prompts the need of data mining algorithms to discover significant</p><p>correlation changes in order to reactively suit product and service provision to customer needs. Change mining, in the context of</p><p>frequent itemsets, focuses on detecting and reporting significant changes in the set of mined itemsets from one time period to another.</p><p>The discovery of frequent generalized itemsets, i.e., itemsets that 1) frequently occur in the source data, and 2) provide a high-level</p><p>abstraction of the mined knowledge, issues new challenges in the analysis of itemsets that become rare, and thus are no longer</p><p>extracted, from a certain point. This paper proposes a novel kind of dynamic pattern, namely the HIstory GENeralized Pattern (HIGEN),</p><p>that represents the evolution of an itemset in consecutive time periods, by reporting the information about its frequent generalizations</p><p>characterized by minimal redundancy (i.e., minimum level of abstraction) in case it becomes infrequent in a certain time period. To</p><p>address HIGEN mining, it proposes HIGEN MINER, an algorithm that focuses on avoiding itemset mining followed by postprocessing by</p><p>exploiting a support-driven itemset generalization approach. To focus the attention on the minimally redundant frequent</p><p>generalizations and thus reduce the amount of the generated patterns, the discovery of a smart subset of HIGENs, namely the NON-</p><p>REDUNDANT HIGENs, is addressed as well. Experiments performed on both real and synthetic datasets show the efficiency and the</p><p>effectiveness of the proposed approach as well as its usefulness in a real application context.</p><p>Index TermsData mining, mining methods and algorithms</p><p>1 INTRODUCTION</p><p>THE problem of discovering relevant data recurrencesand their most significant temporal trends is becomingan increasingly appealing research topic. The application offrequent itemset mining and association rule extractionalgorithms [1] to discover valuable correlations among datahas been thoroughly investigated in a number of differentapplication contexts (e.g., market basket analysis [1],medical image processing [4]). In the last years, the steadygrowth of business-oriented applications tailored to theextracted knowledge prompted the need of analyzing theevolution of the discovered patterns. Since, in manybusiness environments, companies are expected to reac-tively suit product and service provision to customer needs,the investigation of the most notable changes between theset of frequent itemsets or association rules mined fromdifferent time periods has become an appealing researchtopic [2], [9], [12], [19], [23], [27].</p><p>Frequent itemset mining activity is constrained by aminimum support threshold to discover patterns whoseobserved frequency in the source data (i.e., the support) isequal to or exceeds a given threshold [1]. However, theenforcement of low support thresholds may entail generat-ing a very large amount of patterns which may becomehard to look into. On the other hand, higher supportthreshold enforcement may also prune relevant but not</p><p>enough frequent recurrences. To overcome these issues,generalized itemset extraction could be exploited. General-ized itemsets, which have been first introduced in [3] in thecontext of market basket analysis, are itemsets that providea high level abstraction of the mined knowledge. Byexploiting a taxonomy (i.e., a is-a hierarchy) over dataitems, items are aggregated into higher level (generalized)ones. Tables 1a and 1b report two example datasets,associated, respectively, with two consecutive months ofyear 2009 (i.e., January and February). Each recordcorresponds to a product sale. In particular, for each salethe product description, the date, the time, and the locationin which it took place are reported. For the sake ofsimplicity, in this preliminary analysis I am interested inmining generalized itemsets involving just the location orthe product description. Fig. 1 shows a simple hierarchydefined on the location attribute. Table 1 reports the set ofall possible frequent generalized and not generalizeditemsets, and their corresponding absolute support values(i.e., observed frequencies), mined, by means of a tradi-tional approach [3], from the example datasets D1 and D2,by exploiting the taxonomy reported in Fig. 1 and byenforcing an absolute minimum support threshold equal to2. Readers can notice that some of the itemsets satisfy thesupport threshold in both D1 and D2, while the others arefrequent in only one out of two months.</p><p>This paper focuses on change mining in the context offrequent itemsets by exploiting generalized itemsets torepresent patterns that become rare with respect to thesupport threshold, and thus are no longer extracted, at acertain point. Previous approaches allow both keeping trackof the evolution of the most significant pattern qualityindexes (e.g., [2], [7], [8]) and discovering their mostfundamental changes (e.g., [5], [12], [19]). Consider, for</p><p>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013 541</p><p>. The author is with the Politecnico di Torino Dipartimento di Automatica eInformatica, Corso Duca degli Abruzzi 24, Turin 10129, Italy.E-mail:</p><p>Manuscript received 2 May 2011; revised 2 Nov. 2011; accepted 7 Nov. 2011;published online 10 Nov. 2011.For information on obtaining reprints of this article, please send e-mail, and reference IEEECS Log Number TKDE-2011-05-0240.Digital Object Identifier no. 10.1109/TKDE.2011.233.</p><p>1041-4347/13/$31.00 2013 IEEE Published by the IEEE Computer Society</p></li><li><p>instance, the itemsets {T-shirt, Turin} and {Jacket, Paris}. Theformer itemset is infrequent in D1 (i.e., its support value inD1 is lower than the support threshold) and becomesfrequent inD2 due to a support value increase from Januaryto February, while the latter shows an opposite trend. Albeitthese pieces of information might be of interest for businessdecision making, the discovery of higher level correlations,in the form of generalized itemsets, issues new challenges inthe investigation of pattern temporal trends. Item setgeneralization may allow preventing infrequent knowledgediscarding. At the same time, experts may look into theinformation provided by the sequence of generalizations orspecializations of the same pattern in consecutive timeperiods. For instance, the fact that the generalized itemset{T-shirt, Italy} (frequent in both January and February 2009)has a specialization (i.e., {T-shirt, Turin}) that is infrequent inJanuary while becomes frequent in the next month may bedeemed relevant for decision making and, thus, might bereported as it highlights a temporal correlation amongspatial recurrences regarding product T-shirt. Similarly, theinformation that, although {Jacket, Paris} becomes infrequentwith respect to the minimum support threshold movingfrom January to February, its generalization {Jacket, France}remains frequent in both months could be relevant foranalyst decision making as well. However, to the best of myknowledge, this kind of information cannot be neitherdirectly mined nor concisely represented by means of anypreviously proposed approach.</p><p>This paper proposes a novel kind of dynamic pattern,namely theHIstory GENeralized Pattern (HIGENs). AHIGEN</p><p>compactly represents the minimum sequence of general-izations needed to keep knowledge provided by a notgeneralized itemset frequent, with respect to the minimumsupport threshold, in each time period. If an itemset isfrequent in each time period, the corresponding HIGEN justreports its support variations. Otherwise, when the itemsetbecomes infrequent at a certain point, the HIGEN reports theminimumnumber of generalizations (and the correspondinggeneralized itemsets) needed tomake its covered knowledgefrequent at a higher level of abstraction. I argue that afrequent generalization at minimum abstraction level repre-sents the rare knowledge with minimal redundancy. In casean infrequent not generalized itemset has multiple general-izations belonging to the same minimal aggregation level,many HIGENs associated with the same itemset aregenerated. To focus the attention of the analysts on thefrequent generalizations of a rare itemset covering thesame knowledge with a minimal amount of redundancyand, thus, reduce the number of the generated patterns, amore selective type of HIGEN, i.e., the NONREDUNDANTHIGEN, is proposed as well. NONREDUNDANT HIGENs areHIGENs that include, for each time period, the frequentgeneralizations of the reference itemset of minimal general-ization level characterized by minimal support.</p><p>In Table 2 the set of HIGENs, mined from the exampledataset by enforcing an absolute minimum support thresh-old equal to 2, is reported. The reference itemset is writtenin boldface. The relationship % means that the right-handside itemset is a generalization of the left-hand side one,&amp;implies a specialization relationship, whilee&gt; implies that noabstraction level change occurs. For instance, {Jacket, Paris}% {Jacket, France} is a HIGEN stating that, from January (D1)</p><p>542 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 3, MARCH 2013</p><p>TABLE 1Running Example Datasets</p><p>Fig. 1. Generalization hierarchy for the location attribute in exampledatasets D1 and D2.</p><p>TABLE 2Extracted Patterns</p></li><li><p>to February (D2) 2009, the not generalized itemset {Jacket,Paris} becomes infrequent with respect to the minimumsupport threshold while its upper level generalization (seeFig. 1) remains frequent. All the HIGENs reported inTable 2b refer to a different not generalized referenceitemset. In case the extraction would generate more HIGENsper itemset, the generation of the NONREDUNDANT HI-GENs generates just the HIGENs covering the minimalamount of redundancy, i.e., the ones including general-izations with minimal support.</p><p>To address HIGEN mining, the HIstory GENeralizedPattern MINER (HIGEN MINER) has been proposed. Insteadof generating the HIGENs by means of a postprocessingstep that follows the generalized itemset mining processperformed on data collected at each time interval, theHIGEN MINER algorithm directly addresses the HIGENmining by extending a support-driven generalizationapproach, first proposed in [6], to a dynamic context. Theproposed algorithm avoids both redundant knowledgeextraction followed by postpruning and multiple taxonomyevaluations over the same pattern mined from differenttime intervals. A slightly modified version of the HIGENMINER algorithm is also proposed to address NONREDUN-DANT HIGEN mining.</p><p>The proposed approach may be profitably exploited in anumber of different application domains (e.g., context-aware domain, network traffic analysis, social networkanalysis). Experiments, reported in Section 6, show 1) theeffectiveness of the proposed approach in supporting expertdecision making through the analysis of context datacoming from a context-aware mobile environment and2) the efficiency and the scalability of the proposed HIGENMINER algorithm on synthetic datasets.</p><p>This paper is organized as follows. Section 2 discussesand compares state-of-the-art approaches concerning bothchange mining and generalized itemset mining with theproposed approach. Section 3 introduces preliminarynotions, while Section 4 formally states the HIGEN miningproblem. Section 5 describes the HIGEN MINER algorithm.Section 6 evaluates the efficiency and the effectiveness ofthe proposed approach. Finally, Section 7 draws conclu-sions and presents related future work.</p><p>2 RELATED WORK</p><p>Frequent itemset mining has been first proposed in [1], in thecontext of market basket analysis, as the first step of theassociation rule extraction process. The problem of discover-ing relevant changes in the history of itemsets andassociation rules has been already addressed by a numberof research papers (e.g., [2], [5], [7], [8], [12], [19]). Active datamining [2] was the first attempt to represent and query thehistory pattern of the discovered association rule qualityindexes. It first mines rules, from datasets collected indifferent time periods, by adding rules and their relatedquality indexes (e.g., support and confidence [1]) to acommon rule base. Next, the analyst could specify a historypattern in a trigger which is fired when such a pattern trendis exhibiting. Albeit history patterns allow tracking mostnotable pattern index changes, they neither keep theinformation covered by rare patterns nor show links among</p><p>patterns at different abstraction levels. More recently, othertime-related data mining frameworks tailored to monitorand detect changes in rule support and confidence have beenproposed [7], [8], [25]. In [8], patterns are evaluated andpruned based on both subjective and objective interesting-ness measures. The aim of [7] is to monitor the minedpatterns and reduce the effort spent in data mining. Toachieve this goal, new patterns are observed as soon as theyemerge, and old patterns are removed from the rule baseas soon as they become extinct. To further reduce thecomputational cost, at one time period a subset of rules isselected and monitored, while data changes that occur insubsequent periods are measured by their impact on therules being monitored. Similarly, [25] addresses itemsetchange mining from time-varying data streams. However,the approaches presented in [7], [8], and [25] do not addressrare knowledge representation by means of generalizeditemsets. Differently, Liu et al. [20] deal with rule changemining by discovering two main types of rules: 1) the stablerules, i.e., rules that do not change a great deal over time and,thus, are more reliable and could be trusted, and 2) the trendrules, i.e., rules that indicate some underlying systematictrends of potential interest. Similarly, this paper categorizesthe HIGENs based on their temporal trend. However, theirevolution has been classified in terms of itemset general-izations or specializations that occur over time. The problemof discovering the subset of most relevant changes inassociation rules has been addressed by 1) evaluating rulechanges that occur between two time periods by means ofchi-square test [19], 2) searching for border descriptions ofemerging patterns extracted from a pair of datasets [12], and3) applying a fuzzy approach to rule change evaluation [5].Unlike [5], [12], [19], the proposed approach allows bothcomparing more than two time periods at a time andrepresenting patterns that become infrequent in a certaintime period by means of one of their generalizationscharacterized by minimal redundancy.</p><p>A parallel issue concerns the detection of most notablechanges in multidimensional data measures. Imieliski et al.[18] first introduce the concept of cubegrade and comparesthe mult...</p></li></ul>