Paradoxical Correlation Pattern Mining

Wenjun Zhou, Member, IEEE, Hui Xiong, Senior Member, IEEE, Lian Duan, Keli Xiao, and Robert Mee

Abstract—Given a large transactional database, correlation computing/association analysis aims at efficiently finding strongly correlated items. For traditional association analysis, relationships among variables are usually measured at a global level. In this study, we investigate confounding factors that can help to capture abnormal correlation behaviors at a local level. Indeed, many real-world phenomena are localized to specific markets or subpopulations. Such local relationships may not be visible, or may be miscalculated, when the entire data is analyzed collectively. In particular, confounding effects that change the direction of correlation are the most severe problem, because the global correlations alone lead to errant conclusions. To this end, we propose CONFOUND, an efficient algorithm to identify paradoxical correlation patterns (i.e., where controlling for a third item changes the direction of association for strongly correlated pairs) using effective pruning strategies. Moreover, we also provide an enhanced version of this algorithm, called CONFOUND+, which substantially speeds up the confounder search step. Finally, experimental results show that our proposed CONFOUND and CONFOUND+ algorithms can effectively identify confounders, and that their computational performance is orders of magnitude faster than benchmark methods.

Index Terms—Correlation coefficient, correlation computing, partial correlation, local patterns, confounder


1 INTRODUCTION

Given a large transactional database, correlation computing is concerned with the efficient identification of strongly correlated items. As a special case of association analysis, correlation computing has been a core problem in data mining and statistical modeling, and has been successfully applied to various application domains, including market-basket analysis [1], climate studies [2], public health [3], [4], and bioinformatics [5].

Previous work in association mining mainly focused on identifying association patterns in the entire dataset at a global level [6], [7], and little attention has been paid to finding associations that are confounded by other items. Ignoring this problem can easily lead to the identification of false negatives and false positives. False negatives are patterns that are insignificant globally, but significant locally. For example, one drug might have a significant adverse drug reaction only in a certain subpopulation, but not be significant in the entire population. Ignoring such patterns, we may miss potential new discovery opportunities [8], [9]. False positives are patterns that are significant globally, but insignificant locally. Making decisions based on false positive global patterns will lead to ineffective, sometimes harmful, real-world decisions. One example of false positives is known as the confounding effect, where the observed pattern is spurious and attributable to the items' correlation with some other factor. This problem has been found in many applications, including biomedical and social network analysis [10], [11]. Of particular potential harm are confounders that reverse the correlation's direction. For example, when assessed globally, two items are positively correlated; but when controlling for a third item, the two items are in fact negatively correlated. (A more detailed example is discussed in Section 2.1.) In this case, failing to find this control item leads to misunderstanding the true association between the item pair.

To this end, we study the problem of identifying paradoxical correlation patterns, which are defined as reversing confounders that change the sign of the correlation between a pair of other items. Identifying these patterns efficiently and making them available would be useful in a number of applications that involve large-scale data. A good example may be described in the context of recommender systems: when a customer is buying an MP3 player, the same MP3 player in a different color may be recommended because, in the past, many people bought them together as presents for their multiple children. However, if the customer also showed interest in a matching memory card, there is a high chance that the MP3 player is for personal use. In this case, instead of recommending the same MP3 player in a different color, we may have a better chance recommending a protector or earphones. Finding paradoxical correlation patterns may thus help us make more relevant, context-aware recommendations. Our goal is to provide a computational method to efficiently discover a substantially reduced number of probable confounders, which may be further investigated or exploited by domain experts.

W. Zhou and R. Mee are with the Department of Business Analytics and Statistics, University of Tennessee, Knoxville, TN 37996-0532. E-mail: {wzhou4, rmee}@utk.edu.

H. Xiong is with the Management Science and Information Systems Department, Rutgers University, Newark, NJ 07102. E-mail: [email protected].

L. Duan is with the Department of Information Systems and Business Analytics, Hofstra University, Hempstead, NY 11549. E-mail: [email protected].

K. Xiao is with the College of Business, Stony Brook University, Stony Brook, NY 11794-3775. E-mail: [email protected].

Manuscript received 31 July 2017; revised 27 Nov. 2017; accepted 2 Jan. 2018. Date of publication 18 Jan. 2018; date of current version 5 July 2018. (Corresponding author: Hui Xiong.) Recommended for acceptance by Q. He. Digital Object Identifier no. 10.1109/TKDE.2018.2791602


In this study, we use the φ correlation coefficient [12] as the measure of association. Traditional correlation computing techniques aim at finding all pairs of items that are considered positively correlated, i.e., whose φ correlation is above a user-specified threshold θ [13]. We tend to leverage this high correlation for recommendation since co-purchase is promising. However, the global correlation may be misleading because a pair {A, B} may be negatively correlated when controlling for a third item C. In this case, the partial correlation of {A, B} controlled for C may be below −θ. Practically, this means that those who bought C were unlikely to buy both A and B, and those who did not buy C were unlikely to buy both A and B simultaneously, either. This is known as Simpson's paradox. Since C has changed the direction of the correlation between A and B, it is called a reversing confounder of the item pair {A, B}. Our paradoxical correlation pattern mining problem is then to find triplets {A, B | C}, where C is a reversing confounder of the strong pair {A, B}.

It is challenging to effectively and efficiently identify reversing confounders in large-scale datasets. Considering the correlation between any two items, controlled for a third item, we may need to check all possible three-item tuples, making the problem complexity O(n³), so a brute-force approach will not be scalable when the number of items n is very large. Since identifying pair co-occurrence is relatively expensive, and the correlation coefficient of a pair may be required multiple times while computing partial correlations, we may save pairwise correlation values as intermediate results for future look-up. This can save substantial computation time at the cost of O(n²) memory space. However, for a dataset with a large number of items, it may not be practical to meet such a huge memory need.

To discover potential confounders in a more effective and efficient manner (in terms of both time and space), in this paper we study properties of the partial correlation coefficient and use them to evaluate the effect of potential confounders. We first describe three necessary conditions for identifying confounders: the "more-extreme" condition, the "all-strong" condition, and the "compatible-sign" condition. These conditions enable powerful pruning to reduce the number of triplets to check. Based on these conditions, a confounder search algorithm, called CONFOUND, was developed and published in a preliminary version of this paper [14]. The CONFOUND algorithm has two basic steps: strong pair search and confounder search. With improvements in each step, this paper provides further enhancements to CONFOUND, described as the CONFOUND+ algorithm. First, as the "all-strong" condition requires efficiently finding both positive and negative strong pairs, we provide an analysis of the search space for bidirectional all-strong pairs, which reduces the number of bounds to compute. Second, we derive a tight bound for reversing confounders, which helps improve the confounder search efficiency. Finally, we store the positive and negative strong pairs separately, which fully leverages the power of the compatible-sign condition in the confounder search step without repeated checking. Using a number of benchmark datasets, we show that both CONFOUND and CONFOUND+ can effectively and efficiently identify confounders while striking a balance between time and memory usage.

The rest of this paper is organized as follows. In Section 2, we provide preliminaries, including concepts, examples, notation, and computational formulas, based on which we formulate the paradoxical correlation pattern mining problem. Section 3 focuses on the detection of confounders by studying the properties of the correlation and partial correlation coefficients. Based on these properties, our newly developed algorithms, along with the baseline methods, are described in Section 4. Section 5 summarizes experimental results with a focus on scalability. In Section 6, we discuss related work briefly. Finally, we draw conclusions and suggest future work in Section 7.

2 PARADOXICAL CORRELATION PATTERNS

This section introduces the preliminaries, including concepts and notation, computational formulas, and examples.

2.1 Simpson’s Paradox

Simpson's paradox refers to the phenomenon where the overall association pattern is different from its counterpart in each local segment [15], [16]. Assuming that A, B, C are three different events (a bar on top denoting the complement), P(·) represents a probability, and P(·|·) represents a conditional probability, the mathematical definition of Simpson's paradox may be written as follows.

Definition 1 (Simpson's Paradox). It is possible to have P(A|B) < P(A|B̄), and at the same time both P(A|BC) ≥ P(A|B̄C) and P(A|BC̄) ≥ P(A|B̄C̄).

Example 1 (Treatment Selection). Suppose that for a medical condition there are two possible treatments, A and B. We are primarily interested in their success rates as the basis for deciding which treatment to give to a new patient. More specifically, we are looking at the correlation between two binary variables: treatment T (1 if using Treatment A; 0 if using Treatment B) and outcome O (1 if recovered; 0 otherwise). As illustrated in Table 1, considering all patients' data (see the last row), the recovery rate of Treatment A is 71 percent and that of Treatment B is 63 percent. Therefore, one may conclude that Treatment A is better than Treatment B.

However, if we divide the data by the severity status of the patient S (1 if more sick; 0 if less sick), for either group we would find that Treatment B is in fact better than Treatment A (50 versus 38 percent for more sick patients, and 94 versus 85 percent for less sick patients).

This example shows that it is possible for a correlation to have a different direction on the aggregated data than in the local segments. In practice, it is essential that we identify such patterns and avoid making wrong choices.

TABLE 1
Medical Example of Simpson's Paradox

               Trmt A (T = 1)          Trmt B (T = 0)          Corr.
               Treated   Recovered     Treated   Recovered
  More Sick       80        38%           200       50%        -0.11
  Less Sick      200        85%            80       94%        -0.12
  Any            280        71%           280       63%         0.09


2.2 Correlation and Partial Correlation

Suppose we have a large, binary market-basket database D = {T₁, T₂, ..., T_N}, where T_k is a transaction (k = 1, 2, ..., N). Each transaction includes a set of items from the item space I = {I₁, I₂, ..., I_M}. We have T_k ⊆ I (k = 1, 2, ..., N), and I = ∪_{k=1}^{N} T_k. So in D, there are N transactions involving M items.

Then, for any two items A and B, we can produce a two-way contingency table, as shown in Table 2.

TABLE 2
The Two-Way Contingency Table

            B = 1     B = 0     Total
  A = 1     N_AB      N_AB̄      N_A
  A = 0     N_ĀB      N_ĀB̄      N_Ā
  Total     N_B       N_B̄       N

We can see that among the N transactions in D, there are N_AB containing both items A and B, N_AB̄ containing item A but not item B, and so on. With this notation, the φ correlation coefficient between items A and B can be computed as

    φ_AB = (N·N_AB − N_A·N_B) / √(N_A (N − N_A) N_B (N − N_B)).    (1)

Note that φ_AB is not defined if N_A ∈ {0, N} or N_B ∈ {0, N}. The φ coefficient ranges between −1 and 1. Its sign indicates the direction of association, whether positive or negative. A higher absolute value indicates a stronger correlation, whereas a lower absolute value indicates a weaker correlation. Given a user-specified minimum correlation threshold θ, θ ∈ (0, 1), we say items A and B are (strongly) correlated if φ_AB ≥ θ or φ_AB ≤ −θ. Translated into correlation terms, Simpson's paradox may be expressed as follows.

Definition 2 (Simpson's Paradox in Correlation). For any three binary variables A, B, and C, it is possible to have φ_AB > 0, and at the same time both φ_{AB|C=0} < 0 and φ_{AB|C=1} < 0.

In Table 1, there are three binary attributes to consider: treatment (T), outcome (O), and sickness status (S). As shown in the last row of the table, considering all patients, the correlation between treatment (whether receiving Treatment A) and outcome (whether recovered) is φ_TO = 0.09. This indicates that the choice of Treatment A is positively correlated with recovery. When looking at the data by patient status, the correlations between treatment and outcome for more sick patients and less sick patients become φ_{TO|S=1} = −0.11 and φ_{TO|S=0} = −0.12, respectively. This indicates that the choice of Treatment A is negatively correlated with recovery for either group.

An effective measure for discovering patterns showing Simpson's paradox is the partial correlation [17]. Take the three-item scenario as an example: controlling for an item C, the partial correlation between items A and B can be computed as

    φ_{AB|C} = (φ_AB − φ_AC·φ_BC) / √((1 − φ_AC²)(1 − φ_BC²)).    (2)

Note that φ_{AB|C} is not defined if φ_AC = ±1 or φ_BC = ±1.

In fact, the correlation between the treatment and sickness status in Table 1 is φ_TS = −0.43, and the correlation between the outcome and sickness status is φ_OS = −0.44. If we calculate the partial correlation between treatment and outcome, controlled for patient status, we get φ_{TO|S} = −0.11, which is negative. This means the sickness status serves as a confounding factor that reverses the correlation between treatment and outcome.
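To make the two formulas concrete, the following minimal Python sketch (our illustration, not code from the paper; function names are ours) computes φ from the counts in Eq. (1), the partial correlation in Eq. (2), and replays the rounded Table 1 correlations:

    import math

    def phi(n_ab, n_a, n_b, n):
        """Phi correlation of items A and B, Eq. (1).
        n_ab: co-occurrence count; n_a, n_b: item counts; n: number of transactions."""
        denom = n_a * (n - n_a) * n_b * (n - n_b)
        if denom == 0:
            raise ValueError("phi undefined when N_A or N_B is 0 or N")
        return (n * n_ab - n_a * n_b) / math.sqrt(denom)

    def partial_phi(phi_ab, phi_ac, phi_bc):
        """Partial correlation of A and B controlling for C, Eq. (2)."""
        denom = (1 - phi_ac ** 2) * (1 - phi_bc ** 2)
        if denom <= 0:
            raise ValueError("partial phi undefined when phi_AC or phi_BC is +/-1")
        return (phi_ab - phi_ac * phi_bc) / math.sqrt(denom)

    # Rounded correlations reported for Table 1: phi_TO = 0.09, phi_TS = -0.43, phi_OS = -0.44.
    # With these rounded inputs the result is about -0.12 (the paper reports -0.11 from the
    # exact counts); either way the sign flips, illustrating the reversing confounder.
    print(partial_phi(0.09, -0.43, -0.44))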

2.3 Problem Formulation and Scope

In this paper, we formulate the problem of paradoxical correlation pattern mining as follows. Given the data described in Section 2.2, our goal is to find all patterns of the form {A, B | C}, such that A, B, C ∈ I, and item C is a reversing confounder of the item pair {A, B}. Formally, we have the following definition.

Definition 3 (Reversing Confounder). Given the threshold θ, θ ∈ (0, 1), item C is called a reversing confounder of {A, B} if φ_AB ≥ θ and φ_{AB|C} ≤ −θ.

Even though the above definition may easily be extended to the opposite direction (i.e., φ_AB ≤ −θ and φ_{AB|C} ≥ θ), we focus on reversing confounders for positively correlated pairs only, for practical importance and for brevity. The extension to reversing confounders for negatively correlated pairs is relatively straightforward given the symmetry of the problem.

Similar to traditional correlation computing and pattern mining settings, we assume that all items are coded as binary. We also assume that the φ correlation is used for measuring item correlation. The computational methods developed in this paper are based on specific properties of this measure, which may be a limitation. Extension to other association measures will be interesting future work.

Moreover, we limit our scope to confounders identified by first-order constraints on a binary database. In other words, the confounder of an item pair {A, B} is identified as a single item C instead of a group of items. Higher-order constraints may be developed by using AND and OR conjunctions when needed; however, speeding up that process is beyond the scope of this study. Association among multiple items (i.e., beyond item pairs) is not well defined for the correlation measure, making the extension to multi-item correlation and its confounding non-trivial; it is therefore also beyond the scope of this study.

3 DETECTION OF CONFOUNDERS

In this section, we develop useful results that will help us speed up the search for paradoxical correlation patterns.

3.1 Necessary Conditions of Reversing Confounders

In this section, we study the properties of reversing confounders. Specifically, for an item pair {A, B}, we identify necessary conditions for any item C to be its confounder.

Lemma 1 (The "More-Extreme" Condition). Suppose that an item pair {A, B} is positively correlated (i.e., φ_AB ≥ θ, 0 < θ < 1). An item C may be a reversing confounder of {A, B} only if either

    φ_AC ≥ φ_AB and φ_BC ≥ φ_AB,   or   φ_AC ≤ −φ_AB and φ_BC ≤ −φ_AB.    (3)

Proof. For a positively correlated item pair {A, B}, we have φ_AB ≥ θ. If C is a reversing confounder of {A, B}, we have

    φ_{AB|C} = (φ_AB − φ_AC·φ_BC) / √((1 − φ_AC²)(1 − φ_BC²)) ≤ −θ,    (4)

from which we have

    (φ_AC·φ_BC − φ_AB) / √((1 − φ_AC²)(1 − φ_BC²)) ≥ θ.

Therefore,

    φ_AC·φ_BC − φ_AB ≥ θ·√((1 − φ_AC²)(1 − φ_BC²)) > 0.    (5)

Since φ_AC·φ_BC ∈ (−1, 1) and φ_AB ≥ θ > 0, from (5) we have

    0 < φ_AB < φ_AC·φ_BC < 1.

Because |φ_AC| ≤ 1 and |φ_BC| ≤ 1, the product φ_AC·φ_BC can exceed φ_AB only if φ_AC and φ_BC share the same sign and each has absolute value at least φ_AB. Therefore, a necessary condition for (4) is (3). □

Lemma 2 (The "All-Strong" Condition). Item C may be a reversing confounder of {A, B} only if A, B, and C are pairwise correlated (either positively or negatively).

Proof. If C is a reversing confounder of {A, B}, then by definition we have φ_AB ≥ θ. In addition, from the necessary condition derived in Lemma 1, we have either

    φ_AC ≥ φ_AB ≥ θ and φ_BC ≥ φ_AB ≥ θ,   or   φ_AC ≤ −φ_AB ≤ −θ and φ_BC ≤ −φ_AB ≤ −θ.

So pairs {A, C} and {B, C} are also correlated. □

Lemma 3 (The "Compatible-Sign" Condition). For any three items A, B, and C that are pairwise correlated (either positively or negatively), none of them will be a reversing confounder if φ_AB·φ_AC·φ_BC ≤ 0.

Proof. If C is a reversing confounder of {A, B}, then by definition we have φ_AB ≥ θ > 0. In addition, from the necessary condition derived in Lemma 2, we have either

    φ_AC ≥ φ_AB > 0 and φ_BC ≥ φ_AB > 0,   or   φ_AC ≤ −φ_AB < 0 and φ_BC ≤ −φ_AB < 0.

In either case, we have φ_AB·φ_AC·φ_BC > 0. □

These conditions will be useful for effective pruning when designing efficient algorithms.
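As a concrete illustration (our own sketch, not code from the paper), the three conditions translate directly into cheap predicates that can be tested before the exact partial correlation of a candidate triplet is ever computed:

    def more_extreme(phi_ab, phi_ac, phi_bc):
        """Lemma 1: for a positive pair {A,B}, phi_AC and phi_BC must both reach
        phi_AB in magnitude and share the same sign."""
        return (phi_ac >= phi_ab and phi_bc >= phi_ab) or \
               (phi_ac <= -phi_ab and phi_bc <= -phi_ab)

    def all_strong(phi_ab, phi_ac, phi_bc, theta):
        """Lemma 2: all three pairs must be strong, positively or negatively."""
        return all(abs(p) >= theta for p in (phi_ab, phi_ac, phi_bc))

    def compatible_sign(phi_ab, phi_ac, phi_bc):
        """Lemma 3: the product of the three correlations must be positive."""
        return phi_ab * phi_ac * phi_bc > 0

Only triplets that pass all three tests need the exact partial correlation of Eq. (2) to be evaluated.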

3.2 Bidirectional All-Strong-Pairs (BASP) Search

The "All-Strong" condition requires the availability of all strong pairs, both positive and negative. We call the problem of finding both positive and negative strong pairs the bidirectional all-strong-pairs search. Formally, we have the following definition.

Definition 4 (BASP Search). Given θ, find all pairs such that either φ ≥ θ or φ ≤ −θ.

In [13], the all-strong-pairs (ASP) query problem was formulated as finding all positive strong pairs only. An extension is needed when we also need to include strong negative pairs. To achieve this goal, a few results regarding correlation range queries are useful. These results are summarized in Lemma 4; the proof and other details may be found in [18].

Lemma 4 (Correlation Range Query). Given a user-specified minimum correlation threshold θ (0 < θ < 1), for any two items A and B, whose support values are s_A and s_B, respectively, item B may be positively strongly correlated with A only if

    f1(s_A) ≤ s_B ≤ f2(s_A),    (6)

where

    f1(s_A) = θ²·s_A / ((1 − s_A) + θ²·s_A),    (7)

    f2(s_A) = s_A / (s_A + θ²·(1 − s_A)),    (8)

and negatively strongly correlated with A only if

    f3(s_A) ≤ s_B ≤ f4(s_A),    (9)

where

    f3(s_A) = θ²·(1 − s_A) / (s_A + θ²·(1 − s_A)),    (10)

    f4(s_A) = (1 − s_A) / ((1 − s_A) + θ²·s_A).    (11)

Using s_A as the x-axis and s_B as the y-axis, we may visualize item pairs in the [0, 1] × [0, 1] region. Specifically, the ranges mentioned in Lemma 4 may be visualized as the solid lines in Fig. 1. The region between the solid lines is the search space for strong pairs. The extent of bending (and thus the size of the search space) is determined by the parameter θ. In Fig. 1a, any item pair that lies between the f1(x) and f2(x) curves is a candidate positive strong pair. The dashed line represents the case where s_A = s_B, about which the search space is symmetric. In Fig. 1b, any item pair that lies between the f3(x) and f4(x) curves is a candidate negative strong pair. The dashed line represents the case where s_A + s_B = 1, about which the search space is symmetric.

Fig. 1. Search space for all strong pairs.

When searching for BASPs, we overlay the positive and negative bounds; the overall search space is shaded in Fig. 2. Note that we only shade regions in the lower-right half of the square, because this is sufficient for enumerating all candidate pairs due to symmetry (i.e., φ_AB = φ_BA). It is then easy to see Theorem 1.

Theorem 1 (Search Regions for BASP). For any given item A, whose support value is s_A, the search space for item B is

    s_B ∈ [f1(s_A), s_A]                            if s_A ≤ 0.5,
    s_B ∈ [f3(s_A), s_A]                            if 0.5 < s_A ≤ 1/(1 + θ²),
    s_B ∈ [f3(s_A), f4(s_A)] ∪ [f1(s_A), s_A]       otherwise.    (12)

Proof. The proof is omitted, since the search space is easy to read off Fig. 2, and the cutoff values may easily be solved from the equations of the bounds in Lemma 4. □
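A small sketch of how Lemma 4 and Theorem 1 might be coded (our illustration under the notation above; the paper's implementation is in C++ and may differ):

    def f1(s_a, theta): return theta**2 * s_a / ((1 - s_a) + theta**2 * s_a)        # Eq. (7)
    def f2(s_a, theta): return s_a / (s_a + theta**2 * (1 - s_a))                   # Eq. (8)
    def f3(s_a, theta): return theta**2 * (1 - s_a) / (s_a + theta**2 * (1 - s_a))  # Eq. (10)
    def f4(s_a, theta): return (1 - s_a) / ((1 - s_a) + theta**2 * s_a)             # Eq. (11)

    def basp_ranges(s_a, theta):
        """Candidate support ranges for item B with s_B <= s_A, per Eq. (12).
        Returns a list of (low, high) intervals to scan."""
        if s_a <= 0.5:
            return [(f1(s_a, theta), s_a)]
        if s_a <= 1.0 / (1.0 + theta**2):
            return [(f3(s_a, theta), s_a)]
        return [(f3(s_a, theta), f4(s_a, theta)), (f1(s_a, theta), s_a)]

Only items whose support falls inside one of the returned intervals require an exact φ computation.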

3.3 Tight Bound for Reversing Confounders

The "More-Extreme" condition may be considered a loose bound for pruning. In this section, we identify a tight bound and derive a few more pruning rules.

Lemma 5 (First Lower Bound of Partial Correlation). A lower bound for φ_{AB|C} is

    lower(φ_{AB|C}) = (φ_AB − φ_AC·φ_BC) / (1 − φ_AC·φ_BC).    (13)

Proof. Since (φ_AC − φ_BC)² = φ_AC² + φ_BC² − 2·φ_AC·φ_BC ≥ 0, we can see that φ_AC² + φ_BC² ≥ 2·φ_AC·φ_BC. We have

    φ_{AB|C} = (φ_AB − φ_AC·φ_BC) / √((1 − φ_AC²)(1 − φ_BC²))
             = (φ_AB − φ_AC·φ_BC) / √(1 − (φ_AC² + φ_BC²) + φ_AC²·φ_BC²)
             ≥ (φ_AB − φ_AC·φ_BC) / √(1 − 2·φ_AC·φ_BC + φ_AC²·φ_BC²)
             = (φ_AB − φ_AC·φ_BC) / (1 − φ_AC·φ_BC),

the right-hand side of which is a lower bound for φ_{AB|C}. □

Lemma 6 (Second Lower Bound of Partial Correlation). Assuming that |φ_AC| ≤ |φ_BC| without loss of generality, a lower bound for φ_{AB|C} is

    lower(φ_{AB|C}) = (φ_AB − φ_BC²) / (1 − φ_BC²).    (14)

Proof. Since |φ_AC| ≤ |φ_BC|, it is easy to see that φ_AC² ≤ φ_BC². Moreover, since (φ_AC − φ_BC)² = φ_AC² + φ_BC² − 2·φ_AC·φ_BC ≥ 0, we can see that φ_AC·φ_BC ≤ ½(φ_AC² + φ_BC²) ≤ φ_BC². We have

    φ_{AB|C} ≥ (φ_AB − φ_AC·φ_BC) / (1 − φ_AC·φ_BC)    (by Lemma 5)
             = 1 − (1 − φ_AB) / (1 − φ_AC·φ_BC)
             ≥ 1 − (1 − φ_AB) / (1 − φ_BC²)
             = (φ_AB − φ_BC²) / (1 − φ_BC²),

the right-hand side of which is a lower bound for φ_{AB|C}. □

Theorem 2 (Tight Bounds for Reversing Confounder of Positive Pairs). Suppose that an item pair {A, B} is positively correlated (i.e., φ_AB ≥ θ, 0 < θ < 1). Assuming that |φ_AC| ≤ |φ_BC| without loss of generality, an item C may be a reversing confounder of {A, B} only if either

    φ_AC ≥ δ_AB² and φ_BC ≥ δ_AB,   or   φ_AC ≤ −δ_AB² and φ_BC ≤ −δ_AB,    (15)

where δ_AB = √((φ_AB + θ) / (1 + θ)).

Proof. By definition, in order for C to be a confounder of {A, B}, we require φ_{AB|C} ≤ −θ. As a result, the lower bound of φ_{AB|C} (e.g., as derived in Lemma 5 or 6) has to be less than −θ.

First, using the bound derived in Lemma 6, we have

    lower(φ_{AB|C}) = (φ_AB − φ_BC²) / (1 − φ_BC²) < −θ,

from which we have

    φ_BC² > (φ_AB + θ) / (1 + θ).    (16)

Furthermore, using the bound derived in Lemma 5, we have

    lower(φ_{AB|C}) = (φ_AB − φ_AC·φ_BC) / (1 − φ_AC·φ_BC) < −θ,

from which we have

    φ_AC·φ_BC > (φ_AB + θ) / (1 + θ).

If φ_BC > 0, then

    φ_AC > (φ_AB + θ) / ((1 + θ)·φ_BC) ≥ (φ_AB + θ) / (1 + θ).    (17)

If φ_BC < 0, then

    φ_AC < (φ_AB + θ) / ((1 + θ)·φ_BC) ≤ −(φ_AB + θ) / (1 + θ).    (18)

Combining (16), (17), and (18), we arrive at (15). □

Fig. 2. Search space for all strong pairs.


Based on the tight bound derived above, we prove the following corollaries, which will be used to speed up our algorithm.

Corollary 1 (Reversible). If C is a reversing confounder of {A, B}, then among the pairwise correlations φ_AB, φ_AC, and φ_BC, φ_AB has the smallest absolute value.

Proof. This corollary follows directly from the necessary condition derived in Lemma 1. □

Corollary 1 indicates that, for any triplet, only the pair with the smallest absolute correlation may be reversed by the third item. Therefore, whenever we know that three items are pairwise strongly correlated, we only need to check whether the weakest pair is reversed by the third item. We do not need to check the other two possibilities.
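In code form, the tight bound of Theorem 2 becomes a single pair of comparisons (our sketch under the same assumption |φ_AC| ≤ |φ_BC|; names are ours, not the paper's):

    import math

    def delta(phi_ab, theta):
        """delta_AB from Theorem 2."""
        return math.sqrt((phi_ab + theta) / (1 + theta))

    def may_reverse(phi_ab, phi_ac, phi_bc, theta):
        """Necessary condition (15) for C to reverse the positive pair {A,B},
        assuming |phi_AC| <= |phi_BC|; triplets failing it can be pruned."""
        d = delta(phi_ab, theta)
        return (phi_ac >= d * d and phi_bc >= d) or \
               (phi_ac <= -d * d and phi_bc <= -d)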

Corollary 2 (Irreversible Triplet). Suppose that A, B, and C are pairwise positively correlated, and that φ_AB is not the smallest among the pairwise correlations. If φ_AB < 2θ/(1 + θ), then no reversing confounders will be found among A, B, and C.

Proof. From Corollary 1 we know that φ_AB will not be reversed, since it does not have the smallest absolute value. If φ_AC < φ_BC, we only have to check whether φ_AC is reversed by item B; otherwise we only have to check whether φ_BC is reversed by item A.

From Theorem 2, we know that item B may be a reversing confounder of {A, C} only if either

    φ_AB ≥ δ_AC² and φ_BC ≥ δ_AC,   or   φ_AB ≥ δ_AC and φ_BC ≥ δ_AC²,    (19)

where

    δ_AC = √((φ_AC + θ) / (1 + θ)) ≥ √(2θ / (1 + θ)).    (20)

From Eqs. (19) and (20), we find the following necessary condition:

    φ_AB ≥ δ_AC² ≥ 2θ / (1 + θ).    (21)

Symmetrically, we can find that a necessary condition for item A to be a reversing confounder of {B, C} is

    φ_AB ≥ δ_BC² ≥ 2θ / (1 + θ).    (22)

The detailed proof is omitted due to similarity. □

Corollary 3 (Irreversible Pair). Suppose that {B, C} is a positive strong pair, and φ_AB ≤ φ_AC ≤ −θ. Then the pair {B, C} cannot be reversed if φ_AB > −√(2θ/(1 + θ)) or φ_AC > −2θ/(1 + θ).

Proof. From Theorem 2, we know that item A may be a reversing confounder of {B, C} only if

    φ_AC ≤ −δ_BC² and φ_AB ≤ −δ_BC,    (23)

where δ_BC = √((φ_BC + θ)/(1 + θ)) ≥ √(2θ/(1 + θ)). Therefore, the necessary conditions are

    φ_AB ≤ −δ_BC ≤ −√(2θ/(1 + θ)),    (24)

and

    φ_AC ≤ −δ_BC² ≤ −2θ/(1 + θ).    (25)

Note that both conditions have to be met. □

4 ALGORITHM DESCRIPTIONS

In this section, we describe algorithms for identifying confounders from a transactional database. We first describe two straightforward solutions as baselines, then the CONFOUND algorithm we designed based on the initial set of necessary conditions. Finally, we describe CONFOUND+, an enhanced version of CONFOUND with further pruning due to the newly found tight bound.

For each algorithm described in this section, we assume that the input variables are I, the reverse-indexed transactional database (i.e., a list of item IDs, each followed by all transaction IDs in which the item appears), and θ, the minimum correlation threshold. The output is a set of patterns in the format {A, B | C}, where item C is a reversing confounder of the positively correlated item pair {A, B}. For brevity, by default A, B, and C represent items, and a, b, and c represent their respective indices in I.

4.1 Baseline Algorithms

In this section, we describe two straightforward solutions as the baseline.

First, a "brute-force" approach is described in Algorithm 1. The loops starting in Lines 2 and 3 ensure that we iterate over each possible item pair {A, B}. The fact that items are sorted by decreasing support (Line 1) ensures that s_A ≥ s_B. Despite the name "brute force," we perform basic pruning in Line 4, which is based on a computation-friendly upper bound of φ [19]. If the upper bound is above the threshold θ, we compute the exact correlation φ_AB (Line 5); otherwise, there is no need to compute the exact correlation. In fact, if the upper bound of φ_AB is below the threshold θ, no strong pair will be found between A and any B′ such that s_B′ ≤ s_B [13]. This is why we may break from the inner loop in Line 4. When A and B are not positively correlated, there is no need to search for their confounders (Line 6). Otherwise, a third loop starting from Line 7 iterates over each remaining item C, computes its correlations with items A and B (Lines 9-10), and then checks the partial correlation to see whether the pair {A, B} is reversed (Lines 11-12).

Note that the computation of pairwise correlations in Algorithm 1 involves a lot of repeated work. For instance, the correlation φ_AC computed in Line 9 may have to be computed again in Line 5 in a later iteration. The computation of the exact value of the φ correlation proves to be expensive [19]. Thus, it is desirable to store pairwise correlations for repeated use, which motivates us to design a save-and-lookup approach, as described in Algorithm 2.

Algorithm 2 has two phases. In the initialization phase (Phase I, Lines 1-4), the correlation of each possible item pair is computed and saved for later use. Then, in the confounder search phase (Phase II, Lines 5 through 14), we iterate over each possible item pair {A, B} and check against each remaining item C to see whether C is a confounder of {A, B}. The difference from Algorithm 1 is that, instead of re-computing the φ correlations, we simply look them up from the pre-computed results. This approach alleviates the computation cost by using more memory space.

Algorithm 1. The Brute-Force Approach

Input: I: an item list representing a reverse-indexed transactional database; θ: a user-specified minimum correlation threshold.
Output: all reversing confounder patterns.
 1  sort I by non-increasing item support;
 2  for a ← 1 to (M − 1) do
 3      for b ← (a + 1) to M do
 4          if upper(φ_AB) < θ then break;
 5          find N_AB and calculate φ_AB;
 6          if φ_AB < θ then continue;
            // Find all reversers of strong pair {A, B}
 7          for c ← 1 to M do
 8              if C = A or C = B then continue;
 9              find N_AC and calculate φ_AC;
10              find N_BC and calculate φ_BC;
11              calculate φ_{AB|C};
12              if φ_{AB|C} ≤ −θ then output {A, B | C};
13          end
14      end
15  end

Algorithm 2. The Save-and-Lookup Approach

Input: I: an item list representing a reverse-indexed transactional database; θ: a user-specified minimum correlation threshold.
Output: all reversing confounder patterns.
    // Phase I: Compute and save pairwise correlations
 1  pre-allocate fixed storage space for all pairs;
 2  foreach item pair {A, B} do
 3      find N_AB, compute φ_AB, and save it;
 4  end
    // Phase II: Search for reversing confounder patterns
 5  foreach item pair {A, B} do
 6      look up φ_AB;
 7      if φ_AB < θ then continue;
        // Find all reversers of strong pair {A, B}
 8      for c ← 1 to M do
 9          if c = a or c = b then continue;
10          look up φ_AC and φ_BC by index;
11          calculate φ_{AB|C};
12          if φ_{AB|C} ≤ −θ then output {A, B | C};
13      end
14  end
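The upper and lower bounds of φ used for pruning in Line 4 of Algorithm 1 (and in Line 3 of Algorithm 3 below) come from [13], [19]. The sketch below reconstructs them from Eq. (1) by substituting the extreme values of N_AB; treat the exact algebraic form as our derivation rather than a quotation from those papers:

    import math

    def phi_bounds(n_a, n_b, n):
        """Support-based bounds of phi_AB that need no pair scan.
        N_AB is at most min(N_A, N_B) and at least max(0, N_A + N_B - N),
        so substituting these into Eq. (1) bounds phi_AB from above and below."""
        denom = math.sqrt(n_a * (n - n_a) * n_b * (n - n_b))
        if denom == 0:
            raise ValueError("phi undefined when N_A or N_B is 0 or N")
        upper = (n * min(n_a, n_b) - n_a * n_b) / denom
        lower = (n * max(0, n_a + n_b - n) - n_a * n_b) / denom
        return lower, upper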

4.2 The CONFOUND Algorithm

The above two baseline methods represent the two extremes of time and space usage for discovering confounders. With the simple necessary conditions developed in Section 3.1, we design a new algorithm, called CONFOUND, which provides a balance between computation time and memory space.

The CONFOUND algorithm is described in Algorithm 3. Similar to Algorithm 2, CONFOUND has two phases: correlation computing (Lines 1 through 10) and confounder searching (Lines 11 through 36). In the correlation computing phase, our goal is to find and save all (i.e., positively or negatively) correlated pairs. Here we make use of both the upper and lower bounds of φ [13] to speed up the computation (Line 3), as we need to find not only positive but also negative pairs. Specifically, if a pair's upper and lower bounds are not large enough in absolute value, there is no chance for its actual correlation to be strong. In this case, we can safely skip the computation of the exact correlation value for the pair.

Algorithm 3. CONFOUND

Input: I: an item list representing a reverse-indexed transactional database; θ: a user-specified minimum correlation threshold.
Output: all reversing confounder patterns.
    // Phase I: Identify all strong pairs
 1  initialize L as an empty linked list;
 2  foreach item pair {A, B} do
 3      if upper(φ_AB) ≥ θ or lower(φ_AB) ≤ −θ then
 4          find N_AB and compute φ_AB;
 5          if φ_AB = ±1 then continue;
 6          if φ_AB ≥ θ or φ_AB ≤ −θ then
 7              save pair {A, B} to L;
 8          end
 9      end
10  end
    // Phase II: Identify reversing confounders
11  foreach a ∈ L.keys() do
12      l ← L[a].length();
13      if l < 2 then continue;
14      for i ← 1 to (l − 1) do
15          (b, φ_AB) ← L[a][i];
16          if b ∉ L.keys() then continue;
17          for j ← (i + 1) to l do
18              (c, φ_AC) ← L[a][j];
19              look up φ_BC from L;
20              if φ_BC is not found then continue;
21              if φ_AB·φ_BC·φ_AC ≤ 0 then continue;
22              calculate φ_{AB|C};
23              if φ_AB ≥ θ and φ_{AB|C} ≤ −θ then
24                  output {A, B | C}
25              end
26              calculate φ_{AC|B};
27              if φ_AC ≥ θ and φ_{AC|B} ≤ −θ then
28                  output {A, C | B}
29              end
30              calculate φ_{BC|A};
31              if φ_BC ≥ θ and φ_{BC|A} ≤ −θ then
32                  output {B, C | A}
33              end
34          end
35      end
36  end

In addition, we save pairwise correlations in a linked-list structure, L (Line 7). For an item pair {A, B}, where A = I[a] and B = I[b], we save φ_AB in L[b] if b > a.¹ In practice, strongly correlated pairs tend to be a small fraction of all possible pairs. Therefore, the compactness of a linked list saves space when only strong pairs need to be saved. This compactness saves us time both in pairwise correlation computing and in confounder searching.

1. Here we are using lexical order. Using another ordering is also possible as long as it is consistent.

In the confounder search phase, we iterate over the pairwise-correlated triplets only, based on the necessary conditions stated in Lemmas 2 and 3. First, we iterate over the keys of L (Line 11). For a key a, the iteration is skipped unless there are at least two elements in L[a] (Line 13). The reason is that if items A = I[a], B = I[b], and C = I[c] are pairwise correlated, assuming a > b > c without loss of generality, we should have (b, φ_AB) and (c, φ_AC) in L[a] and (c, φ_BC) in L[b]. If b is not strongly correlated with any item, we skip all pairs that may involve B (Line 16). Then, in Line 20 we skip the iteration if φ_BC is not found in L[b], which means that B and C are not strongly correlated. Finally, additional pruning is done in Line 21, based on Lemma 3.
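The sketch below mimics Phase II's data layout with a Python dictionary standing in for the linked list L (our illustration only; the actual implementation is a C++ linked list). Each strong pair {A, B} with b > a is stored once under the larger index, and pairwise-strong triplets are enumerated by joining the partner lists:

    from collections import defaultdict

    strong = defaultdict(list)            # strong[b] holds (a, phi_AB) for strong pairs with b > a

    def add_strong_pair(a, b, phi_ab):
        lo, hi = min(a, b), max(a, b)
        strong[hi].append((lo, phi_ab))   # saved once, under the larger index (footnote 1)

    def candidate_triplets():
        """Yield triplets whose three pairs are all strong (the all-strong condition)."""
        for a, partners in strong.items():
            if len(partners) < 2:         # Line 13: A must have at least two strong partners
                continue
            for i in range(len(partners) - 1):
                b, phi_ab = partners[i]
                for j in range(i + 1, len(partners)):
                    c, phi_ac = partners[j]
                    hi, lo = max(b, c), min(b, c)
                    phi_bc = next((p for x, p in strong.get(hi, []) if x == lo), None)
                    if phi_bc is not None:        # Line 20: {B, C} must also be strong
                        yield a, b, c, phi_ab, phi_ac, phi_bc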

4.3 The CONFOUND+ Algorithm

In this section, we describe an enhanced version of the CONFOUND algorithm, named CONFOUND+, which is based on the additional screening rules derived in this paper.

As described in Algorithm 4, the CONFOUND+ algorithm also has two phases: identifying all strong pairs (Line 2), and searching for reversing confounders (Lines 3-4).

Algorithm 4. CONFOUND+

Input: I: an item list representing a reverse-indexed transactional database; θ: a user-specified minimum correlation threshold.
Output: all reversing confounder patterns.
    // Phase I: Identify all strong pairs, saved by direction
 1  initialize L+ and L− as two empty linked lists;
 2  (L+, L−) ← TraverseSearchRegion(I);
    // Phase II: Identify reversing confounders
 3  CheckTripletsPos(L+);        // positive pairs only
 4  CheckTripletsNeg(L−, L+);    // negative pairs

When identifying the strong pairs, unlike the basic CONFOUND algorithm, we save the positively strong pairs and the negatively strong pairs separately. Moreover, we developed a sequential traversal procedure to maximally reduce unnecessary computing. The pseudocode is detailed in Appendix A. The strength of this sequential search procedure, compared to the direct search in Algorithm 3, is that we avoid computing too many bounds. The search in Algorithm 3 needs to compute both the upper and the lower bounds for each pair of items, whereas in Algorithm 4 the search range for item B is computed only once for each given item A, after which we can deterministically iterate item B within the specified range.

We are able to search among positive pairs and negative pairs separately because of Lemma 3. There are two situations in which a reversing confounder can be found. First, all three items are pairwise positively correlated; in this case we only need to search in L+, the linked list of positive pairs (see Procedure CheckTripletsPos). Second, one pair is positive and the other two pairs are negatively strong; in this case we iterate over each pair of strong negative pairs, and then check whether the third pair can be found in L+ (i.e., is positively strong) and is reversed (see Procedure CheckTripletsNeg).

Procedure CheckTripletsPos(L+)

 1  foreach a ∈ L+.keys() do
 2      l ← L+[a].length();
 3      if l < 2 then continue;
 4      sort L+[a] by decreasing φ;
 5      for i ← 1 to (l − 1) do
 6          (b, φ_AB) ← L+[a][i];
 7          if φ_AB < 2θ/(1 + θ) then break;
 8          for j ← (i + 1) to l do
 9              (c, φ_AC) ← L+[a][j];
10              look up φ_BC from L+;
11              if φ_BC is not found then continue;
12              if φ_BC < φ_AC then
                    // Check if {B, C} is reversed by A
13                  calculate φ_{BC|A};
14                  if φ_{BC|A} ≤ −θ then output {B, C | A};
15              else
                    // Check if {A, C} is reversed by B
16                  calculate φ_{AC|B};
17                  if φ_{AC|B} ≤ −θ then output {A, C | B};
18              end
19          end
20      end
21  end

In Procedure CheckTripletsPos, for each backbone item A, we require at least two correlated items (Line 3), because if A is correlated with only one item, the all-strong condition cannot be met. We then iterate over each pair of items (B and C) that are correlated with A (see the two levels of loops starting in Lines 5 and 8, respectively). Since the correlated items are sorted by decreasing correlation (Line 4), we know φ_AB ≥ φ_AC, and therefore φ_AB is not reversible (Corollary 1). If φ_AB is less than 2θ/(1 + θ), then the search related to item A ends (Line 7). This is made possible by Corollary 2. The pruning in Line 11 is guaranteed by the all-strong condition. Since only the smallest correlation may be reversed (see Corollary 1), Lines 12 and 15 check whether the smaller of φ_BC and φ_AC is reversed. There is no need to check φ_AB because it is irreversible (see Corollary 2).
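A compact rendering of this positive-pair case (our sketch; `lookup` is a hypothetical helper returning φ_BC from L+ or None, and `partial_phi` is the Eq. (2) helper sketched in Section 2.2):

    def check_triplets_pos(strong_pos, theta, lookup, partial_phi):
        """Sketch of Procedure CheckTripletsPos. Only the weakest of the three
        positive correlations can be reversed (Corollary 1), and the scan for a
        backbone item A stops once phi_AB drops below 2*theta/(1+theta) (Corollary 2)."""
        cutoff = 2 * theta / (1 + theta)
        patterns = []
        for a, partners in strong_pos.items():
            if len(partners) < 2:
                continue
            partners = sorted(partners, key=lambda t: t[1], reverse=True)   # decreasing phi
            for i, (b, phi_ab) in enumerate(partners[:-1]):
                if phi_ab < cutoff:
                    break                                   # Corollary 2: nothing reversible beyond here
                for c, phi_ac in partners[i + 1:]:
                    phi_bc = lookup(strong_pos, b, c)
                    if phi_bc is None:
                        continue                            # {B, C} not strong: all-strong fails
                    if phi_bc < phi_ac:                     # {B, C} is the weakest pair
                        if partial_phi(phi_bc, phi_ab, phi_ac) <= -theta:
                            patterns.append((b, c, a))      # pattern {B, C | A}
                    else:                                   # {A, C} is the weakest pair
                        if partial_phi(phi_ac, phi_ab, phi_bc) <= -theta:
                            patterns.append((a, c, b))      # pattern {A, C | B}
        return patterns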

The second case of a reversing confounder is that one pair is positive and the other two pairs are negatively strong. The search procedure, outlined in Procedure CheckTripletsNeg, is similar to Procedure CheckTripletsPos, but there are a few differences. First, the pruning happens at two places (Lines 7 and 10); these conditions were proved in Corollary 3. Second, the negative pairs are saved with some redundancy for easy enumeration of candidates: a strong negative pair {A, B} is saved both as (a, φ_AB) in L−[b] and as (b, φ_AB) in L−[a]. Finally, we only need to check whether the positive pair {B, C} is reversed by A.

Procedure CheckTripletsNeg(L−, L+)

 1  foreach a ∈ L−.keys() do
 2      l ← L−[a].length();
 3      if l < 2 then continue;
 4      sort L−[a] by increasing φ;
 5      for i ← 1 to (l − 1) do
 6          (b, φ_AB) ← L−[a][i];
 7          if φ_AB > −√(2θ/(1 + θ)) then break;
 8          for j ← (i + 1) to l do
 9              (c, φ_AC) ← L−[a][j];
10              if φ_AC > −2θ/(1 + θ) then break;
11              look up φ_BC from L+[b];
12              if φ_BC is not found then continue;
                // Check if {B, C} is reversed by A
13              calculate φ_{BC|A};
14              if φ_{BC|A} ≤ −θ then output {B, C | A};
15          end
16      end
17  end

5 EXPERIMENTAL RESULTS

In this section, we first briefly summarize the experimental setup and then present the experimental results.

Datasets. The Frequent Itemset Mining Implementations (FIMI) repository (http://fimi.cs.helsinki.fi/data/) provides a collection of datasets that are often used as benchmarks for evaluating frequent pattern mining algorithms. Table 3 lists the names and sizes of these datasets.

Platform. All experiments were performed on a Dell Optiplex 980 desktop PC with a 3.47 GHz Intel Core i5 CPU and 8 GB of RAM. The operating system is Microsoft Windows 10 Pro. All programs were implemented in C++ and were compiled and run in Cygwin.

Regarding the evaluation of effectiveness, we compared the output of each algorithm to ensure that the patterns found were correct and complete. Since all algorithms described in this paper produce the same output (for a given dataset and parameter), the remaining evaluations focus on scalability, including both computation and memory efficiency. To control for measurement errors in running time, for each experimental setting (i.e., dataset, parameter, algorithm) we ran five runs and took the median.

5.1 Computation Efficiency of CONFOUND

In this section, we focus on a comparison of computation efficiency. Fig. 3 shows a comparison of the total running time (in seconds, on a logarithmic scale) of CONFOUND and the two baseline methods for each dataset under various threshold values. We show a range of lower θ values, from 0.00 to 0.30, because for higher thresholds the running time of CONFOUND and CONFOUND+ gets close to zero, making it hard to measure reliably and to show on a logarithmic scale. We can see that for any given θ, the save-and-lookup approach (SL) and the CONFOUND algorithm (CF) are both significantly faster than the brute-force approach (BF). Moreover, the CONFOUND algorithm is consistently faster than the save-and-lookup approach. The clear separation between their curves on the log scale indicates that the performance gap is large.

The performance gap between CONFOUND and the brute-force algorithm varies with the threshold. The smaller θ is, the greater the computational savings, as the number of strong pairs grows quickly with lower θ. The running time of the save-and-lookup algorithm is expected to be relatively stable across different parameters. For some datasets, its running time is elevated at lower thresholds due to the need to check confounders for more strong pairs. It is perhaps surprising that the save-and-lookup algorithm is slower than CONFOUND, even though we expected it to be memory-intensive but computation-efficient. The fact that CONFOUND outperforms a memory-intensive approach shows that our proposed method is highly scalable.

5.2 Number of Pairs to Compute and Save

TABLE 3
Experimental Datasets

  Dataset           # Items     # Pairs       # Transactions
  accidents            468        109,278        340,183
  BMS-POS            1,657      1,371,996      3,367,020
  BMS-WebView-1        497        123,256        149,639
  BMS-WebView-2      3,340      5,576,130        358,278
  pumsb              2,113      2,231,328         49,046
  pumsb_star         2,088      2,178,828         49,046
  T10I4D100K           870        378,015        100,000
  T40I10D100K          942        443,211        100,000

Fig. 3. Comparison of running time: x-axis: θ; y-axis: time in seconds, shown on a logarithmic scale.

Having seen that the brute-force method is not scalable (in spite of being frugal in memory use), Fig. 4 shows a comparison of space usage among the two-phase methods (i.e., save-and-lookup, CONFOUND, and CONFOUND+), all of which require saving pairs in memory. The number of all pairs (black lines) is the number of pairs the save-and-lookup algorithm needs to save. It equals M(M − 1)/2, where M is the number of unique items; therefore, it does not depend on the threshold. The number of strong pairs (green lines) is the number of pairs the CONFOUND and CONFOUND+ algorithms need to save. Note that the y-axis representing the number of pairs is shown on a logarithmic scale. It is clear that there is a substantial gap between the black and green lines. When θ increases, the number of strong pairs decreases quickly. We can conclude that the CONFOUND and CONFOUND+ algorithms are memory efficient.

Also shown in Fig. 4 are the numbers of bounds computed by CONFOUND (red lines) and CONFOUND+ (blue lines), respectively, during the strong pair search stage. We can see that they are almost constant across thresholds. The number of bounds computed by CONFOUND is even larger than the total number of pairs (black line), because for each pair of items we have to compute both the upper and the lower bounds of the correlation, making the number of bounds twice as many. In contrast, thanks to our sequential traversal procedure, CONFOUND+ reduces this number to a small fraction.

5.3 A Comparison of the Two-Phase Methods

As mentioned earlier, the save-and-lookup, CONFOUND, and CONFOUND+ algorithms all have a two-phase structure: strong pair search and reversing confounder search. In this section, we compare the running time by phase.

Fig. 5 shows a comparison of the running time (in seconds, on a logarithmic scale) of Phase I (i.e., BASP search). We can see that the save-and-lookup algorithm is the slowest and is almost constant across all thresholds. CONFOUND and CONFOUND+ are more efficient, and their running time decreases sharply as the threshold increases.

Fig. 4. Comparison of space usage and bound computation: x-axis: θ; y-axis: number of pairs, shown on a logarithmic scale.

Fig. 5. Comparison of BASP search time: x-axis: θ; y-axis: time in seconds, shown on a logarithmic scale.

From Fig. 5 we can see that the BASP search time of CONFOUND+ is only slightly shorter than that of CONFOUND. For some datasets, their curves almost overlap. Since CONFOUND+ uses the sequential search and CONFOUND computes two bounds for all pairs, when the two lines overlap it means the savings from avoiding bound computations did not matter much, as the computation of pair correlations took far longer. Indeed, as mentioned earlier, the computation of the exact correlation is time consuming, whereas computing a bound is relatively effortless. The size of the performance gap depends on the relative cost of computing the correlation versus computing the bounds. In Fig. 5, the save-and-lookup method may take longer than the others when the number of pairs saved is so large that looking them up takes longer than computing them. This typically happens with dense data at lower thresholds.

Fig. 6. Comparison of confounder search time: x-axis: θ; y-axis: time in seconds, shown on a logarithmic scale.

Fig. 6 shows a comparison of the running time (in seconds, on a logarithmic scale) of Phase II (i.e., confounder searching). We can see that the search time of CONFOUND (green lines) is much shorter than that of the save-and-lookup algorithm (red lines) for most datasets and parameters, and the performance gap increases as the threshold θ increases. This is significant because the save-and-lookup algorithm uses a pre-allocated (fixed) memory space, which is efficient for lookup, whereas CONFOUND, which uses a linked list (a variable structure), was expected to be more expensive for lookup. This shows that selectively saving just the strong pairs rather than all pairs pays off not only in space efficiency but also in computation efficiency. Note that we show a range of even lower θ values because for higher thresholds the confounder search time gets close to zero (for lack of strong pairs), making it hard to show on a logarithmic scale. In Fig. 6, the confounder search time of CONFOUND may be longer than that of the save-and-lookup method for certain datasets, especially for lower threshold values. The reason is that the CONFOUND algorithm incurs computational overhead such as bound calculation and linked-list maintenance.

Fig. 6 also shows that CONFOUND+ (blue lines) is much faster than CONFOUND in the confounder search phase, thanks to the separate search procedures among positive pairs and among negative pairs, and to the more powerful pruning. Even though the confounder search time is relatively short compared to the strong pair computing time, the savings in Phase II are very helpful in situations where the strong pairs are incrementally maintained [20] and the confounder search is frequent.

6 RELATED WORK

Correlation computing has been an active research topic in the past few decades. Using Pearson's φ coefficient, a computation-friendly upper bound was discovered and first utilized in [13], [19] for identifying all strong pairs. Leveraging this bound, efficient correlation range queries were studied in [18], [21]. An extension to dynamic data environments was studied by leveraging projected bounds of φ that anticipate a buffer of future transactions, resulting in efficient volatile correlation computing [20], [22]. Other association measures have been studied in the data mining literature as well, such as confidence [7], lift [6], and the cosine measure [23].

Simpson's paradox generally applies to most association measures, including those mentioned above. Therefore, it is important to identify the discrepancy between local and global patterns. Along this line, we found the following interesting studies with related concepts.

Contrast sets [24] are subgroups of market-basket data such that the same pattern behaves differently in different subgroups. The problem of finding group differences has been studied intensively in [25], [26], and [27] applied contrast set mining to analyzing brain ischaemia data. One study showed that contrast-set mining is a special case of the rule-discovery framework [28]. However, contrast sets emphasize between-group differences, but not necessarily local-versus-global differences. In particular, contrast sets cannot discover local patterns exhibiting Simpson's paradox [29]; that is, patterns that behave differently in each local segment than at the global level but appear similar among the local segments.

Niches [30] are defined as surprising association rules that contradict set routines, which consist of a number of dominant trends. Both the dominant trends and the niches can be captured by emerging patterns [31], which present significant differences between different classes. Although the concept of emerging patterns is closely related to contrast sets, the niche mining problem tries to discover a small number of exceptions (niches) that contradict the majority (set routines), which reflects the idea of discovering significant local patterns that differ from global ones. However, the problem setting is different from ours, since it focuses on one attribute as the class label (e.g., risky customers versus non-risky customers).

7 CONCLUSIONS

In this paper, we investigated how to efficiently discover paradoxical correlation patterns. We first identified three necessary conditions for reversing confounders, which helped us prune the search space and develop an algorithm called CONFOUND. CONFOUND proved to be efficient in both memory space and computation when compared with two baseline methods. Moreover, we added further enhancements to CONFOUND and put forward the CONFOUND+ algorithm, which substantially improves confounder search efficiency. The experimental results on benchmark datasets showed that the proposed CONFOUND and CONFOUND+ algorithms can effectively identify confounders, and their computational performance can be orders of magnitude faster than the baseline methods.

Many extensions are possible for future work. First, it will be interesting to extend the discovery of reversing confounders to other types of confounders, including false positives (i.e., correlated globally but not correlated when controlled by the confounding variable) and false negatives (i.e., uncorrelated globally but correlated locally). Second, as a common challenge for all work in association pattern mining, it will be interesting to investigate how to integrate the discovered patterns into an overall predictive or decision optimization model; for example, paradoxical correlation patterns may be used to adjust algorithms in recommender systems. Finally, as the φ correlation coefficient only measures the strength of correlation between binary variables, it would be useful to extend our framework to other association measures, including those for other variable types, and to consider the statistical significance of the correlation patterns.

APPENDIX A

TRAVERSING THE SEARCH REGIONS

Even though the search region may be expressed with three ranges as shown in Eq. (12), the actual computation of positive or negative strong pairs is further sliced into five ranges of s_A, or ten ranges of (s_A, s_B) combinations, as annotated in Fig. 7.

Procedure. TraverseSearchRegion(I)

1 Sort I by non-decreasing item support;
2 a ← 2          // Initialize the index of item A
3 RegionOne      // a increases until s_A ≤ u/(1+u)
4 RegionTwo      // a increases until s_A ≤ 0.5
5 RegionThree    // a increases until s_A ≤ 1/(1+u)
6 RegionFour     // a increases until s_A ≤ 1/(1+u²)
7 RegionFive     // all others

The overall process of traversing the search region is outlined in Procedure TraverseSearchRegion. Since we only need to search the region where s_A ≥ s_B, in Line 2 we initialize the index of item A as the second least frequent item. Note that unlike Algorithm 1, which sorts I by non-increasing order of item support, here we choose to start from the least frequent items, as most real-world data are highly skewed to the right. In other words, there is a large number of infrequent items, whereas the number of high-support items tends to be very small or zero. Therefore, if we traverse from the least frequent items, we are likely to get the majority of the work done early on, and the search can stop early if there is no high-support item.

In Lines 3 through 7, we traverse the five regions of s_A sequentially. The detailed traversal pseudocode within each region is provided in Procedures RegionOne through RegionFive.

Procedure. RegionOne

1 while a ≤ M and s_A ≤ u/(1+u) do
2   b ← a - 1;
3   while s_B ≥ f1(s_A) do
4     find N_AB and compute φ_AB;
5     if φ_AB ≥ u then save pair {A, B} to L+;
6     b ← b - 1;
7   end
8   a ← a + 1;
9 end

Line 1 in Procedure RegionOne specifies the range of s_A that defines Region 1 in Fig. 7. M is the total number of unique items, and the upper bound of Region 1, u/(1+u), is the value of s_A such that f3(s_A) = s_A. In this region, all item pairs are candidates for positive strong pairs but not negative strong pairs. Therefore, we only need to check whether the correlation of each pair is above u and insert the pair into L+ if true (Line 5).
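To make the scan concrete, here is a minimal Python rendering of the Region-1 loop. It is our own sketch, not the authors' implementation: `supports` is assumed sorted in non-decreasing order, `pair_support(a, b)` is a hypothetical helper returning s_AB, and `f1` is the positive-candidacy lower bound on s_B (a candidate closed form is sketched after the next paragraph).

import math

def phi_from_supports(s_ab, s_a, s_b):
    """phi coefficient expressed in supports (fractions of the N transactions)."""
    return (s_ab - s_a * s_b) / math.sqrt(s_a * (1 - s_a) * s_b * (1 - s_b))

def region_one(supports, pair_support, u, f1):
    """Scan Region 1: only positive strong pairs are possible here,
    so only the +u check is performed before saving to L_plus."""
    L_plus = []
    a = 1                                   # second least frequent item (0-based index)
    while a < len(supports) and supports[a] <= u / (1 + u):
        b = a - 1
        while b >= 0 and supports[b] >= f1(supports[a], u):
            phi_ab = phi_from_supports(pair_support(a, b), supports[a], supports[b])
            if phi_ab >= u:
                L_plus.append((a, b, phi_ab))
            b -= 1
        a += 1
    return L_plus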

As the index of item A (i.e., a) continues to increment, the search continues into Region 2 (see Procedure RegionTwo). The upper bound for Region 2, 0.5, is the value of s_A such that f1(s_A) = f3(s_A), which is easy to see in Fig. 7 by symmetry. Region 2 is further divided into two subregions, 2a and 2b, based on candidacy status. In Region 2a, all item pairs are candidates for both positive strong pairs and negative strong pairs. Therefore, we need to check whether a pair's correlation is above u, inserting it into L+ if true (Line 5), and whether it is below -u, inserting it into L- if true (Line 6). In Region 2b, all item pairs are candidates for positive strong pairs but not negative strong pairs. Therefore, we only need to check whether the correlation is above u and insert the pair into L+ if true (Line 11).
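One plausible closed form for f1 and f3, derived from the standard support-based bounds of φ, is sketched below. This is an assumption on our part rather than the paper's exact definition, but it is consistent with both boundary identities stated above: f3(s_A) = s_A at s_A = u/(1+u), and f1(s_A) = f3(s_A) at s_A = 0.5.

import math

def f1(s_a, u):
    """Assumed positive-candidacy lower bound on s_B: the smallest s_B for
    which the support-based upper bound of phi can still reach +u."""
    return (u**2 * s_a) / (1 - s_a + u**2 * s_a)

def f3(s_a, u):
    """Assumed negative-candidacy lower bound on s_B, derived analogously
    from the support-based lower bound of phi."""
    return (u**2 * (1 - s_a)) / (s_a + u**2 * (1 - s_a))

u = 0.5
assert math.isclose(f3(u / (1 + u), u), u / (1 + u))   # boundary of Region 1
assert math.isclose(f1(0.5, u), f3(0.5, u))            # boundary of Region 2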

Fig. 7. Search space for all strong pairs.


Procedure. RegionTwo

1 while a ≤ M and s_A ≤ 0.5 do
2   b ← a - 1;
    // Region 2a in Fig. 7
3   while s_B ≥ f3(s_A) do
4     find N_AB and compute φ_AB;
5     if φ_AB ≥ u then save pair {A, B} to L+;
6     if φ_AB ≤ -u then save pair {A, B} to L-;
7     b ← b - 1;
8   end
    // Region 2b in Fig. 7
9   while s_B ≥ f1(s_A) do
10    find N_AB and compute φ_AB;
11    if φ_AB ≥ u then save pair {A, B} to L+;
12    b ← b - 1;
13  end
14  a ← a + 1;
15 end

Similarly, the search continues into Regions 3, 4, and 5 (see Fig. 8). Each of these regions is further divided into subregions based on candidacy status. In particular, the boundary point between Regions 3 and 4 is the value of s_A such that f2(s_A) = s_A, and the boundary point between Regions 4 and 5 is the value of s_A such that f2(s_A) = f3(s_A).

Fig. 8. Pseudocode for Regions 3, 4, and 5.

ACKNOWLEDGMENTS

This research was partially supported by the National Science Foundation (NSF) via grant number IIS-1648664. Also, it was supported in part by the Natural Science Foundation of China (71329201, 71531001).

REFERENCES

[1] C. Alexander, Market Models: A Guide to Financial Data Analysis. Hoboken, NJ, USA: Wiley, 2001.
[2] H. V. Storch and F. W. Zwiers, Statistical Analysis in Climate Research. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[3] L. Duan, M. Khoshneshin, W. N. Street, and M. Liu, "Adverse drug effect detection," IEEE J. Biomed. Health Inform., vol. 17, no. 2, pp. 305–311, Mar. 2013.
[4] J. Cohen, P. Cohen, S. G. West, and L. S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd ed. Hillsdale, NJ, USA: Lawrence Erlbaum, 2002.
[5] W. P. Kuo, T. K. Jenssen, A. J. Butte, L. Ohno-Machado, and I. S. Kohane, "Analysis of matched mRNA measurements from two different microarray technologies," Bioinf., vol. 18, no. 3, pp. 405–412, 2002.
[6] S. Brin, R. Motwani, and C. Silverstein, "Beyond market baskets: Generalizing association rules to correlations," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1997, pp. 265–276.
[7] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining association rules between sets of items in large databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1993, pp. 207–216.
[8] X. Zhang, F. Pan, and W. Wang, "CARE: Finding local linear correlations in high dimensional data," in Proc. IEEE Int. Conf. Data Eng., 2008, pp. 130–139.
[9] G. Dong, J. Han, J. M. W. Lam, J. Pei, and K. Wang, "Mining multi-dimensional constrained gradients in data cubes," in Proc. Int. Conf. Very Large Data Bases, 2001, pp. 321–330.
[10] K. A. Bollen, "Political democracy: Conceptual and measurement traps," Stud. Comparative Int. Develop., vol. 25, no. 1, pp. 7–24, 1990.
[11] J. A. Delaney, L. Opatrny, J. M. Brophy, and S. Suissa, "Drug-drug interactions between antithrombotic medications and the risk of gastrointestinal bleeding," Can. Med. Assoc. J., vol. 177, no. 4, pp. 347–351, 2007.
[12] H. T. Reynolds, The Analysis of Cross-Classifications. New York, NY, USA: Free Press, 1977.
[13] H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar, "TAPER: A two-step approach for all-strong-pairs correlation query in large databases," IEEE Trans. Knowl. Data Eng., vol. 18, no. 4, pp. 493–508, Apr. 2006.
[14] W. Zhou and H. Xiong, "Efficient discovery of confounders in large data sets," in Proc. 9th IEEE Int. Conf. Data Mining, 2009, pp. 647–656.
[15] C. H. Wagner, "Simpson's paradox in real life," Amer. Statistician, vol. 36, no. 1, pp. 46–48, 1982.
[16] C. R. Blyth, "On Simpson's paradox and the sure-thing principle," J. Amer. Statistical Assoc., vol. 67, no. 338, pp. 364–366, 1972.
[17] K. Baba, R. Shibata, and M. Sibuya, "Partial correlation and conditional correlation as measures of conditional independence," Australian New Zealand J. Statist., vol. 46, no. 4, pp. 657–664, 2004.
[18] W. Zhou and H. Zhang, "Correlation range query for effective recommendations," World Wide Web, vol. 18, no. 3, pp. 709–729, 2015.
[19] H. Xiong, S. Shekhar, P. Tan, and V. Kumar, "Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 334–343.
[20] W. Zhou and H. Xiong, "Checkpoint evolution for volatile correlation computing," Mach. Learn., vol. 83, no. 1, pp. 103–131, 2011.
[21] W. Zhou and H. Zhang, "Correlation range query," in Proc. 14th Int. Conf. Web-Age Inf. Manage., 2013, pp. 490–501.
[22] W. Zhou and H. Xiong, "Volatile correlation computation: A checkpoint view," in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 848–856.



[23] J. Wu, S. Zhu, H. Liu, and G. Xia, "Cosine interesting pattern discovery," Inf. Sci., vol. 184, no. 1, pp. 176–195, 2012.
[24] P. Kralj, N. Lavrac, D. Gamberger, and A. Krstacic, "Contrast set mining for distinguishing between similar diseases," in Proc. 12th Conf. Artif. Intell. Med., 2007, pp. 109–118.
[25] S. Bay and M. Pazzani, "Detecting group differences: Mining contrast sets," Data Mining Knowl. Discovery, vol. 5, no. 3, pp. 213–246, 2001.
[26] S. D. Bay and M. J. Pazzani, "Detecting change in categorical data: Mining contrast sets," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1999, pp. 302–306.
[27] P. Kralj, N. Lavrac, D. Gamberger, and A. Krstacic, "Contrast set mining through subgroup discovery applied to brain ischaemia data," in Proc. Pacific-Asia Conf. Advances Knowl. Discovery Data Mining, 2007, pp. 579–586.
[28] G. I. Webb, S. M. Butler, and D. A. Newlands, "On detecting differences between groups," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 256–265.
[29] M. L. Samuels, "Simpson's paradox and related phenomena," J. Amer. Statistical Assoc., vol. 88, pp. 81–88, 1993.
[30] G. Dong and K. Deshpande, "Efficient mining of niches and set routines," in Proc. Pacific-Asia Conf. Advances Knowl. Discovery Data Mining, 2001, pp. 234–246.
[31] G. Dong and J. Li, "Efficient mining of emerging patterns: Discovering trends and differences," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1999, pp. 43–52.

Wenjun Zhou (S'11–M'12) received the BS degree from the University of Science and Technology of China (USTC), the MS degree from the University of Michigan, Ann Arbor, and the PhD degree from Rutgers University. She is currently an assistant professor in the Department of Business Analytics and Statistics, Haslam College of Business, University of Tennessee, Knoxville. Her general research interests include data mining, business analytics, and statistical computing. She was nominated the R. Stanley Bowden II Faculty Research Fellow in 2016 and the Roy & Audrey Fancher Faculty Research Fellow in 2017. She was the recipient of the Best Paper Award at WAIM 2013, the Best Student Paper Runner-Up Award at KDD 2008, and the Best Paper Runner-Up Award at ICTAI 2006. She was among the top five finalists for the ACM SIGKDD Doctoral Dissertation Award in 2011. She is a member of the IEEE.

Hui Xiong (SM'07) received the BE degree from the University of Science and Technology of China (USTC), the MS degree from the National University of Singapore, and the PhD degree from the University of Minnesota. He is currently a full professor and vice chair of the Management Science and Information Systems Department, and the director of the Center for Information Assurance, Rutgers University, where he received a two-year early promotion/tenure (2009), the Board of Trustees Research Fellowship for Scholarly Excellence (2009), and the ICDM-2011 Best Research Paper Award (2011). His general area of research is data and knowledge engineering, with a focus on developing effective and efficient data analysis techniques for emerging data-intensive applications. He has published prolifically in refereed journals and conference proceedings (four books, 70+ journal papers, and 100+ conference papers). He is a co-editor-in-chief of the Encyclopedia of GIS, and an associate editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE), the IEEE Transactions on Big Data, the ACM Transactions on Knowledge Discovery from Data (TKDD), and the ACM Transactions on Management Information Systems. He has served regularly on the organization and program committees of numerous conferences, including as a program co-chair of the Industrial and Government Track of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), a program co-chair of the IEEE 2013 International Conference on Data Mining (ICDM), and a general co-chair of the IEEE 2015 International Conference on Data Mining (ICDM). For his outstanding contributions to data mining and mobile computing, he was elected an ACM Distinguished Scientist in 2014. He is a senior member of the IEEE.

Lian Duan received the PhD degree in management sciences from the University of Iowa and the PhD degree in computer sciences from the Chinese Academy of Sciences. He is an assistant professor in the Department of Information Systems and Business Analytics at Hofstra University. His research interests include correlation analysis, health informatics, and social networks. He has published in the ACM Transactions on Knowledge Discovery from Data, KDD, ICDM, Information Systems, the Annals of Operations Research, etc.

Keli Xiao received the BS degree in computer science from the Wuhan University of Technology, in 2006, the MA degree in computer science from Queens College, City University of New York, in 2008, and the master's degree in quantitative finance and the PhD degree in management (finance) from Rutgers University, in 2009 and 2013, respectively. He is an assistant professor in the College of Business, Stony Brook University. His research interests include business analytics, data mining, quantitative finance, and economic bubbles. He has published papers in refereed journals and conference proceedings, such as the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Knowledge Discovery from Data, KDD, and ICDM.

Robert Mee received the BS degree in management science from the Georgia Institute of Technology, and the MS and PhD degrees in statistics from Iowa State University. He is currently a professor of business analytics and statistics at the University of Tennessee's Haslam College of Business and a lecture professor with Nankai University's Institute of Statistics in Tianjin, China. He is a fellow of the American Statistical Association and has authored more than 50 refereed journal articles. He is also the author of A Comprehensive Guide to Factorial Two-Level Experimentation (Springer, 2009).