Relative Linkage Disequilibrium: An intersection between evolution, algebraic statistics, text mining and contingency tables
Ron S. Kenett
KPA Ltd., Raanana, Israel and Department of Statistics and Applied Mathematics "Diego de Castro", University of Torino, Italy
in collaboration with
Silvia Salini
Department of Economics, Business and Statistics, University of Milan, Italy
Outline
• Evolution
• Contingency Tables
• Graphical Displays
• "Algebraic Statistics"
• Text Mining
Background 1930 - 1975
Evolution
Linkage Disequilibrium
R.A. Fisher, Sam Karlin
The Fundamental Theorem of Natural Selection: "the rate of increase of fitness of any organism at any time is equal to its genetic variance at that time"
• Fisher, R. A. (1930). The Genetical Theory of Natural Selection. Clarendon, Oxford, U.K.
• Lewontin, R. and Kojima, K. (1960). The evolutionary dynamics of complex polymorphisms. Evolution 14: 458-472.
• Karlin, S. and Feldman, M. (1970). Linkage and selection: Two locus symmetric viability model. Theoretical Population Biology 1: 39-71.
• Karlin, S. (1975). General two-locus models: some objectives, results and interpretations. Theoretical Population Biology 7: 364-398.
• Karlin, S. and Kenett, R. (1977). Variable Spatial Selection with Two Stages of Migration and Comparisons Between Different Timings. Theoretical Population Biology 11: 386-409.
Background - 1975
Contingency Tables
Discrete multivariate analysis: theory and practice (Bishop, Fienberg and Holland, 1975)
"The first comprehensive treatise on the analysis of categorical data using loglinear and related statistical models…"
Background - 1983
Graphical Display
Background - 2008
Algebraic Statistics
254 itemsets: the "e" relationship
Text Mining
Itemsets containing | Count
"female" and "infection" | 57
"female" but not "infection" | 40
"infection" but not "female" | 109
neither "female" nor "infection" | 48
The Data
Contingency Tables

e       RHS        ^RHS
LHS     57         40
^LHS    109        48

e       RHS        ^RHS
LHS     x1=.224    x2=.157
^LHS    x3=.429    x4=.189

d       RHS        ^RHS
LHS     x1=.389    x2=.056
^LHS    x3=.50     x4=.056

h       RHS        ^RHS
LHS     x1=.057    x2=.321
^LHS    x3=.189    x4=.434
“…a man who has carefully investigated a printed table, finds, when done, that he has only a very faint and partial idea of what he has read; and that like a figure imprinted on sand, is soon totally erased and defaced.”
William Playfair (1786), The Commercial and Political Atlas, from Edward R. Tufte (1983), The Visual Display of Quantitative Information.
Contingency Tables
The Simplex
Graphical Display
Vertices: x1, x2, x3, x4
$$\sum_{i=1}^{4} x_i = 1, \qquad 0 \le x_i, \quad i = 1, \dots, 4.$$
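As a rough illustration of the simplex constraint, the sketch below normalizes the four cell counts of a 2x2 table to proportions and places the resulting point inside a tetrahedron. The helper names and the particular vertex coordinates are assumptions made for this example; the slides do not say which embedding was used for the display.

```python
# Hypothetical vertex coordinates of a regular tetrahedron; vertex i stands
# for the table with x_i = 1. This embedding is an assumption, not the slides'.
VERTICES = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def to_proportions(counts):
    """Normalize four non-negative cell counts to proportions x1..x4."""
    if len(counts) != 4 or any(c < 0 for c in counts):
        raise ValueError("expected four non-negative cell counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("the table is empty")
    return [c / total for c in counts]


def simplex_point(x):
    """Barycentric combination of the vertices: sum_i x_i * v_i."""
    return tuple(sum(xi * v[k] for xi, v in zip(x, VERTICES)) for k in range(3))


# Cell counts of the "e" table from the slides, in the order x1, x2, x3, x4.
x = to_proportions([57, 40, 109, 48])
print(x)
print(simplex_point(x))
```

With this choice of vertices the uniform table (all cells equal to 1/4) lands at the centre of the tetrahedron, and each pure table lands on a vertex.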
Linkage Disequilibrium
"Algebraic Statistics"
Two loci, two alleles each, four genotypes: AB, Ab, aB, ab

        RHS    ^RHS
LHS     x1     x2
^LHS    x3     x4

$$\sum_{i=1}^{4} x_i = 1, \qquad 0 \le x_i, \quad i = 1, \dots, 4.$$
$$f = x_1 + x_3, \qquad g = x_1 + x_2, \qquad D = x_1 x_4 - x_2 x_3$$
$$
\begin{aligned}
x_1 &= fg + D \\
x_2 &= (1-f)\,g - D \\
x_3 &= f\,(1-g) - D \\
x_4 &= (1-f)(1-g) + D
\end{aligned}
$$
D can be extended to more dimensions…
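The decomposition can be checked numerically. The sketch below computes f, g and D from the cell proportions as defined on the slide and rebuilds the four cells from them; the function names are hypothetical and the data are the counts of the "e" table.

```python
def decompose(x1, x2, x3, x4):
    """Return (f, g, D) for a 2x2 table of proportions."""
    f = x1 + x3            # marginal of the RHS column
    g = x1 + x2            # marginal of the LHS row
    D = x1 * x4 - x2 * x3  # linkage disequilibrium
    return f, g, D


def recompose(f, g, D):
    """Rebuild (x1, x2, x3, x4) from the marginals and D."""
    return (
        f * g + D,
        (1 - f) * g - D,
        f * (1 - g) - D,
        (1 - f) * (1 - g) + D,
    )


x = (57 / 254, 40 / 254, 109 / 254, 48 / 254)  # the "e" table
f, g, D = decompose(*x)
rebuilt = recompose(f, g, D)
print(f, g, D)
print(rebuilt)
```

Under independence D is zero and each cell reduces to the product of its marginals, so D measures how far the table sits from the independence surface.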
Linkage Disequilibrium
"Algebraic Statistics"
An algebraic observation…
$$X = \mathbf{f} \otimes \mathbf{g} + D\, \mathbf{e} \otimes \mathbf{e}$$
where
$$X = (x_1, x_2, x_3, x_4), \qquad \mathbf{f} = (f, 1-f), \qquad \mathbf{g} = (g, 1-g), \qquad \mathbf{e} = (1, -1)$$
$$f = x_1 + x_3, \qquad g = x_1 + x_2$$
$$\sum_{i=1}^{4} x_i = 1, \qquad 0 \le x_i, \quad i = 1, \dots, 4.$$
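Written out as a 2x2 array, the tensor form reproduces the four cell equations. The layout below assumes, as in the slides' tables, that rows follow LHS / ^LHS (marginal g) and columns follow RHS / ^RHS (marginal f); the index convention of the tensor product is an interpretation, not something the slides spell out.

$$
\begin{pmatrix} x_1 & x_2 \\ x_3 & x_4 \end{pmatrix}
=
\begin{pmatrix} g \\ 1-g \end{pmatrix}
\begin{pmatrix} f & 1-f \end{pmatrix}
+ D
\begin{pmatrix} 1 \\ -1 \end{pmatrix}
\begin{pmatrix} 1 & -1 \end{pmatrix}
=
\begin{pmatrix}
fg + D & (1-f)\,g - D \\
f\,(1-g) - D & (1-f)(1-g) + D
\end{pmatrix}
$$

The entries of the second term sum to zero, so adding any multiple of it moves the table without changing its marginals.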
Relative Linkage Disequilibrium
"Algebraic Statistics"
$$
\mathrm{RLD} =
\begin{cases}
\dfrac{D}{D + x_3} & \text{if } D > 0 \text{ and } x_3 < x_2 \\[2ex]
\dfrac{D}{D + x_2} & \text{if } D > 0 \text{ and } x_3 \ge x_2 \\[2ex]
\dfrac{D}{D - x_1} & \text{if } D \le 0 \text{ and } x_1 < x_4 \\[2ex]
\dfrac{D}{D - x_4} & \text{if } D \le 0 \text{ and } x_1 \ge x_4
\end{cases}
$$
$$\mathrm{RLD} = \frac{D}{D_M}$$
D is the distance from the point corresponding to the contingency table in the simplex to the surface D = 0, in the e ⊗ e direction.
D_M is the distance from the point on the surface D = 0 that corresponds to the contingency table to the surface of the simplex, in the same e ⊗ e direction.
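The case analysis for RLD can be sketched in a few lines of Python. The function name and the handling of the degenerate case where the denominator vanishes are assumptions of this example; the two `min` calls are a compact way of writing the four branches of the definition.

```python
def rld(x1, x2, x3, x4):
    """Relative Linkage Disequilibrium of a 2x2 table of proportions."""
    D = x1 * x4 - x2 * x3
    if D > 0:
        # Moving along +e(x)e shrinks x2 and x3; the smaller one hits zero first.
        d_max = D + min(x2, x3)
    else:
        # Moving along -e(x)e shrinks x1 and x4; d_max carries the sign of D.
        d_max = D - min(x1, x4)
    if d_max == 0:
        # Assumption: report 0 when D = 0 and the limiting cell is already empty.
        return 0.0
    return D / d_max


# The "e" table from the slides, as proportions of 254 itemsets.
print(rld(57 / 254, 40 / 254, 109 / 254, 48 / 254))
```

For the "e" table this gives roughly 0.118, while a table with all mass on the diagonal (or on the anti-diagonal) gives 1.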
Graphical Display
[Figure: the simplex with the surface D = 0, showing the distances D and D_M; RLD = D / D_M]
RLD Example
Contingency Tables

h       RHS        ^RHS
LHS     x1=.057    x2=.321
^LHS    x3=.189    x4=.434

e       RHS        ^RHS
LHS     x1=.224    x2=.157
^LHS    x3=.429    x4=.189

d       RHS        ^RHS
LHS     x1=.389    x2=.056
^LHS    x3=.50     x4=.056
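As a worked illustration, RLD for table h can be computed by hand from the rounded proportions shown above. The resulting value is derived here from the definition and is not quoted from the slides.

$$
\begin{aligned}
D &= x_1 x_4 - x_2 x_3 = (0.057)(0.434) - (0.321)(0.189) = 0.024738 - 0.060669 = -0.035931 \\
\mathrm{RLD} &= \frac{D}{D - x_1} = \frac{-0.035931}{-0.035931 - 0.057} = \frac{0.035931}{0.092931} \approx 0.387
\end{aligned}
$$

The branch D / (D - x1) applies because D is negative and x1 = 0.057 is smaller than x4 = 0.434.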
Association Rules
• Association rules are one of the most popular unsupervised data mining methods used in applications such as Market Basket Analysis, to measure the associations between products purchased by each consumer, or in web clickstream analysis, to measure the association between the pages seen (sequentially) by a visitor of a site.
• Mining frequent itemsets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases. The structure of the data to be analyzed is typically referred to as transactional.
• Once obtained, the association rules extracted from a given dataset are compared in order to evaluate their relative importance. The measures commonly used to assess the strength of an association rule are support, confidence, and lift.
Text Mining
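To connect transactional data with the 2x2 tables used earlier, the sketch below counts, for a candidate rule LHS => RHS, how many transactions contain both sides, only the LHS, only the RHS, or neither. The function name and the toy baskets are made up for illustration and do not come from the groceries or text-mining data in the slides.

```python
def two_by_two(transactions, lhs, rhs):
    """Count the cells (LHS&RHS, LHS only, RHS only, neither) for rule lhs => rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n11 = n10 = n01 = n00 = 0
    for transaction in transactions:
        items = set(transaction)
        has_lhs = lhs <= items   # every LHS item is in the transaction
        has_rhs = rhs <= items   # every RHS item is in the transaction
        if has_lhs and has_rhs:
            n11 += 1
        elif has_lhs:
            n10 += 1
        elif has_rhs:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


# Toy, invented transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
    {"butter"},
]
table = two_by_two(baskets, {"bread"}, {"milk"})
print(table)
```

The four counts map onto x1, x2, x3, x4 after dividing by the number of transactions.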
Support
Text Mining
• The support for a rule A => B is obtained by dividing the number of transactions that satisfy the rule, N_{A=>B}, by the total number of transactions, N.
support {A=>B} = N_{A=>B} / N

        RHS    ^RHS
LHS     x1     x2      g
^LHS    x3     x4      1-g
        f      1-f     1

support {A=>B} = N_{A=>B} / N = x1
The higher the support, the stronger the evidence that both types of events occur together.
Confidence
Text Mining
• The confidence of the rule A => B is obtained by dividing the number of transactions that satisfy the rule, N_{A=>B}, by the number of transactions that contain the body of the rule, N_A.
confidence {A=>B} = N_{A=>B} / N_A

        RHS    ^RHS
LHS     x1     x2      g
^LHS    x3     x4      1-g
        f      1-f     1

confidence {A=>B} = N_{A=>B} / N_A = x1 / g
A high confidence that the LHS event leads to the RHS event suggests, but does not by itself establish, causation or statistical dependence.
Lift
Text Mining
• The lift of the rule A => B is the deviation, expressed as a ratio, of the support of the whole rule from the support expected under independence, given the supports of the LHS (A) and the RHS (B).
lift {A=>B} = confidence{A=>B} / support{B} = support{A=>B} / (support{A} · support{B})
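The three measures follow directly from the four cell counts of the rule's 2x2 table. The sketch below is one minimal way to compute them; the function name and the returned dictionary are choices made for this example, and the counts are those of the "e" table.

```python
def rule_measures(n11, n10, n01, n00):
    """Support, confidence and lift of LHS => RHS from 2x2 cell counts.

    n11: LHS and RHS, n10: LHS only, n01: RHS only, n00: neither.
    """
    n = n11 + n10 + n01 + n00
    if n11 + n10 == 0 or n11 + n01 == 0:
        raise ValueError("LHS and RHS must each occur at least once")
    support = n11 / n                  # x1
    support_lhs = (n11 + n10) / n      # g
    support_rhs = (n11 + n01) / n      # f
    confidence = support / support_lhs
    lift = confidence / support_rhs
    return {"support": support, "confidence": confidence, "lift": lift}


m = rule_measures(57, 40, 109, 48)
print(m)
```

A lift below 1, as obtained here, means the two sides occur together less often than independence would predict.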
Relative Linkage Disequilibrium and other measures
Contingency Tables

e       RHS    ^RHS
LHS     57     40
^LHS    109    48

Sup = 57/254 = .224
Conf = 57/97 = .588
Sup (RHS) = 166/254 = .654
Lift = .588/.654 = .90
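For comparison with the measures above, RLD for the same table can be worked out from the counts. The value below is derived here from the RLD definition and is not quoted from the slides.

$$
\begin{aligned}
D &= x_1 x_4 - x_2 x_3 = \frac{57 \cdot 48 - 40 \cdot 109}{254^2} = \frac{2736 - 4360}{64516} = \frac{-1624}{64516} \approx -0.0252 \\
\mathrm{RLD} &= \frac{D}{D - x_4} = \frac{1624}{1624 + 48 \cdot 254} = \frac{1624}{13816} \approx 0.118
\end{aligned}
$$

The branch D / (D - x4) applies because D is negative and x4 = 48/254 is smaller than x1 = 57/254.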
The groceries example
Association Rules
First 20 rules for groceries data, sorted by Lift
The groceries example
Association Rules
[Figure: plot of Relative Disequilibrium (RLD) versus Lift for the 430 rules of the groceries data set]
For "Lift" > 2.5, RLD varies between 1% and 40%
The groceries example
Association Rules
[Figure: plot of Relative Disequilibrium (RLD) versus Support for the 430 rules of the groceries data set]
RLD shows more variability than "support"
The groceries example
Association Rules
[Figure: plot of Relative Disequilibrium (RLD) versus Confidence for the 430 rules of the groceries data set]
For RLD of 20%, "Confidence" varies between 1% and 40%
The groceries example
Association Rules
[Figure: plot of RLD versus Chi-squared for the top 20 rules of the groceries data set, sorted by RLD]
[Figure: plot of RLD versus Odds Ratio for the top 20 rules of the groceries data set, sorted by RLD]
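The two comparison measures in these plots can be computed from the same 2x2 counts. The sketch below uses the textbook odds ratio and the Pearson chi-squared statistic without continuity correction; whether the slides used a corrected statistic is not stated, so that choice is an assumption, as are the function names.

```python
def odds_ratio(n11, n10, n01, n00):
    """Cross-product ratio (n11 * n00) / (n10 * n01); needs non-zero off-diagonal cells."""
    if n10 == 0 or n01 == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is zero")
    return (n11 * n00) / (n10 * n01)


def chi_squared(n11, n10, n01, n00):
    """Pearson chi-squared statistic for a 2x2 table, no continuity correction."""
    n = n11 + n10 + n01 + n00
    margins = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if margins == 0:
        raise ValueError("every row and column must have a positive total")
    return n * (n11 * n00 - n10 * n01) ** 2 / margins


# Counts of the "e" table from the slides.
print(odds_ratio(57, 40, 109, 48))
print(chi_squared(57, 40, 109, 48))
```

Both statistics depend on the same cross-product difference n11·n00 − n10·n01 that defines D, but they scale it differently from RLD.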
Summary
• RLD is intuitive
• RLD yields different answers from "usual" measures
• RLD can be extended to higher dimensions
• There are opportunities in considering the relationship between Association Rules and Contingency Tables
• Evolution
• Contingency Tables
• Graphical Displays
• Algebraic Statistics
• Text Mining
References
• Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. M.I.T. Press, Cambridge, MA. Paperback edition (1977). Reprinted by Springer-Verlag, New York (2007).
• Fisher, R.A. (1930). The Genetical Theory of Natural Selection. Clarendon, Oxford, U.K.
• Hahsler, M., Grün, B. and Hornik, K. (2005). arules – A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15):1–25. ISSN 1548-7660. URL http://www.jstatsoft.org/v14/i15/.
• Karlin, S. and Feldman, M. (1970). Linkage and selection: Two locus symmetric viability model. Theoretical Population Biology, 1:39–71.
• Karlin, S. and Kenett, R.S. (1977). Variable Spatial Selection with Two Stages of Migration and Comparisons Between Different Timings. Theoretical Population Biology, 11:386–409.
• Kenett, R. (1983). On an Exploratory Analysis of Contingency Tables. The Statistician, 32:395–403.
• Kenett, R. and Salini, S. (2008). Relative Linkage Disequilibrium: A New measure for association rules. UNIMI - Research Papers in Economics, Business, and Statistics. URL http://services.bepress.com/unimi/statistics/art32/
• Lewontin, R.C. and Kojima, K. (1960). The evolutionary dynamics of complex polymorphisms. Evolution, 14:458–472.
• Omiecinski, E. (2003). Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69.
• Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases, pages 229–248.
• Shimada, K., Hirasawa, K. and Hu, J. (2006). Association Rule Mining with Chi-Squared Test Using Alternate Genetic Network Programming. ICDM 2006.
• Tan, P.-N., Kumar, V. and Srivastava, J. (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4):293–313.
Backup slides
The telecom systems example
Association Rules
PBX No | Severity | Customer Type - English | EC2 | EC1 | ALARM1 | ALARM2 | ALARM3
90009 | 2 | High Tech | SEC08 | Security | NO_ALARM | NP | NP
90009 | 2 | High Tech | NTC09 | Network Communications | NO_ALARM | NP | NP
90009 | 2 | High Tech | SEC08 | Security | NO_ALARM | NP | NP
90009 | 2 | High Tech | SEC08 | Security | NO_ALARM | NP | NP
90021 | 2 | Municipalities | SEC08 | Security | NO_ALARM | NP | NP
90033 | 2 | Transportation | SFW05 | Software | PCM TIME SLOT | NP | NP
90033 | 3 | Transportation | INT04 | Interface | PCM TIME SLOT | NP | NP
90033 | 3 | Transportation | SEC05 | Security | PCM TIME SLOT | NP | NP
90038 | 2 | Municipalities | SFW05 | Software | NO_ALARM | NP | NP
The telecom systems example
Association Rules
[Figure: item frequency plot (support > 0.1) of the telecom data set]
[Figure: 3D simplex representation for 200 rules of the telecom data set and for the top 10 rules sorted by RLD]
[Figure: 2D simplex representation for the top 10 rules of the telecom data set, sorted by RLD]
[Table: top 10 rules of the telecom data set, sorted by RLD]
RLD Statistical Properties
Contingency Tables