
  • 10. Data Mining. Data Mining is one aspect of Database Query Processing (on the "what if" or pattern-and-trend querying end of Query Processing, rather than the "please find" end). To say it another way, data mining queries are on the ad hoc or unstructured end of the query spectrum, rather than the standard report generation or "retrieve all records matching a criterion" (SQL) side.

    Still, Data Mining queries ARE queries and are processed (or will eventually be processed) by a Database Management System the same way queries are processed today, namely:

    1. SCAN and PARSE (SCANNER-PARSER): A Scanner identifies the tokens or language elements of the DM query. The Parser checks for syntax or grammar validity.

    2. VALIDATED: The Validator checks for valid names and semantic correctness.

    3. CONVERTER converts to an internal representation.

    4. QUERY OPTIMIZATION: the Optimizer devises a strategy for executing the DM query (chooses among alternative internal representations of the query).

    5. CODE GENERATION: generates code to implement each operator in the selected DM query plan (the optimizer-selected internal representation).

    6. RUNTIME DATABASE PROCESSING: run the plan code.

    Developing new, efficient and effective Data Mining Query (DMQ) processors is the central need and issue in DBMS research today (far and away!).

    These notes concentrate on step 5, i.e., generating code (algorithms) to implement operators (at a high level):

    Association Rule Mining (ARM), Clustering (CLU), Classification (CLA)

  • Machine Learning is almost always based on Near Neighbor Set(s), NNS. Clustering, even density-based clustering, identifies near neighbor cores first (round NNSs about a center). Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity: for every ε>0 there is a δ>0 such that d(x,a)<δ implies d(f(x),f(a))<ε.
  • Query Processing and Optimization: Relational Queries to Data Mining. Most people have data from which they want information. So, most people need DBMSs whether they know it or not. A major component of any DBMS is the Data Mining Query Processor. Queries can range from structured to unstructured. On the Data Mining end, we have barely scratched the surface. But those scratches have already made the difference between becoming the world's biggest corporation (Walmart, which got into DM for supply chain management early) and filing for bankruptcy (Kmart, which didn't!).

  • Recall the Entity-Relationship (ER) Model's notion of a Relationship. Relationship: an association among 2 [or more; the number of entities is the degree] entities. The Graph of a Relationship: a degree=2 relationship between entities T and I generates a bipartite undirected graph (bipartite means that the node set is a disjoint union of two subsets and that all edges run from one subset to the other). A degree=2 relationship between an entity and itself, e.g., Employee Reports_To Employee, generates a uni-partite undirected graph. Relationships can have attributes too!

  • Association Rule Mining (ARM). In a relationship between entities, T is the set of Transactions an enterprise performs and I is the set of Items on which those transactions are performed. In Market Basket Research (MBR), a transaction is a checkout transaction and an item is an item in that customer's market basket (which gets checked out).

    An I-Association Rule, A⇒C, relates 2 disjoint subsets of I (Itemsets) and has 2 main measures, support and confidence (A is called the antecedent, C is called the consequent). There are also the dual concepts of T-association rules (just reverse the roles of T and I above). Examples of Association Rules include: the MBR relationship between customer cash-register transactions, T, and purchasable items, I (t is related to i iff i is being bought by that customer during that cash-register transaction).

    In Software Engineering (SE), the relationship between Aspects, T, and Code Modules, I (t is related to i iff module, i, is part of the aspect, t).

    In Bioinformatics, the relationship between experiments, T, and genes, I (t is related to i iff gene, i, expresses at a threshold level during experiment, t).

    In ER diagramming, any part-of relationship in which i∈I is part of t∈T (t is related to i iff i is part of t); and any ISA relationship in which i∈I ISA t∈T (t is related to i iff i IS A t) . . . The support of an I-set, A, is the fraction of T-instances related to every I-instance in A, e.g., if A={i1,i2} and C={i4} then supp(A) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5. Note: | | means set size, or count of elements in the set. I.e., T2 and T4 are the only transactions from the total transaction set, T={T1,T2,T3,T4,T5}, that are related to both i1 and i2 (buy i1 and i2 during the pertinent T-period of time). The support of the rule, A⇒C, is defined as supp(A∪C) = |{T2,T4}| / |{T1,T2,T3,T4,T5}| = 2/5. The confidence of the rule, A⇒C, is supp(A∪C) / supp(A) = (2/5) / (2/5) = 1. DM Queriers typically want STRONG RULES: supp ≥ minsupp, conf ≥ minconf (minsupp and minconf are threshold levels). Note that conf(A⇒C) is also just the conditional probability of t being related to C, given that t is related to A.
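    A minimal sketch of these two measures in Python (the item-to-transaction relation used below is assumed only for illustration, chosen so that t2 and t4 are the only transactions related to both i1 and i2, as in the example above):

        def supp(itemset, transactions):
            # fraction of transactions related to every item in the itemset
            return sum(1 for items in transactions.values() if itemset <= items) / len(transactions)

        def conf(A, C, transactions):
            # conditional probability of C given A: supp(A u C) / supp(A)
            return supp(A | C, transactions) / supp(A, transactions)

        transactions = {"t1": {"i3"}, "t2": {"i1", "i2", "i4"}, "t3": {"i2"},
                        "t4": {"i1", "i2", "i4"}, "t5": {"i5"}}
        A, C = {"i1", "i2"}, {"i4"}
        print(supp(A, transactions))      # 0.4  (= 2/5)
        print(supp(A | C, transactions))  # 0.4  (= 2/5)
        print(conf(A, C, transactions))   # 1.0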

  • Finding Strong Association Rules. The relationship between Transactions and Items can be expressed in a Transaction Table where each transaction is a row containing its ID and the list of the items related to that transaction. If minsupp is set by the querier at .5 and minconf at .75: to find frequent or Large itemsets (support ≥ minsupp), pseudocode (assume the items in Lk-1 are ordered):

    Step 1 (self-joining Lk-1):
        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
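    A runnable sketch of this self-join step (together with the standard Apriori prune of candidates having an infrequent (k-1)-subset; the prune is part of classical Apriori but not shown in the fragment above, and the function name is illustrative):

        from itertools import combinations

        def apriori_gen(L_prev, k):
            """Generate candidate k-itemsets Ck from the frequent (k-1)-itemsets L_prev."""
            L_prev = [tuple(sorted(s)) for s in L_prev]
            frequent = set(L_prev)
            Ck = set()
            for p in L_prev:
                for q in L_prev:
                    # self-join: first k-2 items equal, last item of p < last item of q
                    if p[:-1] == q[:-1] and p[-1] < q[-1]:
                        cand = p + (q[-1],)
                        # prune: every (k-1)-subset of the candidate must be frequent
                        if all(sub in frequent for sub in combinations(cand, k - 1)):
                            Ck.add(cand)
            return Ck

        # e.g., L2 = {{1,3},{2,3},{2,5},{3,5}} yields C3 = {(2,3,5)}
        print(apriori_gen([{1, 3}, {2, 3}, {2, 5}, {3, 5}], 3))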

  • R11

    P-tree Review: a data table, R(A1..An), containing horizontal structures (records) is processed vertically (vertical scans), then processed using multi-operand logical ANDs. Vertical basic binary Predicate-tree (P-tree): vertically partition the table; compress each vertical bit slice into a basic binary P-tree as follows. The basic binary P-tree, P1,1, for the bit column R11 (shown as 0 0 0 0 1 0 1 1 in the figure) is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity; a branch ends as soon as it is pure (e.g., pure0).

  • R11

    Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient. Bottom-up construction of P11 is done using in-order tree traversal and the collapsing of pure siblings, as follows (the 8 rows of the 12 bit columns R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43):
        0 1 0 1 1 1 1 1 0 0 0 1
        0 1 1 1 1 1 1 1 0 0 0 0
        0 1 0 1 1 0 1 0 1 0 0 1
        0 1 0 1 1 1 1 0 1 1 1 1
        1 0 1 0 1 0 0 0 1 1 0 0
        0 1 0 0 1 0 0 0 1 1 0 1
        1 1 1 0 0 0 0 0 1 1 0 0
        1 1 1 0 0 0 0 0 1 1 0 0

    [Figure: the basic P-tree P11 built bottom-up from bit column R11.]

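    A minimal sketch of building such a basic binary P-tree (P1-tree) from one bit column by recursive halving, collapsing a branch as soon as it is pure (the nested-list representation and names are illustrative, not the authors' implementation):

        def p1_tree(bits):
            """Basic P1-tree: 1 if the segment is pure1, 0 if pure0, else a node with two halves."""
            if all(b == 1 for b in bits):
                return 1                      # pure1: this branch ends
            if all(b == 0 for b in bits):
                return 0                      # pure0: this branch ends
            mid = len(bits) // 2
            return [p1_tree(bits[:mid]), p1_tree(bits[mid:])]

        R11 = [0, 0, 0, 0, 1, 0, 1, 1]
        print(p1_tree(R11))   # [0, [[1, 0], 1]]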
  • R(A1 A2 A3 A4):
        2 7 6 1
        6 7 6 0
        2 7 5 1
        2 7 5 7
        5 2 1 4
        2 2 1 5
        7 0 1 4
        7 0 1 4
    The 2^1 level has the only 1-bit, so the 1-count = 1*2^1 = 2. Processing efficiencies? (Prefixed leaf-sizes have been removed.)

  • Database D: Example ARM using uncompressed P-trees (note: the 1-count is placed at the root of each P-tree). The transaction table and its item bitmaps (items 1-5):

        TID | Items     | item bitmap (1 2 3 4 5)
        100 | 1 3 4     | 1 0 1 1 0
        200 | 2 3 5     | 0 1 1 0 1
        300 | 1 2 3 5   | 1 1 1 0 1
        400 | 2 5       | 0 1 0 0 1

  • 1-ItemSets don't support Association Rules (they would have no antecedent or no consequent). Are there any Strong Rules supported by Large 2-ItemSets (at minconf = .75)?
    {1,3}: conf({1}⇒{3}) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75  STRONG!
           conf({3}⇒{1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75
    {2,3}: conf({2}⇒{3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75
           conf({3}⇒{2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75
    {2,5}: conf({2}⇒{5}) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75  STRONG!
           conf({5}⇒{2}) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75  STRONG!
    {3,5}: conf({3}⇒{5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75
           conf({5}⇒{3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75
    Are there any Strong Rules supported by Large 3-ItemSets?
    {2,3,5}: conf({2,3}⇒{5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75  STRONG!
             conf({2,5}⇒{3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
             conf({3,5}⇒{2}) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75
    No smaller-subset antecedent can yield a strong rule either (i.e., no need to check conf({2}⇒{3,5}) or conf({5}⇒{2,3}), since both denominators will be at least as large and therefore both confidences will be at least as low); likewise no need to check conf({3}⇒{2,5}). DONE! 2-Itemsets do support ARs. (The frequent itemset tables L1, L2 and L3 are shown below.)

    L1 (frequent 1-itemsets):
        itemset   sup
        {1}       2
        {2}       3
        {3}       3
        {5}       3

    L2 (frequent 2-itemsets):
        itemset   sup
        {1 3}     2
        {2 3}     2
        {2 5}     3
        {3 5}     2

    L3 (frequent 3-itemsets):
        itemset   sup
        {2 3 5}   2
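    A small end-to-end sketch that reproduces these tables and the strong rules for database D by straightforward counting (the logic only, not the P-tree implementation):

        from itertools import combinations

        D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
        minsupp, minconf = 0.5, 0.75

        def supp(itemset):
            return sum(1 for t in D.values() if itemset <= t) / len(D)

        # frequent itemsets of every size (brute force is fine for 5 items)
        items = sorted(set().union(*D.values()))
        frequent = [set(c) for k in range(1, len(items) + 1)
                    for c in combinations(items, k) if supp(set(c)) >= minsupp]
        print(frequent)   # {1},{2},{3},{5},{1,3},{2,3},{2,5},{3,5},{2,3,5}

        # strong rules A => C from each frequent itemset of size >= 2
        for F in (f for f in frequent if len(f) >= 2):
            for r in range(1, len(F)):
                for A in map(set, combinations(sorted(F), r)):
                    if supp(F) / supp(A) >= minconf:
                        print(A, "=>", F - A)   # prints {1}=>{3}, {2}=>{5}, {5}=>{2}, {2,3}=>{5}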

  • P-tree ARM (P-ARM) versus Apriori on aerial photo (RGB) data together with yield data. Scalability with support threshold: 1320 x 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000). P-ARM is compared to horizontal Apriori (classical) and FP-growth (an improvement of it). In P-ARM we find all frequent itemsets, not just those containing Yield (for fairness). Aerial TIFF images (R,G,B) with synchronized yield (Y). Scalability with number of transactions: identical results. P-ARM is more scalable for lower support thresholds, and the P-ARM algorithm is more scalable to large spatial datasets.

    [Embedded chart and spreadsheet data omitted: run time (sec) of P-ARM vs. Apriori as the support threshold varies from 10% to 90%; run time vs. number of bits considered at minimum support 20% and 10% (P-tree vs. FP-tree); total frequent itemsets per band; and run time vs. number of transactions (100K to 1700K) at support threshold = 10%.]

  • P-ARM versus FP-growth (see the literature for the definition). Scalability with support threshold: 17,424,000 pixels (transactions). Scalability with number of transactions: FP-growth is an efficient, tree-based frequent pattern mining method (details later). For a dataset of 100K bytes, FP-growth runs very fast. But for images of large size, P-ARM achieves better performance. P-ARM also achieves better performance in the case of a low support threshold.

    [Embedded chart and spreadsheet data omitted: run time (sec) of P-ARM vs. FP-growth as the support threshold varies from 10% to 90%; run time vs. number of bits considered at minimum support 20% and 10%; total frequent itemsets per band/bit; and run time vs. number of transactions (100K to 1700K) at support threshold = 10%.]

  • Other methods (other than FP-growth) to improve Apriori's efficiency (see the literature or the html notes 10datamining.html in Other Materials for more detail):
    Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
    Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
    Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
    Sampling: mining on a subset of the given data with a lower support threshold, plus a method to determine the completeness.
    Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
    The core of the Apriori algorithm: use only large (k-1)-itemsets to generate candidate large k-itemsets; use a database scan and pattern matching to collect counts for the candidate itemsets.
    The bottleneck of Apriori is candidate generation: 1. Huge candidate sets: 10^4 large 1-itemsets may generate 10^7 candidate 2-itemsets; to discover a large pattern of size 100, e.g., {a1..a100}, we would need to generate 2^100 ≈ 10^30 candidates. 2. Multiple scans of the database: it needs (n+1) scans, where n = length of the longest pattern.

  • Classification, in 3 steps: build a Model of the TDS feature-to-class relationship, Test that Model, Use the Model (to predict the most likely class of each unclassified sample). (Note: other names for this process include regression analysis, case-based reasoning, ...). Other typical applications: Targeted Product Marketing (the so-called classical Business Intelligence problem); Medical Diagnosis (so-called Computer Aided Diagnosis, or CAD). Nearest Neighbor Classifiers (NNCs) use a portion of the TDS as the model (neighboring tuples vote); finding the neighbor set is much faster than building other models, but it must be done anew for each unclassified sample. (NNC is called a lazy classifier because it gets lazy and doesn't take the time to build a concise model of the relationship between feature tuples and class labels ahead of time.) Eager Classifiers (~all other classifiers) build one concise model once and for all, then use it for all unclassified samples. The model building can be very costly, but that cost can be amortized over the classification of a large number of unclassified samples (e.g., all RGB points in a field). Classification uses a Training Data Set (TDS) in which each feature tuple is already classified (has a class value attached to it in the class column, called its class label): 1. Build a model of the TDS (the TRAINING PHASE). 2. Use that model to classify unclassified feature tuples (unclassified samples); e.g., TDS = last year's aerial image of a crop field (feature columns are the R,G,B columns together with last year's crop yields attached in a class column, e.g., class values = {Hi, Med, Lo} yield); the unclassified samples are the RGB tuples from this year's aerial image. 3. Predict the class of each unclassified tuple (in the example: predict the yield for each point in the field).

  • Eager Classifiers (example). Training set:

    NAME   RANK            YEARS   TENURED
    Mike   Assistant Prof  3       no
    Mary   Assistant Prof  7       yes
    Bill   Professor       2       yes
    Jim    Associate Prof  7       yes
    Dave   Assistant Prof  6       no
    Anne   Associate Prof  3       no

  • Test Process (2): usually some of the Training Tuples are set aside as a Test Set and, after a model is constructed, the Test Tuples are run through the Model. The Model is acceptable if, e.g., the % of correct classifications > 60%. If not, the Model is rejected (never used). Test set:
    NAME     RANK            YEARS   TENURED
    Tom      Assistant Prof  2       no
    Merlisa  Associate Prof  7       no
    George   Associate Prof  5       yes
    Joseph   Assistant Prof  7       no
    Correct = 3, Incorrect = 1, so 75% correct. Since 75% is above the acceptability threshold, accept the model!

  • Classification by Decision Tree Induction. Decision tree (instead of a simple case statement of rules, the rules are prioritized into a tree): each internal node denotes a test or rule on an attribute (the test attribute for that node); each branch represents an outcome of the test (a value of the test attribute); leaf nodes represent class label decisions (the plurality class of a leaf is the predicted class).

    Decision tree model development consists of two phases. Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes. Tree pruning: identify and remove branches that reflect noise or outliers.

    Decision tree use: classify unclassified samples by filtering them down the decision tree to their proper leaf, then predict the plurality class of that leaf (often only one class, depending upon the stopping condition of the construction phase).

  • Algorithm for Decision Tree Induction. Basic ID3 algorithm (a simple greedy top-down algorithm):

    At start, the current node is the root and all the training tuples are at the root

    Repeat, down each branch, until the stopping condition is true

    At the current node, choose a decision attribute (e.g., the one with the largest information gain). Each value of that decision attribute is associated with a link to the next level down, and that value is used as the selection criterion of that link. Each new level produces a partition of the parent training subset based on the selection value assigned to its link.

    Stopping conditions: when all samples for a given node belong to the same class; when there are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf); when there are no samples left. A sketch of the algorithm is given below.
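    A minimal ID3-style sketch of the above (entropy-based information gain for attribute selection and recursive partitioning; the dataset layout is illustrative, with the tenure table's YEARS column binned at 6 for the example):

        from collections import Counter
        from math import log2

        def entropy(rows, class_col):
            counts = Counter(r[class_col] for r in rows)
            return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())

        def id3(rows, attrs, class_col):
            classes = [r[class_col] for r in rows]
            if len(set(classes)) == 1:              # all samples belong to the same class
                return classes[0]
            if not attrs:                           # no attributes left: majority vote
                return Counter(classes).most_common(1)[0][0]
            def gain(a):                            # information gain of splitting on a
                rem = sum(len(sub) / len(rows) * entropy(sub, class_col)
                          for v in set(r[a] for r in rows)
                          for sub in [[r for r in rows if r[a] == v]])
                return entropy(rows, class_col) - rem
            best = max(attrs, key=gain)             # decision attribute with the largest gain
            tree = {}
            for v in set(r[best] for r in rows):    # one link per observed value of the attribute
                subset = [r for r in rows if r[best] == v]
                tree[(best, v)] = id3(subset, [a for a in attrs if a != best], class_col)
            return tree                             # (branching only on observed values, so the
                                                    #  "no samples left" case never arises here)

        rows = [{"rank": "Assistant Prof", "years": "<=6", "tenured": "no"},
                {"rank": "Assistant Prof", "years": ">6",  "tenured": "yes"},
                {"rank": "Professor",      "years": "<=6", "tenured": "yes"},
                {"rank": "Associate Prof", "years": ">6",  "tenured": "yes"},
                {"rank": "Assistant Prof", "years": "<=6", "tenured": "no"},
                {"rank": "Associate Prof", "years": "<=6", "tenured": "no"}]
        print(id3(rows, ["rank", "years"], "tenured"))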

  • Bayesian Classification (eager: the model is based on conditional probabilities; prediction is done by taking the most conditionally probable class). A Bayesian classifier is a statistical classifier, based on the following theorem, known as Bayes' theorem:

    Bayes' theorem: let X be a data sample whose class label is unknown. Let H be the hypothesis that X belongs to class H. P(H|X) is the conditional probability of H given X, and P(H) is the (prior) probability of H. Then

    P(H|X) = P(X|H)P(H)/P(X)

  • Naïve Bayesian Classification. Given a training set, R(f1..fn, C), where C={C1..Cm} is the class label attribute:

    A Naïve Bayesian Classifier will predict the class of an unknown data sample, X=(x1..xn), to be the class Cj having the highest conditional probability, conditioned on X. That is, it will predict the class to be Cj iff P(Cj|X) ≥ P(Ci|X) for all i ≠ j (a tie-handling algorithm may be required).

    From Bayes' theorem: P(Cj|X) = P(X|Cj)P(Cj)/P(X).

    P(X) is constant for all classes, so we need only maximize P(X|Cj)P(Cj); the P(Cj)'s are known (estimated from the training set). To reduce the computational complexity of calculating all the P(X|Cj)'s, the naïve assumption is class conditional independence: P(X|Cj) is the product of the P(xi|Cj)'s.
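    A minimal sketch of this prediction rule for categorical features (counting-based estimates of P(Cj) and P(xi|Cj); the add-one smoothing used here is an added assumption to avoid zero probabilities and is not on the slide):

        from collections import Counter, defaultdict

        def nb_train(rows, class_col):
            """Estimate class priors and per-class feature-value counts."""
            prior = Counter(r[class_col] for r in rows)
            cond = defaultdict(Counter)                     # (class, feature) -> value counts
            for r in rows:
                for f, v in r.items():
                    if f != class_col:
                        cond[(r[class_col], f)][v] += 1
            return prior, cond, len(rows)

        def nb_predict(x, prior, cond, n):
            best, best_p = None, -1.0
            for c, pc in prior.items():
                p = pc / n                                  # P(Cj)
                for f, v in x.items():                      # naive independence: multiply P(xi|Cj)
                    counts = cond[(c, f)]
                    p *= (counts[v] + 1) / (pc + len(counts) + 1)   # add-one smoothed estimate
                if p > best_p:
                    best, best_p = c, p
            return best

        rows = [{"rank": "Assistant Prof", "years": ">6",  "tenured": "yes"},
                {"rank": "Assistant Prof", "years": "<=6", "tenured": "no"},
                {"rank": "Professor",      "years": "<=6", "tenured": "yes"},
                {"rank": "Associate Prof", "years": "<=6", "tenured": "no"}]
        prior, cond, n = nb_train(rows, "tenured")
        print(nb_predict({"rank": "Professor", "years": ">6"}, prior, cond, n))   # "yes"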

  • Neural Network Classification. A Neural Network is trained to make the prediction. Advantages: prediction accuracy is generally high; it is generally robust (works when training examples contain errors); the output may be discrete, real-valued, or a vector of several discrete or real-valued attributes; it provides fast classification of unclassified samples. Criticisms: it is difficult to understand the learned function (it involves complex and almost magic weight adjustments); it is difficult to incorporate domain knowledge; long training time (for large training sets it is prohibitive!).

  • A Neuron: the input feature vector x=(x0..xn) is mapped into the variable y by means of the scalar product with a weight vector, a bias, and a nonlinear mapping function f (called the damping function): y = f(w·x + bias).

  • Neural Network Training. The ultimate objective of training is to obtain a set of weights that makes almost all the tuples in the training data classify correctly (usually using a time-consuming "back propagation" procedure which is based, ultimately, on Newton's method; see the literature or Other Materials - 10datamining.html for examples and alternate training techniques). Steps: initialize the weights with random values; feed the input tuples into the network; for each unit, compute the net input to the unit as a linear combination of all the inputs to the unit, compute the output value using the activation function, compute the error, and update the weights and the bias. A sketch of one such update loop is given below.
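    A minimal sketch of these steps for a single sigmoid unit trained by gradient descent (one neuron rather than the multi-layer network of the next slide; the learning rate, epoch count and the AND example are illustrative):

        import math, random

        def train_neuron(samples, epochs=5000, lr=0.5):
            """samples: list of (input_vector, target in {0,1}); returns (weights, bias)."""
            n = len(samples[0][0])
            w = [random.uniform(-0.5, 0.5) for _ in range(n)]    # initialize weights randomly
            b = random.uniform(-0.5, 0.5)                        # and the bias
            for _ in range(epochs):
                for x, t in samples:
                    net = sum(wi * xi for wi, xi in zip(w, x)) + b   # net input: linear combination
                    out = 1.0 / (1.0 + math.exp(-net))               # output via activation (sigmoid)
                    err = (t - out) * out * (1.0 - out)              # error term (sigmoid derivative)
                    w = [wi + lr * err * xi for wi, xi in zip(w, x)] # update the weights
                    b += lr * err                                    # update the bias
            return w, b

        data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # learn the AND function
        w, b = train_neuron(data)
        for x, t in data:
            out = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            print(x, round(out, 2), "target", t)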

  • Neural Multi-Layer Perceptron. [Figure: input vector xi feeding input nodes, hidden nodes with weights wij, and output nodes producing the output vector.]

  • These next 3 slides treat the concept of Distance in great detail. You may feel you don't need this much detail; if so, skip what you don't need. For Nearest Neighbor Classification, a distance is needed (to make sense of "nearest"; other classifiers also use distance). A distance is a function, d, applied to two n-dimensional points X and Y, such that: d(X,Y) is positive definite: if X ≠ Y, d(X,Y) > 0, and if X = Y, d(X,Y) = 0; d(X,Y) is symmetric: d(X,Y) = d(Y,X); d(X,Y) satisfies the triangle inequality: d(X,Y) + d(Y,Z) ≥ d(X,Z).

  • An example in a two-dimensional space: d1 ≥ d2 ≥ d∞ always (Manhattan ≥ Euclidean ≥ max).

  • Neighborhoods of a Point. A Neighborhood (disk neighborhood) of a point, T, is a set of points, S, such that:

    X ∈ S iff d(T, X) ≤ r. If X is a point on the boundary, d(T, X) = r.
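    A small sketch of the three distances used in these slides (Manhattan d1, Euclidean d2, max d∞) and of the disk-neighborhood test (function names are illustrative):

        def d1(X, Y):                      # Manhattan distance
            return sum(abs(x - y) for x, y in zip(X, Y))

        def d2(X, Y):                      # Euclidean distance
            return sum((x - y) ** 2 for x, y in zip(X, Y)) ** 0.5

        def dmax(X, Y):                    # max (L-infinity) distance
            return max(abs(x - y) for x, y in zip(X, Y))

        def in_disk(T, X, r, d=d2):        # X is in the disk neighborhood of T of radius r
            return d(T, X) <= r

        T, X = (0, 0), (3, 4)
        print(d1(T, X), d2(T, X), dmax(T, X))   # 7  5.0  4   (so d1 >= d2 >= dmax here)
        print(in_disk(T, X, 5))                 # True: X lies exactly on the boundary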

  • Classical k-Nearest Neighbor Classification. Select a suitable value for k (how many Training Data Set (TDS) neighbors do you want to vote on the best predicted class for the unclassified feature sample?).

    Determine a suitable distance metric (to give meaning to neighbor)

    Find the k nearest training set points to the unclassified sample. Let them vote (tally up the counts of TDS neighbors for each class).

    Predict the class with the highest vote (the plurality class) from among the k-nearest neighbor set; a sketch follows below.
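    A minimal sketch of classical kNN voting (Euclidean distance, plurality vote; names and the toy training set are illustrative):

        from collections import Counter

        def euclid(X, Y):
            return sum((x - y) ** 2 for x, y in zip(X, Y)) ** 0.5

        def knn_classify(sample, training, k, d=euclid):
            """training: list of (feature_tuple, class_label); return the plurality class of the k nearest."""
            ranked = sorted(training, key=lambda tc: d(sample, tc[0]))
            votes = Counter(label for _, label in ranked[:k])
            return votes.most_common(1)[0][0]

        TDS = [((1, 1), "Lo"), ((2, 1), "Lo"), ((8, 8), "Hi"), ((9, 7), "Hi"), ((8, 9), "Hi")]
        print(knn_classify((7, 8), TDS, k=3))   # "Hi"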

  • Closed-KNN. Example: assume 2 features (one in the x-direction and one in the y-direction). T is the unclassified sample. Using k = 3, find the three nearest neighbors. KNN arbitrarily selects one point from the boundary line shown; Closed-KNN includes all points on the boundary.

    Closed-KNN yields higher classification accuracy than traditional KNN (thesis of MD Maleq Khan, NDSU, 2001).

    The P-tree method always produces closed neighborhoods (and is faster!).

  • k-Nearest Neighbor (kNN) Classification and Closed-k-Nearest Neighbor (CkNN) Classification: 1) Select a suitable value for k. 2) Determine a suitable distance or similarity notion. 3) Find the k nearest neighbor set [closed] of the unclassified sample. 4) Find the plurality class in the nearest neighbor set. 5) Assign the plurality class as the predicted class of the sample T. CkNN yields higher classification accuracy than traditional kNN. At what additional cost? Actually, at negative cost (faster and more accurate!!). T is the unclassified sample; use Euclidean distance; k = 3: find the 3 closest neighbors by moving out from T until 3 neighbors are reached. kNN arbitrarily selects one point from the final boundary line as the 3rd nearest neighbor, whereas CkNN includes all points on that boundary line. (The interjections "That's 1!", "That's 2!", "That's more than 3!" refer to counting neighbors in the accompanying figure.)
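    A minimal sketch of the closed variant: instead of cutting the neighbor list off at exactly k, every training point whose distance does not exceed the k-th smallest distance gets to vote, so boundary ties are included (names and the toy data are illustrative):

        from collections import Counter

        def euclid(X, Y):
            return sum((x - y) ** 2 for x, y in zip(X, Y)) ** 0.5

        def cknn_classify(sample, training, k, d=euclid):
            """Closed-kNN: all points at distance <= the k-th nearest distance vote."""
            dists = sorted(d(sample, x) for x, _ in training)
            cutoff = dists[k - 1]                            # distance of the k-th nearest neighbor
            votes = Counter(label for x, label in training if d(sample, x) <= cutoff)
            return votes.most_common(1)[0][0]

        # four training points lie on the same boundary circle around (0,0); all four vote in CkNN
        TDS = [((0, 1), "Lo"), ((1, 0), "Hi"), ((0, -1), "Hi"), ((-1, 0), "Hi"), ((5, 5), "Lo")]
        print(cknn_classify((0, 0), TDS, k=3))   # "Hi" (3 of the 4 boundary points are Hi)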

  • The slides numbered 28 through 93 give great detail on the relative performance of kNN and CkNN, on the use of other distance functions, some examples, etc. There may be more detail on these issues than you want/need; if so, just scan for what you are most interested in, or skip ahead to slide 94 on CLUSTERING. Experiments used two sets of (aerial) Remotely Sensed Images of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.

    The data contains 6 bands: Red, Green, Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label). Band values range from 0 to 255 (8 bits). We consider 8 classes or levels of yield values.

  • Performance - Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate), 1997 dataset. [Chart omitted: Accuracy (%) vs. Training Set Size (no. of pixels, 256 to 262144) for kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, Closed-kNN using HOBbit distance.]

  • Performance - Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate), 1998 dataset. [Chart omitted: Accuracy (%) vs. Training Set Size (no. of pixels, 256 to 262144) for the same six methods.]

  • Performance - Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest), 1997 dataset; both axes on a logarithmic scale. [Chart omitted: Per-Sample Classification Time (sec) vs. Training Set Size (no. of pixels, 256 to 262144) for the same six methods.] Hint: NEVER use a log scale to show a WIN!!!

  • Performance - Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest), 1998 dataset; both axes on a logarithmic scale. [Chart omitted: Per-Sample Classification Time (sec) vs. Training Set Size (no. of pixels, 256 to 262144) for the same six methods.] A win-win situation!! (almost never happens).

    P-tree CkNN and CkNN-H are more accurate and much faster.

    kNN-H is not recommended because it is slower and less accurate (it doesn't use closed neighbor sets, and it requires another step to get rid of ties; why do it?).

    Horizontal kNNs are not recommended because they are less accurate and slower!

  • Key a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20t12 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1t13 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1t15 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0t16 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0t21 0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1t27 0 1 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0t31 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1t32 0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1t33 0 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1t35 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0t51 0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1t53 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1t55 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0t57 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0t61 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1t72 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1t75 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0t12 0 0 1 0 1 1 0 2t13 0 0 1 0 1 0 0 1t15 0 0 1 0 1 0 1 2t53 0 0 0 0 1 0 0 10 1WALK THRU: 3NN CLASSIFICATION of an unclassified sample, a=(a5 a6 a11a12a13a14 )=(000000).HORIZONTAL APPROACH ( relevant attributes are Note only 1 of many training tuple at a distance=2 from the sample got to vote. We didnt know that distance=2 was going to be the vote cutoff until the end of the 1st scan.Finding the other distance=2 voters (Closed 3NN set or C3NN) requires another scan.a5 a6 a11 a12 a13 a14 )

  • Key a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20t12 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1t13 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1t15 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0t16 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0t21 0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1t27 0 1 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0t31 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1t32 0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1t33 0 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1t35 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0t51 0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1t53 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1t55 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0t57 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0t61 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1t72 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1t75 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0WALK THRU of required 2nd scan to find Closed 3NN set. Does it change vote?Vote after 1st scan.YES! C=0 wins now!

  • C

    00000000001111111C

    11111111110000000WALK THRU: Closed 3NNC using P-treesa2011001011101100110keyt12t13t15t16t21t27t31t32t33t35t51t53t55t57t61 t72t75a111110000000000100a2 00001111111111000a311111100000000111a4000000000 01111011a500001111110000100a600001100000000000a711110000001111011a811110000001111011a900000011110000100

    C11111111110000000a1100011010001000100a12 11100001110110011a13 10011111001001110a14 00100100010011001a15 110100011001000 10a1610000001000000010a17 00101110011011101a1800101110011011101a1901010000100100000Ps00000000000000000a14

    11011011101100110a13

    01100000110110001a12

    00011110001001100a11

    11100101110111011a6

    11110011111111111a51111000000111 1011No neighbors at distance=0First let all training points at distance=0 vote, then distance=1, then distance=2, ... until 3 For distance=0 (exact matches) constructing the P-tree, Ps then AND with PC and PC to compute the vote. (black denotes complement, red denotes uncomplemented

  • C

    00000000001111111C

    11111111110000000a2011001011101100110keyt12t13t15t16t21t27t31t32t33t35t51t53t55t57t61 t72t75a111110000000000100a2 00001111111111000a311111100000000111a4000000000 01111011a500001111110000100a600001100000000000a711110000001111011a811110000001111011a900000011110000100a10 =C11111111110000000a1100011010001000100a12 11100001110110011a13 10011111001001110a14 00100100010011001a15 110100011001000 10a1610000001000000010a17 00101110011011101a1800101110011011101a1901010000100100000PD(s,1)01000000000100000a14

    11011011101100110a13

    01100000110110001a12

    00011110001001100a11

    11100101110111011a6

    11110011111111111a500001111110000100Construct Ptree, PS(s,1) = OR Pi = P|si-ti|=1; |sj-tj|=0, ji

    = OR PS(si,1) S(sj,0) j{5,6,11,12,13,14}-{i}i=5,6,11,12,13,14i=5,6,11,12,13,14WALK THRU: C3NNCdistance=1 nbrs:

  • keyt12t13t15t16t21t27t31t32t33t35t51t53t55t57t61 t72t75a500001111110000100a600001100000000000a10 C11111111110000000a1100011010001000100a12 11100001110110011a13 10011111001001110a14 00100100010011001P5,6P5,11P5,12P5,13P5,14P6,11P6,12P6,13P6,14P11,12P11,13P11,14P12,13P12,14We now have 3 nearest nbrs. We could quite and declare C=1 winner?P13,14We now have the C3NN set and we can declare C=0 the winner!WALK THRU: C3NNCdistance=2 nbrs:

  • In the previous example, there were no exact matches (dis=0 neighbors or similarity=6 neighbors) for the sample.

    Two neighbors were found at a distance of 1 (dis=1 or sim=5), and nine at dis=2, sim=4.

    All 11 neighbors got equal votes, even though the two sim=5 neighbors are much closer than the nine sim=4 neighbors. Also, processing for the 9 is costly.

    A better approach would be to weight each vote by the similarity of the voter to the sample. (We will use a vote weight function which is linear in the similarity; admittedly, a better choice would be a function which is Gaussian in the similarity, but, so far, it has been too hard to compute.)

    As long as we are weighting votes by similarity, we might as well weight attributes by relevance too (assuming some attributes are more relevant than others; e.g., the relevance weight of a feature attribute could be its correlation with the class label).

    P-trees accommodate this method very well (in fact, a variation on this theme won the KDD-Cup competition in '02: http://www.biostat.wisc.edu/~craven/kddcup/ ).

  • Association for Computing Machinery KDD-Cup-02, NDSU Team.

  • Closed Manhattan Nearest Neighbor Classifier (uses a linear function of Manhattan similarity). The sample is (000000); the attribute weights of the relevant attributes are their subscripts. key: t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75; a5 = 00001111110000100; a6 = 00001100000000000

    C11111111110000000a1100011010001000100a12 11100001110110011a13 10011111001001110a14 00100100010011001a14

    11011011101100110a13

    01100000110110001a12

    00011110001001100a11

    11100101110111011a6

    11110011111111111a51111000000111 1011 (black is the attribute complement, red is uncomplemented). The vote is even simpler than in the "equal" vote case. We just note that all tuples vote in accordance with their weighted similarity (if the ai value differs from that of (000000), then the vote contribution is the subscript of that attribute, else zero). Thus, we can just add up the root counts of each relevant attribute, weighted by their subscripts.
    Class=1 root counts: rc(PC^Pa5)=4, rc(PC^Pa6)=8, rc(PC^Pa11)=7, rc(PC^Pa12)=4, rc(PC^Pa13)=4, rc(PC^Pa14)=7.
    C=1 vote is: 343 = 4*5 + 8*6 + 7*11 + 4*12 + 4*13 + 7*14.
    Similarly, C=0 vote is: 258 = 6*5 + 7*6 + 5*11 + 3*12 + 3*13 + 4*14.
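    A tiny sketch of this weighted-vote tally from the root counts (the root-count values are the ones on the slide; the helper name is illustrative):

        def weighted_vote(root_counts, weights):
            # each relevant attribute contributes (root count) * (its subscript weight)
            return sum(rc * w for rc, w in zip(root_counts, weights))

        weights = [5, 6, 11, 12, 13, 14]                   # subscripts of a5, a6, a11, a12, a13, a14
        print(weighted_vote([4, 8, 7, 4, 4, 7], weights))  # 343  (C=1 vote)
        print(weighted_vote([6, 7, 5, 3, 3, 4], weights))  # 258  (C=0 vote)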

  • We note that the Closed Manhattan NN Classifier uses an influence function which is pyramidal. It would be much better to use a Gaussian influence function, but that is much harder to implement. One generalization of this method to the case of integer values rather than Boolean would be to weight each bit position in a more Gaussian shape (i.e., weight the bit positions b, b-1, ..., 0 (high order to low order) using Gaussian weights). By so doing, at least within each attribute, influences are Gaussian.

    We can call this method, Closed Manhattan Gaussian NN Classification.

    Testing the performance of either CM NNC or CMG NNC would make a great paper for this course (thesis?).

    Improving it in some way would make an even better paper (thesis).

  • Machine Learning is based on Near Neighbor Set(s), NNS. Clustering, even density-based clustering, identifies near neighbor cores first (round NNSs about a center). Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity: for every ε>0 there is a δ>0 such that d(x,a)<δ implies d(f(x),f(a))<ε.
  • Functional Contours: given a function f:R(A1..An) → Y and a partition {Si} of Y, the contour set {f-1(Si)} is a partition of R (a clustering of R). For S ⊆ Y, contour(f,S) = f-1(S); equivalently, contour(Af,S) = SELECT A1..An FROM R* WHERE x.Af ∈ S. Graphically: a weather map, with f = barometric pressure or temperature and {Si} = an equi-width partition of the Reals; or f = local density (e.g., OPTICS: f = reachability distance, {Sk} = the partition produced by the intersection points of {graph(f), plotted wrt some walk of R} and a horizontal threshold line). A grid is the intersection of the dimension-projection contour partitions (see the next slide for more definitions). A Class is a contour under f:R → ClassAttr wrt the partition {Ci} of ClassAttr (where the {Ci} are the classes). An L∞ ε-disk about a is the intersection of all ε-dimension-projection contours containing a.

  • Given f:R → Y and a partition S={Sk} of Y, {f-1(Sk)} is the S,f-grid of R (grid cells = contours). If Y = Reals, the j.lo f-grid is produced by agglomerating over the j lo bits of Y, with the (b-j) hi bit pattern fixed.

    The j lo bits walk [isobars of] cells. The b-j hi bits identify cells. (lo=extension / hi=intention) Let b-1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S ={Sb-1,...,b-j | Sb-1,...,b-j = [(b-1)(b-2)...(b-j)0..0, (b-1)(b-2)...(b-j)1..1)} partition of Y=Reals.

    If F={fh}, the j.lo F-grid is the intersection partition of the j.lo fh-grids (an intersection of partitions). The canonical j.lo grid is the j.lo grid determined by the coordinate projections (the dth coordinate projection maps R → R[Ad]). j.hi gridding is similar (the b-j lo bits walk cell contents / the j hi bits identify cells).

    GRIDs. If the horizontal and vertical dimensions have bitwidths 3 and 2, respectively:

  • 2.lo grid and 1.hi grid. [Figure omitted: the two gridding examples, with cells labeled by their hi-bit patterns.]
    j.lo and j.hi gridding continued: horizontal_bitwidth = vertical_bitwidth = b iff the j.lo grid = the (b-j).hi grid; e.g., for hb = vb = b = 3 and j = 2. [Figure omitted: a 3-bit by 3-bit space with axis cells labeled 000 through 111.]

  • Similarity NearNeighborSets (SNNS). Given a similarity s: R×R → PartiallyOrderedSet (e.g., the Reals), i.e., s(x,y)=s(y,x) and s(x,x) ≥ s(x,y) for all x,y ∈ R, and given any C ⊆ R:
    The Cardinal disks, skins and rings are (PartiallyOrderedSet = Reals):
    disk(C,r) ≡ {x∈R | s(x,C) ≥ r}; also = the functional contour f-1([r,∞)), where f(x)=sC(x)=s(x,C).
    skin(C,r) ≡ disk(C,r) - C.
    ring(C,r2,r1) ≡ disk(C,r2) - disk(C,r1) = skin(C,r2) - skin(C,r1); also = the functional contour sC-1(r1,r2].
    The Ordinal disks, skins and rings are:
    disk(C,k) ⊇ C such that |disk(C,k) - C| = k and s(x,C) ≥ s(y,C) for all x∈disk(C,k), y∉disk(C,k).
    skin(C,k) = disk(C,k) - C (the skin consists of the k immediate neighbors and is a kNNS of C).
    ring(C,k) = cskin(C,k) - cskin(C,k-1), where closeddisk(C,k) ≡ the union of all disk(C,k) and closedskin(C,k) ≡ the union of all skin(C,k).
    L∞ skins: skin(a,k) = {x | for every dimension d, xd is one of the k-NNs of ad} (a local normalizer?).
    Note: closeddisk(C,r) is redundant, since all r-disks are closed, and closeddisk(C,k) = disk(C, s(C,y)) where y = the kth NN of C.

  • Ptrees. Formally, P-trees are defined as any of the following:
    Partition-tree: a tree of nested partitions (a partition P(R)={C1..Cn}; each component is partitioned by P(Ci)={Ci,1..Ci,ni}, i=1..n; each of those components is partitioned by P(Ci,j)={Ci,j,1..Ci,j,nij}; ...). Schematically: R at the root, with children C1..Cn, grandchildren C1,1..C1,n1, ..., Cn,1..Cn,nn, and so on.
    Predicate-tree: for a predicate on the leaf-nodes of a partition-tree (this also induces predicates on interior nodes using quantifiers). Predicate-tree nodes can be truth values (Boolean P-tree); they can be quantified existentially (1 or a threshold %) or universally; or a node can count the number of true leaf children of its component (Count P-tree).
    Purity-tree: a universally quantified Boolean Predicate-tree (e.g., if the predicate is "pure1", a Pure1-tree or P1tree). A node holds a 1-bit iff its corresponding component is pure1 (universally quantified). There are many other useful predicates, e.g., NonPure0-trees, but we will focus on P1trees.
    All Ptrees shown so far were 1-dimensional (recursively partition by halving bit files), but they can be 2-D (recursively quartering, e.g., used for 2-D images), 3-D (recursively eighth-ing), ..., or based on purity runs or LZW-runs. They are vertical, compressed, lossless structures that facilitate fast horizontal AND-processing.
    The jury is still out on parallelization: vertical (by relation), horizontal (by tree node), or some combination? Horizontal parallelization is pretty, but the network multicast overhead is huge. Use active networking? Clusters of Playstations? ...
    Further observations about Ptrees: Partition-trees have set nodes. Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree). Purity-trees, being universally quantified Boolean Predicate-trees, have Boolean nodes (since the count is always the full count of leaves, expressing Purity-trees as count-trees is redundant). A Partition-tree can be sliced at a given level if each partition at that level is labeled with the very same label set (e.g., the Month partition of years). A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition.

  • 0Pf,S is, equivalently, the existential R*-bit map of the predicate R*.Af ∈ S. The Compressed Ptree, sPf,S, is the compression of 0Pf,S with equi-width leaf size, s, as follows:

    1. Choose a walk of R (this converts 0Pf,S from a bit map to a bit vector). 2. Equi-width partition 0Pf,S with segment size s (s = leafsize; the last segment can be short). 3. Eliminate, and mask to 0, all pure-zero segments (call the mask the NotPure0 Mask, or EM). 4. Eliminate, and mask to 1, all pure-one segments (call the mask the Pure1 Mask, or UM). Compressing each leaf of sPf,S with leafsize = s2 gives s1,s2Pf,S; recursively, s1,s2,s3Pf,S, s1,s2,s3,s4Pf,S, ... (this builds an EM and a UM tree). A sketch of one compression level is given after the notes below.
    BASIC P-trees: if Ai is Real or Binary and fi,j(x) = the jth bit of xi, then {(*)Pfi,j,{1} ≡ (*)Pi,j}, j=b..0, are the basic (*)P-trees of Ai, * = s1..sk. If Ai is Categorical and fi,a(x) = 1 if xi=a, else 0, then {(*)Pfi,a,{1} ≡ (*)Pi,a}, a∈R[Ai], are the basic (*)P-trees of Ai.

    Notes: the UM masks (e.g., of 2k,...,20Pi,j, with k = roof(log2|R|)) form a (binary) tree. Whenever the EM bit is 0, that entire subtree can be eliminated (since it represents a pure0 segment); then a 0-node at level k (lowest level = level 0) with no sub-tree indicates a 2k-run of zeros. In this construction, the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix. (EM = existential aggregation, UM = universal aggregation.) The partitions used to create P-trees can come from functional contours (note: there is a natural duality between partitions and functions; a partition creates a function from the space of points partitioned to the set of partition components, and a function creates the pre-image partition of its domain). In Functional Contour terms (i.e., f-1(S) where f:R(A1..An) → Y, S ⊆ Y), the uncompressed Ptree or uncompressed Predicate-tree 0Pf,S = the bitmap of the set-containment predicate: 0Pf,S(x) = true iff x ∈ f-1(S).
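    A minimal sketch of one level of this compression: split the bit vector into equi-width segments and compute the EM (NotPure0) and UM (Pure1) masks, keeping only mixed segments for the next level (names are illustrative):

        def compress_level(bits, s):
            """One compression level: EM/UM masks over segments of size s, plus the segments kept."""
            segments = [bits[i:i + s] for i in range(0, len(bits), s)]
            EM = [1 if any(seg) else 0 for seg in segments]     # 0 marks a pure-zero segment (eliminated)
            UM = [1 if all(seg) else 0 for seg in segments]     # 1 marks a pure-one segment (eliminated)
            kept = [seg for seg, e, u in zip(segments, EM, UM) if e == 1 and u == 0]
            return EM, UM, kept

        bits = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
        print(compress_level(bits, 4))   # EM=[0,1,1,0], UM=[0,0,1,0], kept=[[1,0,1,1]]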

  • Example functionals: Total Variation (TV) functionals. TV(a) = Σx∈R (x-a)∘(x-a); using d as an index variable over the dimensions, this equals Σx∈R Σd=1..n (xd² - 2·ad·xd + ad²) (i, j, k index the bit slices). Note that the first term does not depend upon a. Thus the derived attribute TV(a) - TV(μ) (which eliminates the first term) is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).

  • From equation 7, the Normalized Total Variation is NTV(a) ≡ TV(a) - TV(μ) = |R| |a-μ|², where TV(a) = Σx,d,i,j 2^(i+j) xdi xdj + |R| (-2 Σd ad μd + Σd ad ad). LNTV(a) = ln(NTV(a)) = ln(TV(a)-TV(μ)) = ln|R| + ln|a-μ|². The value of LNTV(a) depends only on the length of a-μ, so its isobars are hyper-circles centered at μ, and the graph of LNTV is a log-shaped hyper-funnel. For an ε-contour ring (radius ε about a), go inward and outward along a-μ by ε to the points: inner point b = μ + (1 - ε/|a-μ|)(a-μ) and outer point c = μ + (1 + ε/|a-μ|)(a-μ). Then take g(b) and g(c) as the lower and upper endpoints of a vertical interval.

    Then we use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a). Thus there is a simpler function which gives us circular contours, the Log Normal TV function.

  • As pre-processing, calculate basic P-trees for the LNTV derived attribute (or another hyper-circular contour derived attribute). To classify a: 1. Calculate b and c (they depend on a and ε). 2. Form the mask P-tree for training points with LNTV values in [LNTV(b), LNTV(c)]. 3. Use that P-tree to prune out the candidate NNS. If the count of candidates is small, proceed to scan and assign class votes using the Gaussian vote function; else prune further using dimension projections (e.g., the contour of the dimension projection f(a)=a1). If the LNTV circumscribing contour of a is still too populous, use the circumscribing Ad-contour (note: Ad is not a derived attribute at all, but just Ad, so we already have its basic P-trees). We can also note that LNTV can be further simplified (retaining the same contours) using h(a)=|a-μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function? Others leap to mind, e.g., hb(a)=|a-b|.

  • Graphs of functionals with hyper-circular contours

  • Angular Variation functionals: e.g., AV(a) ≡ (1/|a|) Σx∈R x∘a; with d an index over the dimensions, = (1/|a|) Σx∈R Σd=1..n xd·ad = (1/|a|) Σd (Σx xd)·ad (factoring out ad). COS (and AV) have hyper-conic isobars centered on μ: COS(a) ≡ AV(a)/(|μ| |R|) = μ∘a / (|μ| |a|) = cos(μ,a). COS and AV have ε-contour(a) = the space between two hyper-cones centered on μ which just circumscribes the Euclidean ε-hyperdisk at a. Intersection (shown in pink in the figure) with the LNTV ε-contour. Graphs of functionals with hyper-conic contours: e.g., COSb(a) for any vector b.

  • Some additional formulas. f(a)(x) = (x-a)∘(x-a); with d an index over the dimensions, = Σd=1..n (xd² - 2·ad·xd + ad²) (i, j, k index the bit slices) = Σd,i,j 2^(i+j) (xdi·xdj - 2·adi·xdj + adi·adj) = Σd,i,j 2^(i+j) (xdi - adi)(xdj - adj). Adding up the Gaussian-weighted votes for class c: collecting the diagonal terms inside the exp, the i,j,d coefficients inside the exp are multiplied by a 1|0-bit (which depends on x); for fixed i, j, d, the coefficient is either x-independent (if it is a 1-bit) or not (if a 0-bit).

  • fd(a)(x) = |xd - ad| = |Σi 2^i (xdi - adi)| = | Σ{i: adi=0} 2^i·xdi  -  Σ{i: adi=1} 2^i·x'di |. Thus, for the derived attribute fd(a) = the numeric distance of xd from ad: if we remember that, when adi=1, we subtract those contributing powers of 2 (rather than add), and that for those positions we use the complements of the dimension-d basic Ptrees, then it should work.

    The point is that we can get a set of near basic or negative basic Ptrees, nbPtrees, for derived attr fd(a) directly from the basic Ptrees for Ad for free. Thus, the near basic Ptrees for fd(a) are

    the basic Ad Ptrees for those bit-positions where adi = 0 and they are the complements of the basic Ad Ptrees for those bit-positions where adi = 1 (called fd(a)'s nbPtrees)

    Caution: subtract the contribution of the nbPtrees for positions where adi=1. Note: nbPtrees are not predicate trees (are they? what's the predicate?). The EIN ring formulas are related to this; how?

    If we are simply after easy pruning contours containing a (so that we can scan to get the actual Euclidean epsilon neighbors and/or to get Gaussian weighted vote counts), we can use Hobbit-type contours (middle earth contours of a?).

    See next slide for a discussion of hobbit contours.

  • A principle: a job is not done until the Mathematics is completed. The Mathematics of a research job includes: 0. getting to the frontiers of the area (researching, organizing, understanding and integrating everything others have done in the area up to the present moment and what they are likely to do next); 1. developing a killer idea for a better way to do something; 2. proving claims (theorems, performance evaluation, simulation, etc.); 3. simplification (everything is simple once fully understood); 4. generalization (to the widest possible application scope); and 5. insight (what are the main issues and underlying mega-truths (with full drill down)).

    Therefore, we need to ask the following questions at this point: should we use the vector of medians (the only good choice of middle point in multidimensional space, since the "point closest to the mean" definition is influenced by skewness, like the mean)?

    We will denote the vector of medians as

    h(a)=|a-| is an important functional (better than h(a)=|a-|?)

    If we compute the median of an even number of values as the count-weighted average of the middle two values, then in binary columns, and coincide. (so if and are far apart, that tells us there is high skew in the data (and the coordinates where they differ are the columns where the skew is found).

  • Are they easy P-tree computations? Do they offer advantages? When? What? Why?

    E.g., do they automatically normalize for us? What about the vector of standard deviations (computable with P-trees!)? Do we have an improvement of BIRCH here - generating similar comprehensive statistical measures, but much faster and more focused?

    We can do the same for any rank statistic (or order statistic), e.g., vector of 1st or 3rd quartiles, Q1 or Q3 ; the vector of kth rank values (kth ordinal values).

    If we preprocessed to get the basic P-trees of , and each mixed quartile vector (e.g., in 2-D add 5 new derived attributes; , Q1,1, Q1,2, Q2,1, Q2,2; where Qi,j is the ith quartile of the jth column), what does this tell us (e.g., what can we conclude about the location of core clusters? Maybe all we need is the basic P-trees of the column quartiles, Q1..Qn ?)Additional Mathematics to enjoy:L ordinal disks:

    disk(C,k) = {x | xd is one of the k-Nearest Neighbors of ad d}.

    skin(C,k), closed skin(C,k) and ring(C,k) are defined as above.

  • The Middle Earth Contours of a are gotten by ANDing in the basic Ptree for ad,i=1 and ANDing in the complement if ad,i=0 (down to some bit-position threshold in each dimension, bptd . bptd can be the same for each d or not).

    Caution: Hobbit contours of a are not symmetric about a. That becomes a problem (for knowing when you have a symmetric neighborhood in the contour), especially when many lowest-order bits of a are identical (e.g., if ad = 8 = 1000).

    If the low order bits of ad are zeros, one should union (OR) take the Hobbit contour of ad - 1 (e.g., for 8 also take 7=0111)

    If the low order bits of ad are ones, one should union (OR) the Hobbit contour of ad + 1 (e.g, for 7=111 also take 8=1000)

    Some need research:

    Since we are looking for an easy prune to get our mask down to a scannable size (low root count), but not so much of a prune that we have too few voters within Euclidean epsilon distance of a for a good vote, how can we quickly determine an easy choice of a Hobbit prune to accomplish that? Note that there are many Hobbit contours. We can start with pruning in just one dimension, and with only the lowest-order bit in that dimension, and work from there. How, though?

    THIS COULD BE VERY USEFUL?

  • Suppose there are two classes, red and green, and they are on the cylinder shown. Then the vector connecting the medians (vcm) in YZ space is shown in purple. The unit vector in the direction of the vector connecting the medians (uvcm) in YZ space is shown in blue. The vector from the midpoint of the medians to the sample s is in orange. The inner product of the blue and the orange is the same as the inner product we would get by doing the same thing in all 3 dimensions! The point is that the x-component of the red vector of medians and that of the green are identical, so that the x-component of the vcm is zero. Thus, when the vcm component in a given dimension is very small or zero, we can eliminate that dimension! That's why I suggest a threshold for the inner product in each dimension first. It is a feature or attribute relevance tool.

  • DBQ versus MIKE (DataBase Querying vs. Mining through data for Information and Knowledge Extraction)

    Why do we call it Mining through data for Information & Knowledge Extraction and not just Data Mining? We Mine Silver and Gold! We don't just Mine Rock (The emphasis should be on the desired result, not the discard. The name should emphasize what we mine for, not what we mine from.)

    Silver and Gold are low-volume, high-value products, found (or not) in mountains of rock (high-volume, low-value). Information and knowledge are low-volume, high-value products, hiding in mountains of data (high-volume, low-value).

    In both MIKE and MSG the output and the substrate are substantially different in structure (chemical structure / data structure). Just as in Mining Silver and Gold we extract (hopefully) Silver and Gold from raw Rock, in Mining through data for Information and Knowledge we extract (hopefully) Information and Knowledge from raw Data. So Mining through data for Information and Knowledge Extraction is the correct terminology, and MIKE is the correct acronym, not Data Mining (DM).

    How is Data Base Querying (DBQ) different from Mining thru data for Info & Knowledge (MIKE)?

    In all mining (MIKE as well as MSG) we hope to successfully mine out something of value, but failure is likely; whereas in DBQ, valuable results are likely and coming up empty is unlikely.

    DBQ should be called Data Base Quarrying, since it is more similar to Granite Quarrying (GQ), in that what we extract has the same structure as that from which we extract it (the substrate); it has higher value because of its detail and specificity. I.e., the output records of a DBQ are exactly the reduced-size set of records we demanded and expected from our query, and the output gravestones of GQ are exactly the size and shape we demanded and expected; in both cases what is left behind is a substance of the same kind as what is taken.

    In sum:
    DBQ = Quarrying (highly predictable output, and the output has the same structure as the substrate: sets of records).
    MIKE = Mining (unpredictable output, and the output has a different structure than the substrate: e.g., T/F, or a partition).

  • Some good datasets for classification: KDDCUP-99 Dataset (Network Intrusion Dataset), 4.8 million records, 32 numerical attributes, 6 classes (each containing >10,000 records). Class distribution:

    Testing set: 120 records, 20 per class. 4 synthetic datasets (randomly generated): 10,000 records (SS-I), 100,000 records (SS-II), 1,000,000 records (SS-III), 2,000,000 records (SS-IV).

  • Speed and Scalability: speed (scalability) comparison (k=5, hs=25). Machine: Intel Pentium 4 CPU 2.6 GHz, 3.8 GB RAM, running Red Hat Linux. Note: these evaluations were done when we were still sorting the derived TV attribute and before we used Gaussian vote weighting, so both the speed and the accuracy of SMART-TV have since improved markedly! [Chart: "Running Time Against Varying Cardinality"; x-axis: Training Set Cardinality (x1000), y-axis: Time in Seconds; series: SMART-TV, PKNN, KNN]

  • Dataset (cont.): OPTICS dataset, 8,000 points, 8 classes (CL-1, CL-2, ..., CL-8), 2 numerical attributes.

    Training set: 7,920 points. Testing set: 80 points, 10 per class.

    [Chart1: embedded chart data, not reproducible in text form]