horizontal data sets: number of attributes is of the same order to several orders of magnitude...


Post on 21-Dec-2015


Horizontal data sets: the number of attributes ranges from the same order of magnitude as the number of records to several orders of magnitude higher.

Example: a genetic data set can have 10,000 attributes and only 100 records.

With 10,000 attributes there are up to 100 million (10^8) combinations of two attributes and up to 1 trillion (10^12) three-attribute sets!
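The counts above can be checked with a few lines of Python. The "up to" figures correspond to 10,000^2 and 10,000^3; the number of distinct unordered attribute sets is smaller, as the binomial coefficients show. The variable names are illustrative only.

```python
from math import comb

n_attributes = 10_000

# Upper bounds quoted on the slide: n^2 and n^3.
pair_bound = n_attributes ** 2
triple_bound = n_attributes ** 3

# Distinct unordered sets of two and three attributes.
distinct_pairs = comb(n_attributes, 2)
distinct_triples = comb(n_attributes, 3)

print(f"pairs:   bound {pair_bound:,}, distinct {distinct_pairs:,}")
print(f"triples: bound {triple_bound:,}, distinct {distinct_triples:,}")
```

Either way, exhaustive enumeration of attribute combinations is impractical at this width, which is the motivation for driving the search from the (few) records instead.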

Data Driven Algorithm

Constructing the Max-conf kernel for small data sets:

Input: i) a Database DB ii) a fixed consequent C

Output: a set R of rules such that, for any rule of the form X->C, there exists a rule X'->C in R, where X' is a superset of X and X'->C has a higher confidence than X->C

Algorithm:
// DB(C) is the set of records that satisfy the consequent
// RS is a working set which maintains the current subset of records that satisfy the consequent
// COMMON is the set of common descriptors for the record set RS

MaxConfKernelSet(DB, C, DB(C), RS, COMMON) {
    i = size(RS) + 1;
    if (i == 1) { COMMON = descriptors in the i-th record in DB(C); }
    RS = RS ∪ {i-th record in DB(C)};
    while (i <= size(DB(C))) do {
        Delete from COMMON the descriptors not shared by the i-th record;
        Compute support of records satisfying {COMMON - C};
        Compute the confidence of COMMON - C -> C;
        if ((COMMON - C) != null) {
            if sufficient support and not duplicate
                output "COMMON - C -> C [support, conf]";
            MaxConfKernelSet(DB, C, DB(C), RS, COMMON);
        }
        RS = RS - {i-th record in DB(C)};
        i++;
        RS = RS ∪ {i-th record in DB(C)};
    }
}

Invoke: MaxConfKernelSet(DB, C, DB(C), null, null); // RS and COMMON are empty initially
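The recursion above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: records are modelled as sets of descriptors, the consequent is a set of descriptors, support is counted as the number of records matching the antecedent (as the pseudocode words it), and the names `max_conf_kernel`, `extend` and `min_support` are hypothetical. The search walks subsets of DB(C), so it is only suitable for small record counts.

```python
def max_conf_kernel(db, consequent, min_support=1):
    """Return {antecedent: (support, confidence)} for rules antecedent -> consequent.

    Antecedents are the descriptors shared by some subset of the records
    that satisfy the consequent (assumed representation: sets of descriptors).
    """
    consequent = frozenset(consequent)
    db = [frozenset(r) for r in db]
    db_c = [r for r in db if consequent <= r]  # DB(C)
    rules = {}  # also serves as the duplicate check

    def extend(start, common):
        for i in range(start, len(db_c)):
            # Keep only the descriptors shared with the i-th record of DB(C).
            shared = db_c[i] if common is None else common & db_c[i]
            antecedent = shared - consequent
            if not antecedent:
                continue  # nothing in common; larger subsets cannot help
            if antecedent not in rules:
                covered = [r for r in db if antecedent <= r]
                support = len(covered)
                hits = sum(1 for r in covered if consequent <= r)
                if support >= min_support:
                    rules[antecedent] = (support, hits / support)
            extend(i + 1, shared)  # grow the record subset RS

    extend(0, None)
    return rules


# Made-up toy data: four records, consequent descriptor "Y".
db = [
    {"a", "b", "c", "Y"},
    {"a", "b", "Y"},
    {"a", "b"},
    {"b", "c", "Y"},
]
rules = max_conf_kernel(db, {"Y"})
for antecedent, (support, conf) in sorted(rules.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(antecedent), "-> Y", f"[support={support}, conf={conf:.2f}]")
```

Because antecedents are built only from descriptors that actually co-occur in records of DB(C), the work is bounded by the number of records rather than by the number of attribute combinations.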

OLAP and Statistical databases

• Statistical databases (SDB): from the early 80s
  – Multidimensional data sets concerned with summarization over the dimensions of the data sets
  – 2-D representations: census, socioeconomic data, etc.

• OLAP (on-line analytical processing): from the mid 90s

Multi-dimensional Statistical Table

2-D representation of statistical data

A graph model for statistical data

A scheme for stat data

More schemes

More schemes

Relational representation of statistical object

Automatic aggregation concept

Terms in SDB and OLAP

SDB and OLAP operators

Completeness of statistical algebra

Overlapping and time-varying categories

Physical organization

Encoding column category values

Array linearization
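Array linearization, in the usual sense of mapping a multidimensional cell address to a single storage offset, can be sketched as below. Row-major order and the function names `linearize` and `delinearize` are assumptions made for illustration; the slide may use a different ordering.

```python
def linearize(index, shape):
    """Map a multidimensional cell index to a single row-major offset."""
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of size {n}")
        offset = offset * n + i
    return offset


def delinearize(offset, shape):
    """Recover the multidimensional index from a row-major offset."""
    index = []
    for n in reversed(shape):
        offset, i = divmod(offset, n)
        index.append(i)
    return tuple(reversed(index))


# Hypothetical cube with 3 x 4 x 5 category values per dimension.
shape = (3, 4, 5)
cell = (1, 2, 3)
position = linearize(cell, shape)
print(position, delinearize(position, shape))
```

Storing cells by offset means the category values themselves need not be stored with each cell; they are implied by the position.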

Header compression

Lattice of materialization

Partitioning of a data cube into subcubes

Cube operator

Data Cube – shortcomings of SQL

Sales Roll Up by Model by Year and by color

Using ALL value

3 dimensional rollup in SQL

Cross-tabulation in SQL

Cross Tabulation

CUBE operator
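As a rough illustration of what the CUBE operator computes, the sketch below aggregates a measure over every subset of the grouping dimensions, writing ALL in the positions that have been summed out. The rows are made-up sample data and the names `cube`, `ALL` and `units` are assumptions for the example, not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for a dimension that has been aggregated away


def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (2^len(dims) group-bys)."""
    totals = defaultdict(int)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            for row in rows:
                key = tuple(row[d] if d in kept else ALL for d in dims)
                totals[key] += row[measure]
    return dict(totals)


# Made-up sales rows, for illustration only.
sales = [
    {"model": "Chevy", "year": 1994, "color": "black", "units": 50},
    {"model": "Chevy", "year": 1994, "color": "white", "units": 40},
    {"model": "Chevy", "year": 1995, "color": "black", "units": 85},
    {"model": "Ford", "year": 1994, "color": "black", "units": 50},
]
result = cube(sales, ("model", "year", "color"), "units")
print(result[(ALL, ALL, ALL)])      # grand total
print(result[("Chevy", ALL, ALL)])  # total per model
print(result[(ALL, 1994, ALL)])     # total per year
```

In plain SQL the same result requires a union of one GROUP BY query per subset of dimensions, which is the kind of shortcoming the cube operator addresses.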

Support of histograms

A 3D data cube

ALL value and decoration field

Decorations

ROLLUP operator
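ROLLUP can be read as a restricted form of cubing: it aggregates only along prefixes of the dimension list, from the finest grouping up to the grand total. The sketch below assumes that reading; the sample rows and the names `rollup`, `ALL` and `units` are hypothetical.

```python
from collections import defaultdict

ALL = "ALL"  # placeholder for a dimension that has been aggregated away


def rollup(rows, dims, measure):
    """Sum `measure` for each prefix of `dims`: (d1..dn), (d1..dn-1), ..., ()."""
    totals = defaultdict(int)
    for k in range(len(dims), -1, -1):
        for row in rows:
            key = tuple(row[d] if j < k else ALL for j, d in enumerate(dims))
            totals[key] += row[measure]
    return dict(totals)


# Made-up sales rows, for illustration only.
sales = [
    {"model": "Chevy", "year": 1994, "color": "black", "units": 50},
    {"model": "Chevy", "year": 1994, "color": "white", "units": 40},
    {"model": "Chevy", "year": 1995, "color": "black", "units": 85},
    {"model": "Ford", "year": 1994, "color": "black", "units": 50},
]
rolled = rollup(sales, ("model", "year", "color"), "units")
print(rolled[("Chevy", 1994, ALL)])  # subtotal by model and year
print(rolled[("Chevy", ALL, ALL)])   # subtotal by model
print(rolled[(ALL, ALL, ALL)])       # grand total
```

With n dimensions this produces n + 1 groupings instead of 2^n, so groupings such as "by year only" do not appear in the result.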

Percentage of total as an aggregate function

Indices

Star schema