A New OLAP Aggregation Based on the AHC Technique
DOLAP 2004
R. Ben Messaoud, O. Boussaid, S. Rabaséda
Laboratoire ERIC – Université de Lyon 2
5, avenue Pierre-Mendès–France
69676, Bron Cedex – France
http://eric.univ-lyon2.fr
November 13, 2004 Ben Messaoud et al. 2
Complex data
Definition:
Data are considered complex if they are …
Multi-format: information can be carried by different kinds of data (numeric, symbolic, texts, images, sounds, videos …)
Multi-structure: structured, unstructured or semi-structured (relational databases, XML documents …)
Multi-source: data come from different sources (distributed databases, web …)
Multi-modal: the same information can be described differently (data in different languages …)
Multi-version: data are updated through time (temporal databases, periodical inventory …)
General context
Complex data
  Huge volumes of complex data
  Warehousing complex data …
  OLAP facts as complex objects
Analyze complex data
  Current OLAP tools aren’t suited to process complex data
  Data mining is able to process complex data like images, texts, videos …
Coupling OLAP and data mining
  Analyze complex data on-line
  New operator OpAC: Operator of Aggregation by Clustering (AHC)
[Diagram labels: Data mining, OLAP, Complex data, MDBMS, OpAC]
Outline
Complex data and general context
Related work: Coupling OLAP and data mining
Objectives of the proposed operator
Formalization of the operator
Implementation and demonstration
Conclusion and future work
Related work
Three approaches for coupling OLAP and data mining
  First approach: Extending the query languages of decision support systems
  Second approach: Adapting the multidimensional environment to classical data mining techniques
  Third approach: Adapting data mining methods for multidimensional data
[Diagram labels: Data mining, OLAP, DBMS; First approach, Second approach, Third approach]
Related work
These works showed that:
  Associating data mining with OLAP is a promising way to support rich analysis tasks
  Data mining is able to extend the analysis power of OLAP
Use data mining to enhance OLAP tools in order to process complex data
OpAC: A new OLAP operator based on a data mining technique
[Diagram labels: Data mining, OLAP, OpAC]
Objectives
Classic OLAP aggregation vs. OpAC aggregation
Classic OLAP:
  Summarizes numerical data in a smaller number of values
  Computes additive measures (Sum, Average, Max, Min …)
Example: Sales cube

  Detailed level (cities)     Sales   Count
  - Washington
      + Bellingham             $700      32
      + Bremerton              $400      20
      + Olympia                $850      44
      + Redmond                $250       9
      + Seattle                $320      15
  - California
      + Berkeley               $820      41
      + Beverly Hills          $910      50
      + Los Angeles            $680      38

  Aggregated level (states)   Sales   Count
  + Washington                $2520     120
  + California                $2410     129
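The roll-up in the sales example can be sketched in a few lines of Python. This is only an illustration of classic additive aggregation, not part of the OpAC prototype; the figures are the ones on the slide, while the names `SALES_FACTS` and `roll_up_to_state` are made up for this sketch.

```python
from collections import defaultdict

# City-level cells of the sales cube from the slide: (state, city, sales in $, count).
SALES_FACTS = [
    ("Washington", "Bellingham", 700, 32),
    ("Washington", "Bremerton", 400, 20),
    ("Washington", "Olympia", 850, 44),
    ("Washington", "Redmond", 250, 9),
    ("Washington", "Seattle", 320, 15),
    ("California", "Berkeley", 820, 41),
    ("California", "Beverly Hills", 910, 50),
    ("California", "Los Angeles", 680, 38),
]


def roll_up_to_state(facts):
    """Sum both measures over the cities of each state (a classic roll-up)."""
    totals = defaultdict(lambda: {"sales": 0, "count": 0})
    for state, _city, sales, count in facts:
        totals[state]["sales"] += sales
        totals[state]["count"] += count
    return dict(totals)


if __name__ == "__main__":
    for state, measures in roll_up_to_state(SALES_FACTS).items():
        print(f"{state}: ${measures['sales']} sales, count {measures['count']}")
```

The sums reproduce the state-level cells of the slide ($2520 and 120 for Washington, $2410 and 129 for California), which works only because both measures are additive.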
Objectives
Classic OLAP aggregation vs. OpAC aggregation
OpAC aggregation:
  What about aggregating complex objects?
  How to aggregate images, texts or videos with classic OLAP tools?
  Complex objects are not additive OLAP measures …
Example: Images cube

  Images           Size     ASM
  Orange coral     3560px   0.016
  Nebraska, USA    2340px   0.021
  Toco toucan      4434px   0.014
  Maldives         3260px   0.012
  Aggregate        ?
Objectives
How to aggregate complex objects?
Using a data mining technique: AHC (Agglomerative Hierarchical Clustering)
  AHC aggregates data: it repeatedly merges the most similar groups of objects
  AHC is hierarchical: it yields nested partitions rather than a single grouping
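To make the AHC idea concrete, here is a minimal pure-Python sketch of agglomerative clustering. It is not the prototype's code: the Euclidean distance, the average-linkage criterion, the sample points and all function names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical individuals described by two numeric variables each.
SAMPLE_POINTS = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]


def average_linkage(a, b, points):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def ahc(points, linkage=average_linkage):
    """Naive agglomerative hierarchical clustering.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters. Returns the merges as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, linkage(a, b, points)))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges


if __name__ == "__main__":
    for a, b, d in ahc(SAMPLE_POINTS):
        print(f"merge {a} + {b} at distance {d:.3f}")
```

Each merge is one aggregation step, and reading the merge list from last to first gives the nested partitions into 2, 3, 4 … clusters. The quadratic search over cluster pairs keeps the sketch short; it is not meant for large data sets.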
Objectives
[Figure: images data cube with the dimensions Images, Entropy and Homogeneity; Entropy and Homogeneity each range over Very low, Low, Medium, High and Very high; highlighted slices: "L1Normalized for high homogeneity" and "L1Normalized for low entropy"]
Formalization
$D_i$: the $i$-th dimension of a data cube $C$
$h_{ij}$: the $j$-th hierarchical level of the dimension $D_i$
$g_{ijt}$: the $t$-th modality of $h_{ij}$
The set of individuals:
  $\{\, g_{ijt} \mid g_{ijt} \in h_{ij} \,\}$
The set of variables:
  $\{\, X \mid X(g_{ijt}) = \text{Measure of } g_{srv} \text{ crossed with } g_{ijt} \,\}$, where $g_{srv} \in h_{sr}$, $s \neq i$ and $r$ is unique for each $s$
  The dimension retained for individuals can’t generate variables
  Only one hierarchical level of a dimension is allowed to generate variables
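The two sets defined above amount to an individuals-by-variables table extracted from the cube. The sketch below shows one possible reading under stated assumptions: the facts, the field names (`image`, `entropy`, `homogeneity`, `l1`) and the choice of summing the measure over matching cells are all hypothetical, and each entry of `variable_dims` stands for a single hierarchical level of a distinct dimension.

```python
from collections import defaultdict

# Hypothetical facts of an images cube: one row per cell, with one measure "l1".
IMAGE_FACTS = [
    {"image": "img1", "entropy": "Low", "homogeneity": "High", "l1": 0.25},
    {"image": "img1", "entropy": "Low", "homogeneity": "Medium", "l1": 0.5},
    {"image": "img2", "entropy": "High", "homogeneity": "Low", "l1": 0.75},
    {"image": "img2", "entropy": "Low", "homogeneity": "Low", "l1": 0.25},
]


def build_table(facts, individual_dim, variable_dims, measure):
    """Build the individuals x variables table used as clustering input.

    Individuals are the modalities of `individual_dim`. Each variable is a
    (dimension, modality) pair from `variable_dims`; its value for an
    individual is the measure summed over the cells where both occur
    (summing is an assumption of this sketch).
    """
    if individual_dim in variable_dims:
        raise ValueError("the dimension retained for individuals can't generate variables")
    if len(set(variable_dims)) != len(variable_dims):
        raise ValueError("only one hierarchical level per dimension may generate variables")

    cells = defaultdict(float)
    for fact in facts:
        for dim in variable_dims:
            cells[(fact[individual_dim], (dim, fact[dim]))] += fact[measure]

    individuals = sorted({fact[individual_dim] for fact in facts})
    variables = sorted({(dim, fact[dim]) for fact in facts for dim in variable_dims})
    rows = {g: [cells.get((g, v), 0.0) for v in variables] for g in individuals}
    return individuals, variables, rows


if __name__ == "__main__":
    individuals, variables, rows = build_table(
        IMAGE_FACTS, "image", ["entropy", "homogeneity"], "l1"
    )
    print(variables)
    for g in individuals:
        print(g, rows[g])
```

The two checks at the top of `build_table` mirror the two constraints on the set of variables; everything else is bookkeeping.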
Formalization
Evaluation tools
  Minimize the intra-cluster distances
  Maximize the inter-cluster distances
Inter and intra-cluster inertia
  $A_1, A_2, \dots, A_k$ is a partition of the set of individuals
  $P(A_i)$ is the weight of $A_i$
  $G(A_i)$ is the gravity center of $A_i$
  $I_{intra}(k) = \sum_{i=1}^{k} I(A_i)$
  $I_{inter}(k) = \sum_{i=1}^{k} P(A_i)\, d(G(A_i), G)$
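The two inertia indicators can be computed directly from a partition. The sketch below makes several assumptions that the slide does not state: every individual has the same weight 1/n, so the weight of a cluster is its relative size; d is the squared Euclidean distance; G is the gravity center of all individuals; and the inertia I of a cluster is the weighted sum of squared distances of its members to its own gravity center. Function names are illustrative.

```python
def centroid(points):
    """Gravity center of a non-empty list of equally weighted points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertias(clusters):
    """Return (intra, inter) inertia of a partition given as a list of clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    n = len(all_points)
    g = centroid(all_points)  # global gravity center (assumed meaning of G)
    intra = 0.0
    inter = 0.0
    for cluster in clusters:
        g_i = centroid(cluster)
        # Inertia of the cluster: each member weighs 1/n.
        intra += sum(sq_dist(p, g_i) for p in cluster) / n
        # Cluster weight (relative size) times distance of its center to G.
        inter += (len(cluster) / n) * sq_dist(g_i, g)
    return intra, inter


if __name__ == "__main__":
    partition = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
    intra, inter = inertias(partition)
    print(f"intra = {intra}, inter = {inter}, total = {intra + inter}")
```

Under these assumptions the sum of the two indicators is the total inertia of the data and does not depend on the partition, so minimizing the intra-cluster inertia and maximizing the inter-cluster inertia are the same goal for a fixed number of clusters.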
Formalization
Individuals: modalities from the dimension of images
Variables:
  L1Normalized values of images for all possible modalities of the entropy dimension
  L1Normalized values of images for all possible modalities of the homogeneity dimension
[Figure: images cube with Entropy and Homogeneity levels from Very low to Very high, and a plot of inter-cluster and intra-cluster inertia, vertical axis 0 to 500, horizontal axis 7 down to 1]
Formalization
Results:
Exploits the cube’s facts describing images to construct groups of similar complex objects
Highlights significant groups of objects by a clustering technique
Clusters (aggregates) are defined from both the dimensions and the measures of a data cube
Implementation of a prototype
Implementation
Prototype:
Data loading module:
  Connects to a data cube on Analysis Services of MS SQL Server
  Uses MDX queries to import information about the cube’s structure
  Extracts the data selected by the user
Parameter setting interface:
  Helps the user extract individuals and variables from the cube
  Selects modalities and measures
  Defines the clustering problem
Clustering module:
  Allows the definition of clustering parameters such as the dissimilarity metric and the aggregation criterion
  Constructs the AHC
  Plots the results of the AHC on a dendrogram
Implementation
Images dataset:
3000 images collected from the web:
  Semantic annotation: description, subject and theme
  Descriptors of texture like:
    ENT: Entropy
    CON: Contrast
    L1Normalized: Medium Color Characteristic
    …
  Three color channels: RGB
Implementation
Demonstration:
Conclusion
OpAC is one way to perform on-line analysis of complex data
OpAC aggregates complex objects
Aggregates (clusters) are defined from both the dimensions and the measures of a data cube
Prototype available at: http://bdd.univ-lyon2.fr/?page=logiciel&id=5
Future work
The current evaluation tool may have some limitations
  Use other evaluation indicators to assess the quality of partitions
  Help the user find the best number of clusters
Exploit the aggregates generated by OpAC to reorganize the cube’s dimensions
  Get a new cube with remarkable regions
Use other data mining techniques to enhance the power of OLAP with explanation and prediction capabilities
The End