building quick and accurate structure for data warehouse
TRANSCRIPT
-
8/13/2019 Building Quick and Accurate Structure for Data Warehouse
1/3
International Journal of Advanced Computer Science, Vol. 2, No. 7, Pp. 274-276, Jul., 2012.
ManuscriptReceived:
17,Apr., 2012
Revised:
12, May,2012
Accepted:
28,Jun., 2012
Published:15,Aug., 2012
Keywords
Datawarehouse,
OLAP,
ROLAP,
Star Schema,
Snowflake
Schema,
Abstract Data Warehouse is Described as aset of subject oriented, integration, non
volatile and time variant system to support
data decision system (DSS) and online
response to manager analytical queries.
These kind of queries are complicated and
they done on the large amount of data and
they used for data decision system and online
analytical processing for data mining .Data
warehouse and online analytical processing
are necessary to support manager decision.In this research, first we describe data
warehousing and consider some kind of data
warehouse scheme and how should we choose
one of them. Main goal of this research, is
acceleration of selecting scheme in data
warehouse design, for this goal first we
consider all of related works and then we
define an algorithm to select data warehouse
scheme and finally we test accuracy of the
algorithm .To implement this test, we use
Microsoft c#.net and Sql server to design
data warehouse and some assumed data
1. IntroductionOne of the problems that exist related to data warehouse
design, is lack of procedures to select appropriate schema.
Available resources [1-3] investigated advantages anddisadvantages of different schemas. Some of them [2-5],solve some of the problems related to schemas and some ofothers [6-8] improved query response time. But none ofthese resources have represented the appropriate frameworkto select appropriate schema based on type of queries andtype of attributes.
In available resources, [3], [5] Schema selection is based
on personal opinion and business requirements. Also, thetool is used, widely affected schema selection. Some of
tools like oracle and MS SQL have higher efficiency withstar schema; While DB2 works better with snowflakeschema. Environment is one of the factors affected schema
selection too. For example if data warehouse is composed ofsome data marts, using star schema is better. With theseconditions, finding the appropriate schema is timeconsuming and is based on try and error. In fact we should
start from completely normal snowflake schema, in eachtime, de-normalize one of the dimensions and measure the
This work was supported by Department Of computer science, Sarvestan
Branch, Islamic Azad University, sarvestan, Iran([email protected])
efficiency. This work is repeated until the optimal
compound schema is obtained.In fact, so far the above mentioned factors have affected
schema selection in data warehouse design. These factors
are necessary for schema selection,but arent sufficient andmay be lead to inappropriate schema selection and lowefficiency.
To solve these problems and represent the appropriate
way to schema selection that improves the efficiency and
usability of data warehouse, in this paper, the newframework to select data schema for data warehouse isreported. In next section, this framework is described and inthe following section, all tests regarding to all classicschemas and some research developed schema [9] which
show the framework is effective, are reported.
2. Schema SelectionIn this section, we will represent a framework for
appropriate schema selection in data warehouse design. Forthis purpose, Decision tree is used. The type of queries and
attributes affected schema selection in this framework. Thetype of query depends on number of join operation neededto response it and type of attributes it access. The types of
attributes are multi valued attributes, single-valuedattributes and indexed attributes.
The structure of framework is formed as a decision treeand represented in Fig. 1.
Fig. 1. schema selection framework
We can state all paths in this decision tree as IF, THENstatements. All these statements have been tested andcorrectness of them was confirmed. In the following, we
will show these statements.
A. Case 1 If in some dimension tables, one attribute acts as aparent in two different hierarchies, Then
Building Quick and Accurate Structure for Data
WarehouseMaryam shekofteh
-
8/13/2019 Building Quick and Accurate Structure for Data Warehouse
2/3
Maryam shekofteh:Building Quick and Accurate Structure for Data Warehouse.
International Journal Publishers Group (IJPG)
275
o If this attribute or one of its ancestors are queriedfrequently, the framework propose Improved StarCluster schema [9].
o Else Star Cluster schema [2] is used.B.
Case 2
If it is possible to normalize some of dimension tables,Then
o If the result tables from normalization thesedimension tables are small, Then star schema andsnowflake schema works equally. So with consideringused tools, schema will be selected.
If used tool is oracle, MS SQL, that worksbetter with star schema, the framework propose starschema.
If used tool is DB2, that works better withsnowflake schema, the framework proposesnowflake schema.
o Else if the attribute is queried, is indexed, theframework propose star schema.
Else with try and error, the appropriate schema isselected.
C. Case 3If it isnt possible to normalize the rest of tables, the
framework propose star schema.
D. Case 4 If there is multi-valued attribute in some dimensiontables, i.e. There are multiple values for one attribute
corresponding to single value for other attribute, theno If the number of multi-valued attributes is known,then
If in most times, queries only need to access tableT1 in first level of tables that resulted fromnormalizing this dimension, then there is nodifference between star schema and snowflakeschema. So with respect to used tool, we could select
data schema. Therefore
If used tool is DB2, that works betterwith snowflake schema, the framework proposesnowflake schema
If used tool is used tool is oracle, MSSQL, that works better with star schema, theframework propose extended star schema [4].
If in most times, queries need to access outerlevel tables, Then the framework propose extendedstar schema [4]
o If the number of multi-valued attributes is notknown, then
If in most times, queries only need to access tableT1 in first level of tables that resulted fromnormalizing this dimension, then there is nodifference between star schema and snowflake
schema. So with respect to used tool, we could selectdata schema. Therefore
If used tool is DB2, that works betterwith snowflake schema, Then the frameworkpropose snowflake schema.
If used tool is used tool is oracle, MSSQL, that works better with star schema, Then
the framework propose extended star schema [4]. If in most times, queries need to access outerlevel tables, then the framework propose extendedstar schema [4].
E. Case 5If conditions that Kimball states in [10] are true, then
using snowflake schema is better. Kimball often prefers touse star schema because of its simplicity and efficiency. Buthe said in certain situations, snowflake schema is not onlyacceptable, but recommended [10]. These situations are the
cases that there are many null values in large abnormal
dimension tables. In these situations, variations ofsnowflake schemas can be useful.
If multiple of above conditions are true, by combiningthe results of each condition, the final schema will beobtained. In the following, we will show the conditions
related to every edges in this decision tree.We assume if dimension table T is normalized, T1, T2, ,
Tnwill be resulted.
e1: An attribute acts as a parent in two differentdimensional hierarchies.
e2:It is possible to normalize some of dimension tables.e3: It isnt possible to normalize the rest of tables.
e4:There is multi-valued attribute in some dimensiontables.
e5: The conditions that Kimball states in [10] are true.e6: The attribute related to edge e1or one of its ancestors
arent queried frequently.e7: The attribute related to edge e1or one of its ancestors
are queried frequently. e8: T1, T2, , Tn are small.
e9: T1, T2, , Tn are large.e10: The number of multi-valued attributes is not known.e11: The number of multi-valued attributes is known.e12: The used tool is oracle, MS SQL, that works
better with star schema.
e13: The used tool is DB2, that works better withsnowflake schema.
e14: often the attribute is queried, is indexed.e15:often the attribute is queried, isnt indexed.e16: In most times, queries need to access T2, , Tn that
areouter level tables.e17: In most times, queries only need to access table T1 in
first level of tables.e18: e16e19: e17e20: Try and error.
e21: e12e22: e13e23: e12e24: e13
-
8/13/2019 Building Quick and Accurate Structure for Data Warehouse
3/3
International Journal of Advanced Computer Science, Vol. 2, No. 7, Pp. 274-276, Jul., 2012.
International Journal Publishers Group (IJPG)
276
D:Star schema
F: Snowflake schemaG: Star Cluster
H: Improved Star Cluster schema
R: Star schema or snowflake schema
N: Extended star schemaP: Extendedstar schema
3. TestsIn this section, all tests which show the framework is
effective regarding to all classic and research developedschemas [9] within different kind of queries, are presented.
The test bed used in this section includes multiple datawarehouses. The states that exist in decision tree werecreated in these data warehouse dimension tables andmultiple types of query were run. The system on which
queries run, has 2500Mhz CPU clock and 256 M-byte RAM.To implement these data warehouses and run queries, SQLserver 2000 and Query Analyzer were used. The required
data is generated by a C#.Net application. Queries run inthis test bed, are different from each other with respect tothe number of join operation needed to response them. In
most resources, query response time is the most importantcriteria to compare schemas in data warehouses. So in thispaper, query response time is the criteria used to evaluate
the framework and compare schemas.
4. ConclusionsBy using the represented framework, data warehouse
builders can choose the best schema for their datawarehouse based on the specified criteria and characteristicsof the application domain. Also, data warehouse researchers
can use this framework to evaluate, compare and extendexisting data schemas. This framework could be extendedtoo.
References
[1] B. McGehee, Use SET STATISTICS IO and SETSTATISTICS TIME to Help Tune Your SQL ServerQueries,Pipelines Newsletter, February 2005
[2] J. Han, M. Kamber, Data mining:Concepts and Techniques,New York, NY: Morgan Kaufman, 2001.[3] A. Sen, & A. P. Sinha, A Comparison Of Data Warehousing
Methodologies,Cummunications of the ACM, Vol. 48, No.
3, pp. 79-84, March 2005.[4] P. Gray, & H. J. Waston, Decision Support in the Data
Warehouse,Prentice Hall, Inc. 1998.[5] P. Lane, & V. Schupmann, Oracle9i Data Warehousing Guide,
Release 2 (9.2),Oracle Corporation, 2000[6] B. Heinsius, E.O.M. Data, Hilversum. The Netherlands,
Querying Star and Snowflake Schemas in SAS,Proceedings of SAS Conference: SUGI26, pp. 123-126,
22-25 April, Long Beach, California, 2001.[7] N. Tryfona, F. Busborg, & J.G.B. Christiansen, "starER: A
Conceptual Model for Data Warehouse Design,"Proceedingsof the International Workshop on Data Warehousing and
OLAP, Kansas City, MO., PP. 3-8, 1999.[8] A. Tsois, N. Karayannidis, & T. Sellis, "MAC:Conceptual Data
Modeling for OLAP," Proceedings of the InternationalWorkshop on Design and Management of Data Warehouses,Interlaken, Switzerland, June 4, 2001.
[9] D. Moody, & M. Kortnik, "From Enterprise Models toDimensional Models: A Methodology for Data Warehouse
and Data Mart Design," Proceedings of the InternationalWorkshop on Design and Management of Data Warehouses,5.1-5.12, Sweden, 2000.
[10] A. Gutierrez, & A. Marotta, An Overview of Data
Warehouse Design Approaches and Techniques, whitepaper, 2000.