building quick and accurate structure for data warehouse

8/13/2019 Building Quick and Accurate Structure for Data Warehouse

1/3

International Journal of Advanced Computer Science, Vol. 2, No. 7, Pp. 274-276, Jul., 2012.

ManuscriptReceived:

17,Apr., 2012

Revised:

12, May,2012

Accepted:

28,Jun., 2012

Published:15,Aug., 2012

Keywords

Datawarehouse,

OLAP,

ROLAP,

Star Schema,

Snowflake

Schema,

Abstract Data Warehouse is Described as aset of subject oriented, integration, non

volatile and time variant system to support

data decision system (DSS) and online

response to manager analytical queries.

These kind of queries are complicated and

they done on the large amount of data and

they used for data decision system and online

analytical processing for data mining .Data

warehouse and online analytical processing

are necessary to support manager decision.In this research, first we describe data

warehousing and consider some kind of data

warehouse scheme and how should we choose

one of them. Main goal of this research, is

acceleration of selecting scheme in data

warehouse design, for this goal first we

consider all of related works and then we

define an algorithm to select data warehouse

scheme and finally we test accuracy of the

algorithm .To implement this test, we use

Microsoft c#.net and Sql server to design

data warehouse and some assumed data

1. IntroductionOne of the problems that exist related to data warehouse

design, is lack of procedures to select appropriate schema.

Available resources [1-3] investigated advantages anddisadvantages of different schemas. Some of them [2-5],solve some of the problems related to schemas and some ofothers [6-8] improved query response time. But none ofthese resources have represented the appropriate frameworkto select appropriate schema based on type of queries andtype of attributes.

In available resources, [3], [5] Schema selection is based

on personal opinion and business requirements. Also, thetool is used, widely affected schema selection. Some of

tools like oracle and MS SQL have higher efficiency withstar schema; While DB2 works better with snowflakeschema. Environment is one of the factors affected schema

selection too. For example if data warehouse is composed ofsome data marts, using star schema is better. With theseconditions, finding the appropriate schema is timeconsuming and is based on try and error. In fact we should

start from completely normal snowflake schema, in eachtime, de-normalize one of the dimensions and measure the

This work was supported by Department Of computer science, Sarvestan

Branch, Islamic Azad University, sarvestan, Iran([email protected])

efficiency. This work is repeated until the optimal

compound schema is obtained.In fact, so far the above mentioned factors have affected

schema selection in data warehouse design. These factors

are necessary for schema selection,but arent sufficient andmay be lead to inappropriate schema selection and lowefficiency.

To solve these problems and represent the appropriate

way to schema selection that improves the efficiency and

usability of data warehouse, in this paper, the newframework to select data schema for data warehouse isreported. In next section, this framework is described and inthe following section, all tests regarding to all classicschemas and some research developed schema [9] which

show the framework is effective, are reported.

2. Schema SelectionIn this section, we will represent a framework for

appropriate schema selection in data warehouse design. Forthis purpose, Decision tree is used. The type of queries and

attributes affected schema selection in this framework. Thetype of query depends on number of join operation neededto response it and type of attributes it access. The types of

attributes are multi valued attributes, single-valuedattributes and indexed attributes.

The structure of framework is formed as a decision treeand represented in Fig. 1.

Fig. 1. schema selection framework

We can state all paths in this decision tree as IF, THENstatements. All these statements have been tested andcorrectness of them was confirmed. In the following, we

will show these statements.

A. Case 1 If in some dimension tables, one attribute acts as aparent in two different hierarchies, Then

Building Quick and Accurate Structure for Data

WarehouseMaryam shekofteh


2/3

Maryam shekofteh:Building Quick and Accurate Structure for Data Warehouse.

International Journal Publishers Group (IJPG)

275

o If this attribute or one of its ancestors are queriedfrequently, the framework propose Improved StarCluster schema [9].

o Else Star Cluster schema [2] is used.B.

Case 2

If it is possible to normalize some of dimension tables,Then

o If the result tables from normalization thesedimension tables are small, Then star schema andsnowflake schema works equally. So with consideringused tools, schema will be selected.

If used tool is oracle, MS SQL, that worksbetter with star schema, the framework propose starschema.

If used tool is DB2, that works better withsnowflake schema, the framework proposesnowflake schema.

o Else if the attribute is queried, is indexed, theframework propose star schema.

Else with try and error, the appropriate schema isselected.

C. Case 3If it isnt possible to normalize the rest of tables, the

framework propose star schema.

D. Case 4 If there is multi-valued attribute in some dimensiontables, i.e. There are multiple values for one attribute

corresponding to single value for other attribute, theno If the number of multi-valued attributes is known,then

If in most times, queries only need to access tableT1 in first level of tables that resulted fromnormalizing this dimension, then there is nodifference between star schema and snowflakeschema. So with respect to used tool, we could select

data schema. Therefore

If used tool is DB2, that works betterwith snowflake schema, the framework proposesnowflake schema

If used tool is used tool is oracle, MSSQL, that works better with star schema, theframework propose extended star schema [4].

If in most times, queries need to access outerlevel tables, Then the framework propose extendedstar schema [4]

o If the number of multi-valued attributes is notknown, then

If in most times, queries only need to access tableT1 in first level of tables that resulted fromnormalizing this dimension, then there is nodifference between star schema and snowflake

schema. So with respect to used tool, we could selectdata schema. Therefore

If used tool is DB2, that works betterwith snowflake schema, Then the frameworkpropose snowflake schema.

If used tool is used tool is oracle, MSSQL, that works better with star schema, Then

the framework propose extended star schema [4]. If in most times, queries need to access outerlevel tables, then the framework propose extendedstar schema [4].

E. Case 5If conditions that Kimball states in [10] are true, then

using snowflake schema is better. Kimball often prefers touse star schema because of its simplicity and efficiency. Buthe said in certain situations, snowflake schema is not onlyacceptable, but recommended [10]. These situations are the

cases that there are many null values in large abnormal

dimension tables. In these situations, variations ofsnowflake schemas can be useful.

If multiple of above conditions are true, by combiningthe results of each condition, the final schema will beobtained. In the following, we will show the conditions

related to every edges in this decision tree.We assume if dimension table T is normalized, T1, T2, ,

Tnwill be resulted.

e1: An attribute acts as a parent in two differentdimensional hierarchies.

e2:It is possible to normalize some of dimension tables.e3: It isnt possible to normalize the rest of tables.

e4:There is multi-valued attribute in some dimensiontables.

e5: The conditions that Kimball states in [10] are true.e6: The attribute related to edge e1or one of its ancestors

arent queried frequently.e7: The attribute related to edge e1or one of its ancestors

are queried frequently. e8: T1, T2, , Tn are small.

e9: T1, T2, , Tn are large.e10: The number of multi-valued attributes is not known.e11: The number of multi-valued attributes is known.e12: The used tool is oracle, MS SQL, that works

better with star schema.

e13: The used tool is DB2, that works better withsnowflake schema.

e14: often the attribute is queried, is indexed.e15:often the attribute is queried, isnt indexed.e16: In most times, queries need to access T2, , Tn that

areouter level tables.e17: In most times, queries only need to access table T1 in

first level of tables.e18: e16e19: e17e20: Try and error.

e21: e12e22: e13e23: e12e24: e13


3/3

International Journal of Advanced Computer Science, Vol. 2, No. 7, Pp. 274-276, Jul., 2012.

International Journal Publishers Group (IJPG)

276

D:Star schema

F: Snowflake schemaG: Star Cluster

H: Improved Star Cluster schema

R: Star schema or snowflake schema

N: Extended star schemaP: Extendedstar schema

3. TestsIn this section, all tests which show the framework is

effective regarding to all classic and research developedschemas [9] within different kind of queries, are presented.

The test bed used in this section includes multiple datawarehouses. The states that exist in decision tree werecreated in these data warehouse dimension tables andmultiple types of query were run. The system on which

queries run, has 2500Mhz CPU clock and 256 M-byte RAM.To implement these data warehouses and run queries, SQLserver 2000 and Query Analyzer were used. The required

data is generated by a C#.Net application. Queries run inthis test bed, are different from each other with respect tothe number of join operation needed to response them. In

most resources, query response time is the most importantcriteria to compare schemas in data warehouses. So in thispaper, query response time is the criteria used to evaluate

the framework and compare schemas.

4. ConclusionsBy using the represented framework, data warehouse

builders can choose the best schema for their datawarehouse based on the specified criteria and characteristicsof the application domain. Also, data warehouse researchers

can use this framework to evaluate, compare and extendexisting data schemas. This framework could be extendedtoo.

References

[1] B. McGehee, Use SET STATISTICS IO and SETSTATISTICS TIME to Help Tune Your SQL ServerQueries,Pipelines Newsletter, February 2005

[2] J. Han, M. Kamber, Data mining:Concepts and Techniques,New York, NY: Morgan Kaufman, 2001.[3] A. Sen, & A. P. Sinha, A Comparison Of Data Warehousing

Methodologies,Cummunications of the ACM, Vol. 48, No.

3, pp. 79-84, March 2005.[4] P. Gray, & H. J. Waston, Decision Support in the Data

Warehouse,Prentice Hall, Inc. 1998.[5] P. Lane, & V. Schupmann, Oracle9i Data Warehousing Guide,

Release 2 (9.2),Oracle Corporation, 2000[6] B. Heinsius, E.O.M. Data, Hilversum. The Netherlands,

Querying Star and Snowflake Schemas in SAS,Proceedings of SAS Conference: SUGI26, pp. 123-126,

22-25 April, Long Beach, California, 2001.[7] N. Tryfona, F. Busborg, & J.G.B. Christiansen, "starER: A

Conceptual Model for Data Warehouse Design,"Proceedingsof the International Workshop on Data Warehousing and

OLAP, Kansas City, MO., PP. 3-8, 1999.[8] A. Tsois, N. Karayannidis, & T. Sellis, "MAC:Conceptual Data

Modeling for OLAP," Proceedings of the InternationalWorkshop on Design and Management of Data Warehouses,Interlaken, Switzerland, June 4, 2001.

[9] D. Moody, & M. Kortnik, "From Enterprise Models toDimensional Models: A Methodology for Data Warehouse

and Data Mart Design," Proceedings of the InternationalWorkshop on Design and Management of Data Warehouses,5.1-5.12, Sweden, 2000.

[10] A. Gutierrez, & A. Marotta, An Overview of Data

Warehouse Design Approaches and Techniques, whitepaper, 2000.

building quick and accurate structure for data warehouse

Documents