building quick and accurate structure for data warehouse

Upload: ijeceditor

Post on 04-Jun-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/13/2019 Building Quick and Accurate Structure for Data Warehouse

    1/3

    International Journal of Advanced Computer Science, Vol. 2, No. 7, Pp. 274-276, Jul., 2012.

    ManuscriptReceived:

    17,Apr., 2012

    Revised:

    12, May,2012

    Accepted:

    28,Jun., 2012

    Published:15,Aug., 2012

    Keywords

    Datawarehouse,

    OLAP,

    ROLAP,

    Star Schema,

    Snowflake

    Schema,

    Abstract Data Warehouse is Described as aset of subject oriented, integration, non

    volatile and time variant system to support

    data decision system (DSS) and online

    response to manager analytical queries.

    These kind of queries are complicated and

    they done on the large amount of data and

    they used for data decision system and online

    analytical processing for data mining .Data

    warehouse and online analytical processing

    are necessary to support manager decision.In this research, first we describe data

    warehousing and consider some kind of data

    warehouse scheme and how should we choose

    one of them. Main goal of this research, is

    acceleration of selecting scheme in data

    warehouse design, for this goal first we

    consider all of related works and then we

    define an algorithm to select data warehouse

    scheme and finally we test accuracy of the

    algorithm .To implement this test, we use

    Microsoft c#.net and Sql server to design

    data warehouse and some assumed data

    1. IntroductionOne of the problems that exist related to data warehouse

    design, is lack of procedures to select appropriate schema.

    Available resources [1-3] investigated advantages anddisadvantages of different schemas. Some of them [2-5],solve some of the problems related to schemas and some ofothers [6-8] improved query response time. But none ofthese resources have represented the appropriate frameworkto select appropriate schema based on type of queries andtype of attributes.

    In available resources, [3], [5] Schema selection is based

    on personal opinion and business requirements. Also, thetool is used, widely affected schema selection. Some of

    tools like oracle and MS SQL have higher efficiency withstar schema; While DB2 works better with snowflakeschema. Environment is one of the factors affected schema

    selection too. For example if data warehouse is composed ofsome data marts, using star schema is better. With theseconditions, finding the appropriate schema is timeconsuming and is based on try and error. In fact we should

    start from completely normal snowflake schema, in eachtime, de-normalize one of the dimensions and measure the

    This work was supported by Department Of computer science, Sarvestan

    Branch, Islamic Azad University, sarvestan, Iran([email protected])

    efficiency. This work is repeated until the optimal

    compound schema is obtained.In fact, so far the above mentioned factors have affected

    schema selection in data warehouse design. These factors

    are necessary for schema selection,but arent sufficient andmay be lead to inappropriate schema selection and lowefficiency.

    To solve these problems and represent the appropriate

    way to schema selection that improves the efficiency and

    usability of data warehouse, in this paper, the newframework to select data schema for data warehouse isreported. In next section, this framework is described and inthe following section, all tests regarding to all classicschemas and some research developed schema [9] which

    show the framework is effective, are reported.

    2. Schema SelectionIn this section, we will represent a framework for

    appropriate schema selection in data warehouse design. Forthis purpose, Decision tree is used. The type of queries and

    attributes affected schema selection in this framework. Thetype of query depends on number of join operation neededto response it and type of attributes it access. The types of

    attributes are multi valued attributes, single-valuedattributes and indexed attributes.

    The structure of framework is formed as a decision treeand represented in Fig. 1.

    Fig. 1. schema selection framework

    We can state all paths in this decision tree as IF, THENstatements. All these statements have been tested andcorrectness of them was confirmed. In the following, we

    will show these statements.

    A. Case 1 If in some dimension tables, one attribute acts as aparent in two different hierarchies, Then

    Building Quick and Accurate Structure for Data

    WarehouseMaryam shekofteh

  • 8/13/2019 Building Quick and Accurate Structure for Data Warehouse

    2/3

    Maryam shekofteh:Building Quick and Accurate Structure for Data Warehouse.

    International Journal Publishers Group (IJPG)

    275

    o If this attribute or one of its ancestors are queriedfrequently, the framework propose Improved StarCluster schema [9].

    o Else Star Cluster schema [2] is used.B.

    Case 2

    If it is possible to normalize some of dimension tables,Then

    o If the result tables from normalization thesedimension tables are small, Then star schema andsnowflake schema works equally. So with consideringused tools, schema will be selected.

    If used tool is oracle, MS SQL, that worksbetter with star schema, the framework propose starschema.

    If used tool is DB2, that works better withsnowflake schema, the framework proposesnowflake schema.

    o Else if the attribute is queried, is indexed, theframework propose star schema.

    Else with try and error, the appropriate schema isselected.

    C. Case 3If it isnt possible to normalize the rest of tables, the

    framework propose star schema.

    D. Case 4 If there is multi-valued attribute in some dimensiontables, i.e. There are multiple values for one attribute

    corresponding to single value for other attribute, theno If the number of multi-valued attributes is known,then

    If in most times, queries only need to access tableT1 in first level of tables that resulted fromnormalizing this dimension, then there is nodifference between star schema and snowflakeschema. So with respect to used tool, we could select

    data schema. Therefore

    If used tool is DB2, that works betterwith snowflake schema, the framework proposesnowflake schema

    If used tool is used tool is oracle, MSSQL, that works better with star schema, theframework propose extended star schema [4].

    If in most times, queries need to access outerlevel tables, Then the framework propose extendedstar schema [4]

    o If the number of multi-valued attributes is notknown, then

    If in most times, queries only need to access tableT1 in first level of tables that resulted fromnormalizing this dimension, then there is nodifference between star schema and snowflake

    schema. So with respect to used tool, we could selectdata schema. Therefore

    If used tool is DB2, that works betterwith snowflake schema, Then the frameworkpropose snowflake schema.

    If used tool is used tool is oracle, MSSQL, that works better with star schema, Then

    the framework propose extended star schema [4]. If in most times, queries need to access outerlevel tables, then the framework propose extendedstar schema [4].

    E. Case 5If conditions that Kimball states in [10] are true, then

    using snowflake schema is better. Kimball often prefers touse star schema because of its simplicity and efficiency. Buthe said in certain situations, snowflake schema is not onlyacceptable, but recommended [10]. These situations are the

    cases that there are many null values in large abnormal

    dimension tables. In these situations, variations ofsnowflake schemas can be useful.

    If multiple of above conditions are true, by combiningthe results of each condition, the final schema will beobtained. In the following, we will show the conditions

    related to every edges in this decision tree.We assume if dimension table T is normalized, T1, T2, ,

    Tnwill be resulted.

    e1: An attribute acts as a parent in two differentdimensional hierarchies.

    e2:It is possible to normalize some of dimension tables.e3: It isnt possible to normalize the rest of tables.

    e4:There is multi-valued attribute in some dimensiontables.

    e5: The conditions that Kimball states in [10] are true.e6: The attribute related to edge e1or one of its ancestors

    arent queried frequently.e7: The attribute related to edge e1or one of its ancestors

    are queried frequently. e8: T1, T2, , Tn are small.

    e9: T1, T2, , Tn are large.e10: The number of multi-valued attributes is not known.e11: The number of multi-valued attributes is known.e12: The used tool is oracle, MS SQL, that works

    better with star schema.

    e13: The used tool is DB2, that works better withsnowflake schema.

    e14: often the attribute is queried, is indexed.e15:often the attribute is queried, isnt indexed.e16: In most times, queries need to access T2, , Tn that

    areouter level tables.e17: In most times, queries only need to access table T1 in

    first level of tables.e18: e16e19: e17e20: Try and error.

    e21: e12e22: e13e23: e12e24: e13

  • 8/13/2019 Building Quick and Accurate Structure for Data Warehouse

    3/3

    International Journal of Advanced Computer Science, Vol. 2, No. 7, Pp. 274-276, Jul., 2012.

    International Journal Publishers Group (IJPG)

    276

    D:Star schema

    F: Snowflake schemaG: Star Cluster

    H: Improved Star Cluster schema

    R: Star schema or snowflake schema

    N: Extended star schemaP: Extendedstar schema

    3. TestsIn this section, all tests which show the framework is

    effective regarding to all classic and research developedschemas [9] within different kind of queries, are presented.

    The test bed used in this section includes multiple datawarehouses. The states that exist in decision tree werecreated in these data warehouse dimension tables andmultiple types of query were run. The system on which

    queries run, has 2500Mhz CPU clock and 256 M-byte RAM.To implement these data warehouses and run queries, SQLserver 2000 and Query Analyzer were used. The required

    data is generated by a C#.Net application. Queries run inthis test bed, are different from each other with respect tothe number of join operation needed to response them. In

    most resources, query response time is the most importantcriteria to compare schemas in data warehouses. So in thispaper, query response time is the criteria used to evaluate

    the framework and compare schemas.

    4. ConclusionsBy using the represented framework, data warehouse

    builders can choose the best schema for their datawarehouse based on the specified criteria and characteristicsof the application domain. Also, data warehouse researchers

    can use this framework to evaluate, compare and extendexisting data schemas. This framework could be extendedtoo.

    References

    [1] B. McGehee, Use SET STATISTICS IO and SETSTATISTICS TIME to Help Tune Your SQL ServerQueries,Pipelines Newsletter, February 2005

    [2] J. Han, M. Kamber, Data mining:Concepts and Techniques,New York, NY: Morgan Kaufman, 2001.[3] A. Sen, & A. P. Sinha, A Comparison Of Data Warehousing

    Methodologies,Cummunications of the ACM, Vol. 48, No.

    3, pp. 79-84, March 2005.[4] P. Gray, & H. J. Waston, Decision Support in the Data

    Warehouse,Prentice Hall, Inc. 1998.[5] P. Lane, & V. Schupmann, Oracle9i Data Warehousing Guide,

    Release 2 (9.2),Oracle Corporation, 2000[6] B. Heinsius, E.O.M. Data, Hilversum. The Netherlands,

    Querying Star and Snowflake Schemas in SAS,Proceedings of SAS Conference: SUGI26, pp. 123-126,

    22-25 April, Long Beach, California, 2001.[7] N. Tryfona, F. Busborg, & J.G.B. Christiansen, "starER: A

    Conceptual Model for Data Warehouse Design,"Proceedingsof the International Workshop on Data Warehousing and

    OLAP, Kansas City, MO., PP. 3-8, 1999.[8] A. Tsois, N. Karayannidis, & T. Sellis, "MAC:Conceptual Data

    Modeling for OLAP," Proceedings of the InternationalWorkshop on Design and Management of Data Warehouses,Interlaken, Switzerland, June 4, 2001.

    [9] D. Moody, & M. Kortnik, "From Enterprise Models toDimensional Models: A Methodology for Data Warehouse

    and Data Mart Design," Proceedings of the InternationalWorkshop on Design and Management of Data Warehouses,5.1-5.12, Sweden, 2000.

    [10] A. Gutierrez, & A. Marotta, An Overview of Data

    Warehouse Design Approaches and Techniques, whitepaper, 2000.