Data Warehouse Systems || Logical Data Warehouse Design

Download Data Warehouse Systems || Logical Data Warehouse Design

Post on 22-Feb-2017

224 views

Category:

Documents

5 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>Chapter 5</p><p>Logical Data Warehouse Design</p><p>Conceptual models are useful to design database applications since theyfavor the communication between the stakeholders in a project. However,conceptual models must be translated into logical ones for their implemen-tation on a database management system. In this chapter, we study howthe conceptual multidimensional model studied in the previous chapter canbe represented in the relational model. We start in Sect. 5.1 by describingthe three logical models for data warehouses, namely, relational OLAP(ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP).In Sect. 5.2, we focus on the relational representation of data warehousesand study four typical implementations: the star, snowflake, starflake, andconstellation schemas. In Sect. 5.3, we present the rules for mapping aconceptual multidimensional model (in our case, the MultiDim model) to therelational model. Section 5.4 discusses how to represent the time dimension.Sections 5.5 and 5.6 study how hierarchies, facts with multiple granularities,and many-to-many dimensions can be implemented in the relational model.Section 5.7 is devoted to the study of slowly changing dimensions, whicharise when dimensions in a data warehouse are updated. In Sect. 5.8, westudy how a data cube can be represented in the relational model and how itcan be queried in SQL using the SQL/OLAP extension. Finally, to show howthese concepts are applied in practice, in Sects. 5.9 and 5.10, we show howthe Northwind cube can be implemented, respectively, in Microsoft AnalysisServices and in Mondrian.</p><p>5.1 Logical Modeling of Data Warehouses</p><p>There are several approaches for implementing a multidimensional model,depending on how the data cube is stored. These approaches are:</p><p>A. Vaisman and E. Zimanyi, Data Warehouse Systems, Data-CentricSystems and Applications, DOI 10.1007/978-3-642-54655-6 5, Springer-Verlag Berlin Heidelberg 2014</p><p>121</p></li><li><p>122 5 Logical Data Warehouse Design</p><p> Relational OLAP (ROLAP), which stores data in relational databasesand supports extensions to SQL and special access methods to efficientlyimplement the multidimensional data model and the related operations.</p><p> Multidimensional OLAP (MOLAP), which stores data in specializedmultidimensional data structures (e.g., arrays) and implements the OLAPoperations over those data structures.</p><p> Hybrid OLAP (HOLAP), which combines both approaches.</p><p>In ROLAP systems, multidimensional data are stored in relational tables.Further, in order to increase performance, aggregates are also precomputedin relational tables (we will study aggregate computation in Chap. 7). Theseaggregates, together with indexing structures, take a large space from thedatabase. Moreover, since multidimensional data reside in relational tables,OLAP operations must be performed on such tables, yielding usually complexSQL statements. Finally, in ROLAP systems, all data management relies onthe underlying relational DBMS. This has several advantages since relationaldatabases are well standardized and provide a large storage capacity.</p><p>In MOLAP systems, data cubes are stored in multidimensional arrays,combined with hashing and indexing techniques. Therefore, the OLAPoperations can be implemented efficiently, since such operations are verynatural and simple to perform. Data management in MOLAP is performed bythe multidimensional engine, which generally provides less storage capacitythan ROLAP systems. Normally, typical index structures (e.g., B-trees, orR-trees) are used to index sparse dimensions (e.g., a product or a storedimension), and dense dimensions (like the time dimension) are stored inlists of multidimensional arrays. Each leaf node of the index tree points tosuch arrays, providing efficient cube querying and storage, since the index ingeneral fits in main memory. Normally, MOLAP systems are used to querydata marts where the number of dimensions is relatively small (less than ten,as a popular rule of thumb). For high-dimensionality data, ROLAP systemsare used. Finally, MOLAP systems are proprietary, which reduces portability.</p><p>While MOLAP systems offer less storage capacity than ROLAP systems,they provide better performance when multidimensional data are queriedor aggregated. Thus, HOLAP systems benefit from the storage capacity ofROLAP and the processing capabilities of MOLAP. For example, a HOLAPserver may store large volumes of detailed data in a relational database, whileaggregations are kept in a separate MOLAP store.</p><p>Current OLAP tools support a combination of the above models. Neverthe-less, most of these tools rely on an underlying data warehouse implementedon a relational database management system. For this reason, in what follows,we study the relational OLAP implementation in detail.</p></li><li><p>5.2 Relational Data Warehouse Design 123</p><p>5.2 Relational Data Warehouse Design</p><p>One possible relational representation of the multidimensional model is basedon the star schema, where there is one central fact table, and a set ofdimension tables, one for each dimension. An example is given in Fig. 5.1,where the fact table is depicted in gray and the dimension tables are depictedin white. The fact table contains the foreign keys of the related dimensiontables, namely, ProductKey, StoreKey, PromotionKey, and TimeKey, and themeasures, namely, Amount and Quantity. As shown in the figure, referentialintegrity constraints are specified between the fact table and each of thedimension tables.</p><p>Time</p><p>TimeKeyDateEventWeekdayFlagWeekendFlagSeason...</p><p>Product</p><p>ProductKeyProductNumber ProductNameDescriptionSizeCategoryNameCategoryDescrDepartmentNameDepartmentDescr...</p><p>Sales</p><p>ProductKeyStoreKeyPromotionKeyTimeKeyAmountQuantity</p><p>Store</p><p>StoreKeyStoreNumberStoreNameStoreAddressManagerNameCityNameCityPopulationCityAreaStateNameStatePopulationStateAreaStateMajorActivity...Promotion</p><p>PromotionKeyPromotionDescrDiscountPercTypeStartDateEndDate...</p><p>Fig. 5.1 An example of a star schema</p><p>In a star schema, the dimension tables are, in general, not normalized.Therefore, they may contain redundant data, especially in the presenceof hierarchies. This is the case for dimension Product in Fig. 5.1 since allproducts belonging to the same category will have redundant information forthe attributes describing the category and the department. The same occursin dimension Store with the attributes describing the city and the state.</p><p>On the other hand, fact tables are usually normalized: their key is the unionof the foreign keys since this union functionally determines all the measures,while there is no functional dependency between the foreign key attributes.In Fig. 5.1, the fact table Sales is normalized, and its key is composed byProductKey, StoreKey, PromotionKey, and TimeKey.</p></li><li><p>124 5 Logical Data Warehouse Design</p><p>A snowflake schema avoids the redundancy of star schemas by normal-izing the dimension tables. Therefore, a dimension is represented by severaltables related by referential integrity constraints. In addition, as in thecase of star schemas, referential integrity constraints also relate the fact tableand the dimension tables at the finest level of detail.</p><p>Time</p><p>TimeKeyDateEventWeekdayFlagWeekendFlagSeason...</p><p>Category</p><p>CategoryKeyCategoryNameDescriptionDepartmentKey...</p><p>Department</p><p>DepartmentKeyDepartmentNameDescription...</p><p>ProductKeyProductNumber ProductNameDescriptionSizeCategoryKey...</p><p>Sales</p><p>ProductKeyStoreKeyPromotionKeyTimeKeyAmountQuantity</p><p>Store</p><p>StoreKeyStoreNumberStoreNameStoreAddressManagerNameCityKey...</p><p>City</p><p>CityKeyCityNameCityPopulationCityAreaStateKey...</p><p>State</p><p>StateKeyStateNameStatePopulationStateAreaStateMajorActivity...</p><p>Promotion</p><p>PromotionKeyPromotionDescrDiscountPercTypeStartDateEndDate...</p><p>Product</p><p>Fig. 5.2 An example of a snowflake schema</p><p>An example of a snowflake schema is given in Fig. 5.2. Here, the facttable is exactly the same as in Fig. 5.1. However, the dimensions Productand Store are now represented by normalized tables. For example, in theProduct dimension, the information about categories has been moved to thetable Category, and only the attribute CategoryKey remained in the originaltable. Thus, only the value of this key is repeated for each product of the samecategory, but the information about a category will only be stored once, intable Category. Normalized tables are easy to maintain and optimize storagespace. However, performance is affected since more joins need to be performedwhen executing queries that require hierarchies to be traversed. For example,the query Total sales by category for the star schema in Fig. 5.1 reads inSQL as follows:</p></li><li><p>5.2 Relational Data Warehouse Design 125</p><p>SELECT CategoryName, SUM(Amount)FROM Product P, Sales SWHERE P.ProductKey = S.ProductKeyGROUP BY CategoryName</p><p>while in the snowflake schema in Fig. 5.2, we need an extra join, as follows:</p><p>SELECT CategoryName, SUM(Amount)FROM Product P, Category C, Sales SWHERE P.ProductKey = S.ProductKey AND P.CategoryKey = C.CategoryKeyGROUP BY CategoryName</p><p>A starflake schema is a combination of the star and the snowflakeschemas, where some dimensions are normalized while others are not. Wewould have a starflake schema if we replace the tables Product, Category, andDepartment in Fig. 5.2, by the dimension table Product of Fig. 5.1, and leaveall other tables in Fig. 5.2 (like dimension table Store) unchanged.</p><p>Purchases</p><p>ProductKeySupplierKeyOrderTimeKeyDueTimeKeyAmountQuantityFreightCost</p><p>Time</p><p>TimeKeyDateEventWeekdayFlagWeekendFlagSeason...</p><p>Product</p><p>ProductKeyProductNumber ProductNameDescriptionSizeCategoryNameCategoryDescrDepartmentNameDepartmentDescr...</p><p>Sales</p><p>ProductKeyStoreKeyPromotionKeyTimeKeyAmountQuantity</p><p>Store</p><p>StoreKeyStoreNumberStoreNameStoreAddressManagerNameCityNameCityPopulationCityAreaStateNameStatePopulationStateAreaStateMajorActivity...</p><p>Promotion</p><p>PromotionKeyPromotionDescrDiscountPercTypeStartDateEndDate...</p><p>Supplier</p><p>SupplierKeySupplierNameContactPersonSupplierAddressCityNameStateName...</p><p>Fig. 5.3 An example of a constellation schema</p><p>Finally, a constellation schema has multiple fact tables that sharedimension tables. The example given in Fig. 5.3 has two fact tables Sales and</p></li><li><p>126 5 Logical Data Warehouse Design</p><p>Purchases sharing the Time and Product dimension. Constellation schemasmay include both normalized and denormalized dimension tables.</p><p>We will discuss further star and snowflake schemas when we study logicalrepresentation of hierarchies later in this chapter.</p><p>5.3 Relational Implementation of the ConceptualModel</p><p>In Chap. 2, we presented a set of rules that can be applied to translate anER model to the relational model. Analogously, we can define a set of rulesto translate the conceptual model we use in this book (the MultiDim model)into the relational model using either the star or snowflake schema. In thissection and the following one, we study such mapping rules.</p><p>Since the MultiDim model is based on the ER model, its mapping to therelational model is based on the rules described in Sect. 2.4.1, as follows:</p><p>Rule 1: A level L, provided it is not related to a fact with a one-to-onerelationship, is mapped to a table TL that contains all attributes of thelevel. A surrogate key may be added to the table; otherwise, the identifierof the level will be the key of the table. Note that additional attributes willbe added to this table when mapping relationships using Rule 3 below.</p><p>Rule 2: A fact F is mapped to a table TF that includes as attributes allmeasures of the fact. Further, a surrogate key may be added to the table.Note that additional attributes will be added to this table when mappingrelationships using Rule 3 below.</p><p>Rule 3: A relationship between either a fact F and a dimension level L, orbetween dimension levels LP and LC (standing for the parent and childlevels, respectively), can be mapped in three different ways, depending onits cardinalities:</p><p>Rule 3a: If the relationship is one-to-one, the table corresponding to thefact (TF ) or to the child level (TC) is extended with all the attributesof the dimension level or the parent level, respectively.</p><p>Rule 3b: If the relationship is one-to-many, the table corresponding tothe fact (TF ) or to the child level (TC) is extended with the surrogatekey of the table corresponding to the dimension level (TL) or the parentlevel (TP ), respectively, that is, there is a foreign key in the fact or childtable pointing to the other table.</p><p>Rule 3c: If the relationship is many-to-many, a new table TB (standingfor bridge table) is created that contains as attributes the surrogate keysof the tables corresponding to the fact (TF ) and the dimension level(TL), or the parent (TP ) and child levels (TC), respectively. The key ofthe table is the combination of both surrogate keys. If the relationship</p></li><li><p>5.3 Relational Implementation of the Conceptual Model 127</p><p>has a distributing attribute, an additional attribute is added to thetable to store this information.</p><p>In the above rules, surrogate keys are generated for each dimension levelin a data warehouse. The main reason for this is to provide independencefrom the keys of the underlying source systems because such keys can changeacross time. Another advantage of this solution is that surrogate keys areusually represented as integers in order to increase efficiency, whereas keysfrom source systems may be represented in less efficient data types such asstrings. Nevertheless, the keys coming from the source systems should alsobe kept in the dimensions to be able to match data from sources with datain the warehouse. Obviously, an alternative solution is to reuse the keys fromthe source systems in the data warehouse.</p><p>Notice that a fact table obtained by the mapping rules above will containthe surrogate key of each level related to the fact with a one-to-manyrelationship, one for each role that the level is playing. The key of the table iscomposed of the surrogate keys of all the participating levels. Alternatively,if a surrogate key is added to the fact table, the combination of the surrogatekeys of all the participating levels becomes an alternate key.</p><p>As we will see in Sect. 5.5, more specialized rules are needed for mappingthe various kinds of hierarchies that we studied in Chap. 4.</p><p>Applying the above rules to the Northwind conceptual data cube givenin Fig. 4.2 yields the tables shown in Fig. 5.4. The Sales table includes eightforeign keys, that is, one for each level related to the fact with a one-to-many relationship. Recall from Chap. 4 that in role-playing dimensions,a dimension plays several roles. This is the case for the dimension Timewhere, in the relational model, each role will be represented by a foreignkey. Thus, OrderDateKey, DueDateKey, and ShippedDateKey are foreign keysto the Time dimension table in Fig. 5.4. Note also that dimension Order isrelated to the fact with a one-to-one relationship. Therefore, the attributesof the dimension are included as part of the fact table. For this reason, sucha dimension is called a fact (or degenerate) dimension. The fact tablealso contains five attributes represen...</p></li></ul>