unit 4. data warehousing data warehousing...
TRANSCRIPT
UNIT 4. DATA WAREHOUSING
Data Warehousing Components -Multi Dimensional Data Model- Data Warehouse Architecture-Data Warehouse Implementation- -Mapping the Data Warehouse to Multiprocessor Architecture- OLAP.-Need- Categorization of OLAP Tools.
4.1 Data Warehousing Components
4.1.1 Definition:” A data warehouse is a subject- oriented, integrated, time-Variant and non-volatile collection of data in support of management decision making process.”
•Subject-Oriented: Stored data targets specific subjects.
Example:It may store data regarding total Sales, Number of Customers, etc. and not general data on everyday operations.
•Integrated:Data may be distributed across heterogeneous sources which haveto be integrated.
Example: Sales data may be on RDB, Customer information on Flatfiles, etc.
•Time Variant:Data stored may not be current but varies with time and data have an element of time.
Example: Data of sales in last 5 years, etc.
•Non-Volatile:It is separate from the Enterprise Operational Database and hence is not subject to frequent modification. It generally has only 2operations performed on it: Loading of dataand Access of data.
4.1.2 Features of a Warehouse:
It is separate from Operational Database.
•Integrates data from heterogeneous systems.
•Stores HUGE amount of data, more historical than current data.
•Does not require data to be highly accurate.
•Queries are generally complex.
•Goal is to execute statistical queries and provide results whichcan influence decision making in favor of the Enterprise.
•These systems are thus called Online Analytical Processing Systems (OLAP).
4.1.3 Three Main reasons for having a separate data base
1. OLTP systems require high concurrency, reliability, locking which provide good performance for short and simple OLTP queries. An OLAP query is very complex and does not require these properties. Use of OLAP query on OLTP system degrades its performance.
2. An OLAP query reads HUGE amount of data and generates the required result. The query is very complex too. Thus special primitives have to provide to support this kind of data access.
3. OLAP systems access historical data and not current volatile data while OLTP systems access current up-to-date data and do not need historical data.
4.1.4 Difference between OLTP and OLAP
Feature OLTP OLAPusers clerk, IT professional knowledge workerfunction day to day operations decision supportDB design application-oriented subject-orienteddata current, up-to-date
detailed, flat relationalisolated
historical, summarized, multidimensionalintegrated, consolidated
usage repetitive ad-hocaccess read/write
index/hash on prim. keylots of scans
unit of work short, simple transaction complex query# records accessed tens millions#users thousands hundredsDB size 100MB-GB 100GB-TBmetric transaction throughput query throughput, response
4.1.5 How do we represent this data???
This multi-dimensional data can be represented using a data cube as shown below.
Or
This figure shows a 3-Dimensional dataModel.X –Dimension : Item typeY –Dimension : Time/PeriodZ –Dimension : Location
Each cell represents the items sold of type ‘x’, in location ‘z’during the quarter‘y’. •Data cube is thus a n -dimensional data model
4.1.6 Schemas used for Multidimensional model
Star schema
There is a central large Fact table with no redundancy
•Each tuple in the fact table has a foreign key to a dimension table which describes the details of that dimension
Problem: Redundancy•Values of city, province_or_state and country would be repeated for two streets in the same city.Thus we can normalize the table by splitting location into sub tables. (Snowflake Schema)
Advantage: PerformanceAs less number of joins required
Snowflake schemaSome of the dimension tables are normalized thus splitting data into additional tables.
Problem: Performance•Too many joins required to form the result.
Thus Snowflake schema is not as popular as the Star schema.
Fact Constellation schema
Two or more fact tables share dimension tables. In the figure below the ‘Sales’ fact table and ‘Shipping’ fact table Share the dimension tables As
As the multiple fact tables are linked to each other by dimension tables, its called as fact Constellation schema
Data Mining Query Language (DMQL)
Syntax:
define cube <cube name>[<dimension list>]:<measure list>
define dimension <dimension name> as (<attribute list>)
Star Schema Example:
Fact Table:define cube sales_star [time,item,branch,location]:dollars_sold=sum(sales_in_dollars),units_sold=count(*)Dimensions:define dimension time as (time_key,day,day_of_week,month,quarter,year)
define dimension item as (item_key,item_name,brand,type)
define dimension branch as (branch_key,branch_name,branch_type)
define dimension location as (location_key,street,city,province,country)
Defining a hierarchy of dimension tables for snowflake schema.
define dimension location as (location_key,street,city(city_key,city,province,country))
Defining a shared dimension table for Fact Constellation Schema
define dimension time as (time_key,day,day_of_week,month,quarter,year)define dimension time as time in cube sales
4.2 Data warehousing architecture
Approaches for data warehousing
1. Top-Down Approach�Starts with overall design and planning�Useful where the technology is mature and well known�Useful where the business problems to be solved are clear and well understood
2. Bottom-Up Approach�Starts with experiment and prototypes�Allows an organization to move forward at considerably less expense�Allows to evaluate the benefits of the technology before making significant commitments
3. Combined Approach�Exploits the planned and strategic nature of the top-down approach�Retains the rapid implementation and opportunistic application of the bottom-up approach
4.2.1 Four views of Data warehouse
Top-down, bottom-up approaches or a combination of both Top-down : Starts with overall design and planning (mature) Bottom-up : Starts with experiments and prototypes (rapid)
From software engineering point of view Waterfal l: structured and systematic analysis at each step before
proceeding to the next Spiral : rapid generation of increasingly functional systems, short turn
around time, quick turn around Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
4.2.2 Three tier architecture
Data Mart
Contains a subset of corporate-wide data that is of value to a specific group of users. Scope is confined to specific selected subjects
�Implemented on low-cost departmental servers that are UNIX or Windows NT based. Implementation cycle is measured in weeks rather than months or years
�They are further classified as:
Independent•Sourced from data captured from one or more operational systems or external information providers•Sourced from data generated locally within a particular department or geographic area
Dependent•Sourced directly from enterprise data warehouses
Virtual Warehouse
�It is a Set of Views over operational databases
�For efficient query processing; only some of the possible summary views may be materialized
�It is easy to build but requires excess capacity on operational database servers
August 17, 2010 Data Mining: Concepts and Techniques 16
Data Warehouse Development: A Recommended Approach
Define a high -level corporate data model
Data Mart
Data Mart
Distributed Data Marts
Multi -Tier Data Warehouse
Enterprise Data Warehouse
Model refinementModel refinement
4.2.3 Data Warehouse Back-End Tools and Utilities
Data extractionget data from multiple, heterogeneous, and external sources
Data cleaningdetect errors in the data and rectify them when possible
Data transformationconvert data from legacy or host format to warehouse format
Loadsort, summarize, consolidate, compute views, check integrity, and build indicies and partitions
Refreshpropagate the updates from the data sources to the warehouse
4.2.4 Metadata Repository
Meta data is the data defining warehouse objects. It stores: Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents
Operational meta-datadata lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization The mapping from operational environment to the data warehouse Data related to system performance
Warehouse schema, view and derived data definitions Business data
Business terms and definitions, ownership of data, charging policies
4.2.5 OLAP Server Architectures
Relational OLAP (ROLAP) Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services Greater scalability
Multidimensional OLAP (MOLAP) Sparse array-based multidimensional storage engine Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP ) (e.g., Microsoft SQLServer) Flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers (e.g., Redbricks) Specialized support for SQL queries over star/snowflake schemas
4.2.6 OLAP operations
1. Roll-up
Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction2. Drill-downCan be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions3.Slice and DiceSlice performs a selection on one dimension of the given cube, resulting in a sub cubeDice defines a subcube by performing a selection on two or more dimensions
4.Pivot (rotate)
It’s a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data
4.2.7 Concept hierarchies
It is a sequence of mappings from a set of low-level concepts to higher-level, more general concepts
Review
Key terms
OLAP Rollup Concept hierarchies
all
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
Data Warehouse
Meta Data
MDDB
OLAMEngine
OLAPEngine
User GUI API
Data Cube API
Database API
Data cleaning
Data integration
Layer3OLAP/OLAM
Layer2MDDB
Layer1Data Repository
Layer4User Interface
Filtering&Integration Filtering
Databases
Mining query Mining result
ROLAP Drill down Metadata Repository Data cube pivot Data martcuboid slice Dice
Multiple choice questions
1) _____ routines attempt to fill in missing values (a)Data cleaning (b)Data transformation (c)data integration (d)None of the above
2) Which method is the best for filling missing values? (a)Ignore the tuple (b) Manual filling (c) by global constant (d) use attribute mean
3) What are the methods for handling noisy data? (a)Binning (b)Regression (c)clustering (d)All of the above
4) Rules for examining the data are (a)unique rule (b)consequetive rule (c)null rule (d)All of the above
5) commercial tools for discrepancy detection (a)data scrubbing tools (b)data mining tools (c)data auditing tools (d)All of the
6) Redundancy can be detected by (a)correlation analysis (b)Entity identification problem (c)min-max normalization (d)All of the above
7) Attribute data are scaled to fall within a specified range 0.0. to 1.0 is called as (a)Integration (b)Normalization (c)generalization (d)None of the above
8) Concept hierarchy is also called as (a)Generalization (b)Normalization (c)Integration (d)None of the above
9) smoothing can be performed by (a)Binning (b)clusering (c)regression (d)All of the above
10) Suppose minimum and maximum values are given,we can use ____ normalization (a)Min - max (b)Z- score (c)decimal (d)All of the above 11) Suppose mean and standard deviation of the values are given, we can use __
normalization (a)Min - max (b)Z- score (c)Decimal scaling (d)All of the above
12) suppose recorded values given in a range, we can use ___ normalization. (a)Min - max (b)Z-score (c)Decimal scaling (d)All of the above
13) Attribute construction is also known as (a)Feature construction (b) Feature selection (c) attribute selection (d) none of the above
14) In _______ irrelevant attributes can be detected and removed. (a) Attribute selection (b) Data transformation (c) data integration (d) data cleaning
15) _________ methods can also be applied for data reduction. (a)Data cleaning (b) Data transformation (c) data smoothing (d) All of the above
Review questions
PART – A
1. Define data warehouse2. Define the characteristics of data warehouse3. Define a Data mart?4. Define enterprise data warehouse5. What are the schemas used for multidimensional model?6. What is the disadvantage of star schema?7. What is the disadvantage of snowflake schema?8. What is data warehouse performance issue?9. What is Data Inconsistency Cleaning?10.Merits of Data warehouse.11. What are the characteristics of metadata repository?12.List some of the data warehouse tools.13.Define concept hierarchy.14.Explain OLAP.15.Explain ROLAP.16.Explain MOLAP.17.Explain HOLAP.18.Define specialized SQL server.
PART – B
1. Explain about data warehousing architecture.2. Discuss in detail. OLAP operations.3. Describe about concept hierarchies.4. What are Schemas used for Multidimensional model?5. Discuss about data warehouse implementation.
REFERENCES
1. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", MorganKaufmann Publishers, 2002.2. Alex Berson,Stephen J. Smith, “Data Warehousing, Data Mining,& OLAP”, TataMcGraw- Hill, 2004.3. Usama M.Fayyad, Gregory Piatetsky - Shapiro, Padhrai Smyth And RamasamyUthurusamy, "Advances In Knowledge Discovery And Data Mining", The M.I.TPress, 1996.4. Ralph Kimball, "The Data Warehouse Life Cycle Toolkit", John Wiley & Sons Inc.,1998.5. Sean Kelly, "Data Warehousing In Action", John Wiley & Sons Inc., 1997.
For Further references
http://www.cs.wmich.edu/~yang/teach/cs595/han/ch3.pdf
http://www.slideshare.net/idnats/data-warehousing-and-data-mining-presentation-725476
http://onlineassociate.net/ppt/Data-Warehouse-and-OLAP-Technology-Slide-1/