Data Warehousing
Dale-Marie Wilson, Ph.D.
Post on 21-Dec-2015
Evolution of Data Warehousing
Since the 1970s, organizations have gained competitive advantage through:
  • Automated business processes
  • More efficient and cost-effective services to customers
Resulted in accumulation of growing amounts of data in operational databases
Evolution of Data Warehousing
Increased focus on ways to use operational data to support decision-making
  • Means of gaining competitive advantage
Operational systems not designed to support such business activities
  • Typically numerous operational systems with overlapping and contradictory definitions
Organizations need to turn archives of data into a source of knowledge
  • Goal: single integrated / consolidated view of organization’s data presented to user
Solution: Data Warehouse
  • Provides system capable of supporting decision-making, receiving data from multiple operational data sources
Data Warehousing Concepts
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993)
Subject-oriented Data
Warehouse organized around major subjects of the enterprise (e.g. customers, products, and sales)
  • Not major application areas (e.g. customer invoicing, stock control, and product sales)
Stores decision-support data, not application-oriented data
Integrated Data
Integrates corporate application-oriented data from different source systems
  • Includes inconsistent data
Integrated data source made consistent
  • Presents unified view of data to users
Time-variant Data
Data accurate and valid at a point in time or over a time interval
Time-variance shown in:
  • Extended time for which data is held
  • Implicit/explicit association of time with data
  • Data represents series of snapshots
Non-volatile Data
Data not updated in real time
Refreshed from operational systems on regular basis
New data added as supplement not replacement
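The time-variant and non-volatile properties can be illustrated with a minimal sketch. The in-memory list, the `refresh` function, and the sample customer data are all hypothetical and stand in for a real load process; the point is only that each refresh appends a dated snapshot rather than overwriting earlier rows.

```python
from datetime import date

# Hypothetical in-memory "warehouse": a list of dated snapshot rows.
warehouse = []

def refresh(snapshot_date, operational_rows):
    """Append a dated copy of the operational rows; never update or delete."""
    for row in operational_rows:
        warehouse.append({"snapshot_date": snapshot_date, **row})

# The operational system holds only the current balance for each customer.
refresh(date(2004, 1, 31), [{"customer": "C1", "balance": 100}])
refresh(date(2004, 2, 29), [{"customer": "C1", "balance": 250}])

# Both snapshots remain available, so the history can be analysed over time.
history = [(r["snapshot_date"].isoformat(), r["balance"])
           for r in warehouse if r["customer"] == "C1"]
print(history)
```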
Data Webhouse
Web is a source of behavioral data
  • Clickstream – user’s path through Website and Web history
Data webhouse is a distributed data warehouse with no central data repository that is implemented over the Web to harness clickstream data
Benefits of Data Warehouse
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
Data Warehouse Queries
Queries
  • Range from relatively simple to highly complex
  • Dependent on end-user access tools used
End-user access tools:
  • Reporting, query, and application development tools
  • Executive information systems (EIS)
  • OLAP tools
  • Data mining tools
Examples of Typical Data Warehouse Queries
What was the total revenue for Scotland in the third quarter of 2004?
What was the total revenue for property sales for each type of property in Great Britain in 2003?
What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the figures for the previous two years?
What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?
What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?
Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?
What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?
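As a rough illustration, the simplest of these questions (total revenue for Scotland in the third quarter of 2004) might be expressed as a single aggregate query. The sketch below uses Python's standard `sqlite3` module; the `property_sale` table, its columns, and the sample rows are assumptions made for the example and do not come from any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per property sale, dates stored as ISO strings.
conn.execute("CREATE TABLE property_sale (region TEXT, sale_date TEXT, revenue REAL)")
conn.executemany("INSERT INTO property_sale VALUES (?, ?, ?)", [
    ("Scotland", "2004-07-15", 1000.0),
    ("Scotland", "2004-09-30", 500.0),
    ("Scotland", "2004-10-01", 700.0),  # fourth quarter, excluded
    ("England", "2004-08-01", 900.0),   # other region, excluded
])

# Third quarter = 1 July (inclusive) to 1 October (exclusive).
total_revenue = conn.execute(
    "SELECT SUM(revenue) FROM property_sale "
    "WHERE region = ? AND sale_date >= ? AND sale_date < ?",
    ("Scotland", "2004-07-01", "2004-10-01"),
).fetchone()[0]
print(total_revenue)
```

The later questions in the list (rolling comparisons, what-if analysis, correlations) generally need OLAP or data mining tools rather than one simple query.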
Problems of Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
Operational Data Resources
Mainframe first-generation hierarchical and network databases
Departmental proprietary file systems (e.g. VSAM, RMS)
Relational DBMSs (e.g. Informix, Oracle)
Private workstations and servers
External systems
  • Internet
  • Commercially available databases
  • Databases associated with organization’s suppliers or customers
Operational Data Store (ODS)
Repository of current and integrated operational data used for analysis
Structured and supplied with data like a data warehouse
May act as staging area for data to be moved into warehouse
Created when legacy operational systems are incapable of achieving reporting requirements
Benefits:
  • Provides users with ease-of-use of relational database
  • Distant from decision support functions of data warehouse
Load Manager
Performs operations associated with extraction and loading of data
Size and complexity vary between data warehouses
Constructed using combination of vendor data loading tools and custom-built programs
Warehouse Manager
Performs operations associated with management of data
Constructed using vendor data management tools and custom-built programs
Operations:
  • Data analysis to ensure consistency
  • Transformation and merging of source data from temporary storage
  • Creation of indexes and views on base tables
  • Generation of denormalizations (if necessary)
  • Generation of aggregations (if necessary)
  • Backing-up and archiving data
Warehouse Manager
Generates query profiles to determine which indexes and aggregations are appropriate
Query profile
  • Can be generated for each user, group of users, or the data warehouse
  • Describes characteristics of queries
    • Frequency
    • Target table(s)
    • Size of results set
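A query profile of this kind could be kept as a small record that is updated each time a query runs. The following is a minimal sketch; the `QueryProfile` class, its field names, and the table names are hypothetical and are not taken from any particular warehouse product.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class QueryProfile:
    """Query characteristics for one user, a group of users, or the whole warehouse."""
    frequency: int = 0                                       # number of queries seen
    target_tables: Counter = field(default_factory=Counter)  # how often each table is hit
    total_result_rows: int = 0                               # used for average result size

    def record(self, tables, result_rows):
        """Update the profile with one executed query."""
        self.frequency += 1
        self.target_tables.update(tables)
        self.total_result_rows += result_rows

    def average_result_size(self):
        return self.total_result_rows / self.frequency if self.frequency else 0.0

profile = QueryProfile()
profile.record(["sale_fact", "branch_dim"], result_rows=120)
profile.record(["sale_fact", "time_dim"], result_rows=80)

# The most frequently queried table is a candidate for indexes or aggregations.
most_queried = profile.target_tables.most_common(1)[0]
print(profile.frequency, most_queried, profile.average_result_size())
```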
Query Manager
Performs operations associated with management of user queries
Constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs
Complexity determined by facilities provided by end-user access tools and database
Operations:
  • Directing queries to appropriate tables
  • Scheduling execution of queries
Can generate query profiles
  • Allows warehouse manager to determine appropriate indexes and aggregations
Detailed Data
Detailed data stored in database schema
  • Not stored online
  • Aggregated to next level of detail
Regularly added to warehouse to supplement aggregated data
Lightly and Highly Summarized Data
Stores pre-defined lightly and highly aggregated data generated by warehouse manager
Transient - changes to respond to changing query profiles
Purpose of summary information: improve query performance
Removes requirement to continually perform summary operations in answering user queries
Summary data updated continuously as new data loaded into warehouse
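The idea of pre-stored summary data can be sketched with `sqlite3`. The table names (`sale_detail`, `sale_monthly_summary`) and the full-rebuild strategy are assumptions chosen for brevity; a real warehouse manager would more likely refresh summaries incrementally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale_detail (branch TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO sale_detail VALUES (?, ?, ?)", [
    ("B1", "2004-01", 100.0), ("B1", "2004-01", 50.0),
    ("B1", "2004-02", 70.0), ("B2", "2004-01", 30.0),
])

def rebuild_summary(conn):
    """Recreate the pre-aggregated table from the detail rows (full rebuild)."""
    conn.execute("DROP TABLE IF EXISTS sale_monthly_summary")
    conn.execute(
        "CREATE TABLE sale_monthly_summary AS "
        "SELECT branch, month, SUM(amount) AS total_amount "
        "FROM sale_detail GROUP BY branch, month"
    )

rebuild_summary(conn)

# New detail data is loaded, so the summary is refreshed as well.
conn.execute("INSERT INTO sale_detail VALUES ('B2', '2004-01', 20.0)")
rebuild_summary(conn)

# User queries read the small summary table instead of re-aggregating detail.
summary = conn.execute(
    "SELECT branch, month, total_amount FROM sale_monthly_summary "
    "ORDER BY branch, month"
).fetchall()
print(summary)
```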
Archive/Backup Data
Stores detailed and summarized data for archiving and backup
Data transferred to storage archives - magnetic tape or optical disk
Metadata
Stores metadata (data about data) definitions used by all processes in warehouse
Used for:
  • Extraction and loading processes
    • Used to map data sources to common view of information within warehouse
  • Warehouse management process
    • Used to automate production of summary tables
  • Query management process
    • Used to direct query to most appropriate data source
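For the extraction and loading case, metadata can be as simple as a per-source table of field mappings. The sketch below is illustrative only: the source system names, field names, and the `to_common_view` helper are all invented for the example.

```python
# Hypothetical metadata: for each source system, map its field names to the
# warehouse's common field names.
FIELD_MAPPINGS = {
    "billing_system": {"cust_no": "customer_id", "amt": "amount"},
    "sales_system": {"customerId": "customer_id", "total": "amount"},
}

def to_common_view(source, record):
    """Rename a source record's fields using the metadata for that source.

    Fields with no mapping are dropped, since the warehouse has no place for them.
    """
    mapping = FIELD_MAPPINGS[source]
    return {mapping[name]: value for name, value in record.items() if name in mapping}

rows = [
    to_common_view("billing_system", {"cust_no": 7, "amt": 19.5}),
    to_common_view("sales_system", {"customerId": 7, "total": 5.0, "internal_flag": 1}),
]
print(rows)
```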
Metadata
Metadata structure differs between processes because each has a different purpose
Issues:
  • Multiple copies of metadata describe same data item
  • Vendor tools and end-user data access tools use own versions of metadata
Copy management tools use metadata to understand mapping rules that are applied to convert source data into common form
End-user access tools use metadata to understand how to build a query
The management of metadata within a data warehouse is a very complex task that should not be underestimated
End-User Access Tools
Principal purpose of data warehousing: To provide information to business users for strategic decision-making
Users interact with warehouse using end-user access tools
Data warehouse must efficiently support ad hoc and routine analysis
High performance achieved by pre-planning requirements for joins, summations, and periodic reports by end-users (where possible)
Main groups of access tools:
  • Data reporting and query tools
  • Application development tools
  • Executive information system (EIS) tools
  • Online analytical processing (OLAP) tools
  • Data mining tools
Data Warehouse Information Flows
Inflow - Processes associated with extraction, cleansing, and loading data from source systems
Upflow - Processes associated with adding value to data in warehouse through summarizing, packaging, and distribution
Downflow - Processes associated with archiving and backing-up/recovery of data
Outflow - Processes associated with making data available to end-users
Metaflow - Processes associated with management of metadata
Data Warehousing Tools and Technologies
Building a data warehouse is a complex task
  • No vendor provides an ‘end-to-end’ set of tools
Necessitates data warehouse built using multiple products from different vendors
Major challenge:
  • Ensuring products work well together and are fully integrated
Data Warehousing Tools and Technologies
Tasks of capturing data from source systems, cleansing and transforming it, and loading results into target system can be carried out either by separate products, or by a single integrated solution
Integrated solutions include:
  • Code Generators
  • Database Data Replication Tools
  • Dynamic Transformation Engines
Data Warehouse DBMS Requirements
Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality
Administration and Management Tools
Monitoring data loading from multiple sources
Data quality and integrity checks
Managing and updating metadata
Monitoring database performance to ensure efficient query response times and resource utilization
Auditing data warehouse usage to provide user chargeback information
Administration and Management Tools
Replicating, subsetting, and distributing data
Maintaining efficient data storage management
Purging data
Archiving and backing-up data
Implementing recovery following failure
Security management
Data Mart
A subset of a data warehouse that supports the requirements of a particular department or business function
Characteristics:
  • Focuses on requirements of one department or business function
  • Does not normally contain detailed operational data, unlike data warehouses
  • More easily understood and navigated
Reasons for Creating a Data Mart
Give users access to data they need to analyze most often
Provide data in form that matches collective view of data by group of users in a department or business function area
Improve end-user response time
  • Reduction in volume of data to be accessed
Provide appropriately structured data as dictated by requirements of end-user access tools
Building data mart is simpler compared with establishing corporate data warehouse
Cost of implementing data marts less than that required to establish data warehouse
Potential users of data mart more clearly defined
  • More easily targeted to obtain support for data mart project
Designing Data Warehouses
Initially, need answers for questions such as:
  • Which user requirements are most important and which data should be considered first?
  • Should the project be scaled down into something more manageable?
  • Should the infrastructure for a scaled down project be capable of ultimately delivering a full-scale enterprise-wide data warehouse?
Designing Data Warehouses
Use of data marts avoids complexities associated with designing a data warehouse
Difficult to commit to enterprise-wide design that must meet all user requirements
Interim solution: build data marts
Goal: creation of data warehouse that supports requirements of enterprise
Designing Data Warehouses
Requirements collection and analysis stage:
  • Involves interviewing appropriate members of staff (such as marketing users, finance users, and sales users)
    • Identify prioritized set of requirements data warehouse must meet
  • Interviews conducted with members of staff responsible for operational systems
    • Identify which data sources can provide clean, valid, and consistent data that will remain supported over next few years
  • Interviews provide necessary information for top-down view (user requirements) and bottom-up view (available data sources)
Database component of data warehouse described using technique called dimensionality modeling
Dimensionality Modelling
Logical design technique that aims to present data in standard, intuitive form that allows for high-performance access
Uses Entity-Relationship modeling concepts with important restrictions:
  • Every dimensional model (DM) composed of one table with a composite primary key, called fact table, and set of smaller tables called dimension tables
  • Each dimension table has simple (non-composite) primary key that corresponds exactly to one component of composite key in fact table
  • Forms ‘star-like’ structure called star schema or star join
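The structure described above can be sketched as a small schema in `sqlite3`. All table and column names here (`time_dim`, `branch_dim`, `property_sale_fact`) and the sample rows are assumptions made for illustration; the point is that the fact table's composite primary key is built from the simple primary keys of the dimension tables, and a query is a 'star join' from the fact table out to each dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: simple (non-composite) primary keys.
CREATE TABLE time_dim   (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE branch_dim (branch_id INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Fact table: composite primary key, one component per dimension.
CREATE TABLE property_sale_fact (
    time_id   INTEGER REFERENCES time_dim(time_id),
    branch_id INTEGER REFERENCES branch_dim(branch_id),
    revenue   REAL,
    PRIMARY KEY (time_id, branch_id)
);

INSERT INTO time_dim VALUES (1, 2004, 3), (2, 2004, 4);
INSERT INTO branch_dim VALUES (10, 'Glasgow', 'Scotland'), (20, 'London', 'England');
INSERT INTO property_sale_fact VALUES (1, 10, 1000.0), (2, 10, 400.0), (1, 20, 900.0);
""")

# Star join: dimension attributes act as constraints and grouping columns.
revenue_by_region = conn.execute("""
    SELECT b.region, SUM(f.revenue)
    FROM property_sale_fact AS f
    JOIN time_dim   AS t ON t.time_id = f.time_id
    JOIN branch_dim AS b ON b.branch_id = f.branch_id
    WHERE t.year = 2004 AND t.quarter = 3
    GROUP BY b.region
    ORDER BY b.region
""").fetchall()
print(revenue_by_region)
```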
Dimensionality Modelling
Natural keys replaced with surrogate keys
  • Every join between fact and dimension tables based on surrogate keys, not natural keys
Surrogate key – generalized structure based on integers
  • Makes data in warehouse independent of data used and produced by OLTP systems
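One possible way to assign surrogate keys is a lookup that hands out the next integer the first time a natural key is seen. This is a sketch under assumptions: the `SurrogateKeyGenerator` class and the source system names are hypothetical, and a real warehouse would persist the mapping rather than hold it in memory.

```python
import itertools

class SurrogateKeyGenerator:
    """Assigns integer surrogate keys to natural keys from source systems."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._by_natural_key = {}

    def key_for(self, source, natural_key):
        """Return the existing surrogate key, or allocate the next integer."""
        lookup = (source, natural_key)
        if lookup not in self._by_natural_key:
            self._by_natural_key[lookup] = next(self._next_key)
        return self._by_natural_key[lookup]

keys = SurrogateKeyGenerator()
k1 = keys.key_for("billing_system", "CUST-0042")  # first sighting: new key
k2 = keys.key_for("sales_system", "42")           # different source format: new key
k3 = keys.key_for("billing_system", "CUST-0042")  # seen before: same key as k1
print(k1, k2, k3)
```

Because the warehouse joins only on these integers, a change to the format of a natural key in an OLTP system does not ripple into the fact table.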
Dimensionality Modelling
Star schema - logical structure
  • Has fact table containing factual data in center
  • Surrounded by dimension tables containing reference data, which can be denormalized
Facts generated by events that occurred in the past
  • Unlikely to change, regardless of how analyzed
Dimensionality Modelling
Fact tables:
  • Hold bulk of data in data warehouse
  • Can be extremely large
Important to treat fact data as read-only reference data that will not change over time
Most useful fact tables contain one or more numerical measures, or ‘facts’ that occur for each record and are numeric and additive
Dimensionality Modelling
Dimension tables:
  • Usually contain descriptive textual information
  • Dimension attributes used as constraints in data warehouse queries
Star schemas speed up query performance by denormalizing reference information into a single dimension table
Dimensionality Modelling
Snowflake schema
  • Variant of the star schema where dimension tables do not contain denormalized data
Starflake schema
  • Hybrid structure that contains mixture of star (denormalized) and snowflake (normalized) schemas
  • Allows dimensions to be present in both forms to cater for different query requirements
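The difference between the star and snowflake forms of a dimension can be shown by normalizing a small denormalized table. The `snowflake` function and the branch/region data below are hypothetical; they only illustrate moving a repeated attribute into its own table.

```python
# Star form: the region name is repeated (denormalized) in every branch row.
branch_dim_star = [
    {"branch_id": 1, "city": "Glasgow", "region": "Scotland"},
    {"branch_id": 2, "city": "Edinburgh", "region": "Scotland"},
    {"branch_id": 3, "city": "London", "region": "England"},
]

def snowflake(rows):
    """Split the repeated region attribute out into its own table."""
    region_ids = {}
    branch_rows = []
    for row in rows:
        # Allocate the next integer id the first time a region is seen.
        region_id = region_ids.setdefault(row["region"], len(region_ids) + 1)
        branch_rows.append(
            {"branch_id": row["branch_id"], "city": row["city"], "region_id": region_id}
        )
    region_rows = [{"region_id": rid, "region": name} for name, rid in region_ids.items()]
    return branch_rows, region_rows

branch_dim_snow, region_dim = snowflake(branch_dim_star)
print(branch_dim_snow)
print(region_dim)
```

A starflake design would keep both `branch_dim_star` and the normalized pair available, so each query can use whichever form suits it.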
Dimensionality Modelling
Advantages of predictable, standard form of underlying dimensional model:
  • Efficiency
  • Ability to handle changing requirements
    • Star schema handles ad hoc user queries well
  • Extensibility
    • Supports changes e.g. adding new dimension, facts
  • Ability to model common business situations
  • Predictable query processing
Comparison of DM and ER models
ER model
  • Reduces data redundancy
  • Beneficial to transaction processing
Single ER model normally decomposes into multiple DMs
Multiple DMs are associated through ‘shared’ dimension tables
Database Design Methodology for Data Warehouses
‘Nine-Step Methodology’:
  1. Choosing the process
  2. Choosing the grain
  3. Identifying and conforming the dimensions
  4. Choosing the facts
  5. Storing pre-calculations in the fact table
  6. Rounding out the dimension tables
  7. Choosing the duration of the database
  8. Tracking slowly changing dimensions
  9. Deciding the query priorities and the query modes
Step 1: Choosing the process
The process (function) refers to subject matter of particular data mart
First data mart built should be:
  • Most likely to be delivered on time
  • Within budget
  • Able to answer the most commercially important business questions
Step 2: Choosing the grain
Decide what a record of fact table represents
Identify dimensions of fact table
Grain decision for fact table also determines grain of each dimension table
Include time as core dimension
  • Always present in star schemas
Step 3: Identifying and Conforming dimensions
Dimensions set context for asking questions about the facts in fact table
If any dimension occurs in two data marts:
  • Must be exactly same dimension
  • Or one must be mathematical subset of other
Dimension used in more than one data mart referred to as being conformed
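A simple check for conformance can be sketched by treating each dimension as a set of rows. This is one narrow reading of 'mathematical subset' (a subset of rows), adopted here as an assumption; the `is_conformed` function and the sample branch data are hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """True if the two dimensions are identical or one is a subset of the other.

    Each dimension is given as an iterable of row tuples.
    """
    a, b = set(dim_a), set(dim_b)
    return a <= b or b <= a

# Hypothetical branch dimension as it appears in three data marts.
enterprise_branch = {(1, "Glasgow"), (2, "London"), (3, "Aberdeen")}
sales_mart_branch = {(1, "Glasgow"), (2, "London")}  # subset of the enterprise rows
rental_mart_branch = {(1, "Glasgow"), (2, "Leeds")}  # key 2 means something different

subset_ok = is_conformed(enterprise_branch, sales_mart_branch)
conflict_ok = is_conformed(enterprise_branch, rental_mart_branch)
print(subset_ok, conflict_ok)
```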
Step 4: Choosing the facts
Grain of fact table determines which facts can be used in data mart
Facts should be numeric and additive
Unusable facts include:
  • Non-numeric facts
  • Non-additive facts
  • Facts at different granularity from other facts in table
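The difference between additive and non-additive facts shows up as soon as rows are summed. In the sketch below the sample sales rows are invented: revenue and units can be added across rows, while a per-row unit price cannot, so it is better derived from the additive totals.

```python
# Hypothetical fact rows: revenue and units sold are additive measures.
sales = [
    {"branch": "B1", "revenue": 200.0, "units": 4},
    {"branch": "B2", "revenue": 300.0, "units": 2},
]

# Additive facts: summing across rows gives a meaningful total.
total_revenue = sum(row["revenue"] for row in sales)
total_units = sum(row["units"] for row in sales)

# Unit price is non-additive: adding per-row prices gives a number that
# does not correspond to anything in the business.
per_row_prices = [row["revenue"] / row["units"] for row in sales]
summed_prices = sum(per_row_prices)

# A correct overall price is derived from the additive totals instead.
overall_price = total_revenue / total_units
print(total_revenue, total_units, summed_prices, overall_price)
```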
Step 5: Storing pre-calculations in the fact table
Once facts selected
  • Re-examine to determine whether there are opportunities to use pre-calculations
Step 6: Rounding out the dimension tables
Text descriptions are added to dimension tables
Text descriptions should be intuitive and understandable to users
Usefulness of data mart determined by scope and nature of attributes of dimension tables
Step 7: Choosing the duration of the database
Duration measures how far back in time fact table goes
Very large fact tables raise two very significant data warehouse design issues:
  • Often difficult to source increasingly old data
  • Mandatory that old versions of important dimensions be used, not the most current versions - aka ‘Slowly Changing Dimension’ problem
Step 8: Tracking slowly changing dimensions
Slowly changing dimension problem
  • Proper description of old dimension data must be used with old fact data
Generalized key assigned to important dimensions
  • Allows distinction between multiple snapshots of dimensions over period of time
Three basic types of slowly changing dimensions:
  • Type 1 - where changed dimension attribute overwritten
  • Type 2 - where changed dimension attribute causes new dimension record to be created
  • Type 3 - where a changed dimension attribute causes alternate attribute to be created
    • Both the old and new values of attribute simultaneously accessible in the same dimension record
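The three types can be compared side by side with a minimal sketch. The row layout (`surrogate_key`, `natural_key`, `city`), the `previous_` prefix, and the three `apply_*` functions are all hypothetical choices for the example, not a standard API.

```python
def apply_type1(rows, natural_key, attr, new_value):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row[attr] = new_value

def apply_type2(rows, natural_key, attr, new_value):
    """Type 2: create a new dimension record with a new surrogate key."""
    current = [r for r in rows if r["natural_key"] == natural_key][-1]
    new_row = dict(current)
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in rows) + 1
    new_row[attr] = new_value
    rows.append(new_row)

def apply_type3(rows, natural_key, attr, new_value):
    """Type 3: keep the old value in an alternate attribute on the same record."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value

def fresh_dimension():
    return [{"surrogate_key": 1, "natural_key": "S21", "city": "Glasgow"}]

# Staff member S21 moves from Glasgow to Aberdeen; apply each strategy.
type1_rows, type2_rows, type3_rows = fresh_dimension(), fresh_dimension(), fresh_dimension()
apply_type1(type1_rows, "S21", "city", "Aberdeen")
apply_type2(type2_rows, "S21", "city", "Aberdeen")
apply_type3(type3_rows, "S21", "city", "Aberdeen")
print(type1_rows, type2_rows, type3_rows, sep="\n")
```

With Type 2, old fact rows keep pointing at surrogate key 1 (Glasgow) while new fact rows use key 2 (Aberdeen), which is how old fact data stays paired with the old description.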
Step 9: Deciding the query priorities and the query modes
Most critical physical design issues affecting end-user’s perception include:
  • Physical sort order of fact table on disk
  • Presence of pre-stored summaries or aggregations
Additional physical design issues:
  • Administration
  • Backup
  • Indexing performance
  • Security
Database Design Methodology for Data Warehouses
Methodology designs data mart that:
  • Supports requirements of particular business process
  • Allows easy integration with other related data marts to form enterprise-wide data warehouse
A dimensional model that contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation
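A fact constellation can be sketched as two fact tables that share one dimension table. The schema below (`branch_dim`, `sale_fact`, `rental_fact`) and its data are assumptions for illustration. Each fact table is aggregated separately before joining through the shared dimension, so that rows from one fact table do not multiply rows from the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed dimension shared by two fact tables.
CREATE TABLE branch_dim  (branch_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sale_fact   (branch_id INTEGER REFERENCES branch_dim(branch_id), revenue REAL);
CREATE TABLE rental_fact (branch_id INTEGER REFERENCES branch_dim(branch_id), rent REAL);

INSERT INTO branch_dim VALUES (1, 'Glasgow'), (2, 'London');
INSERT INTO sale_fact VALUES (1, 1000.0), (1, 500.0), (2, 900.0);
INSERT INTO rental_fact VALUES (1, 300.0), (2, 200.0), (2, 100.0);
""")

# Aggregate each fact table on its own, then combine via the shared dimension.
combined = conn.execute("""
    SELECT b.city, s.total_revenue, r.total_rent
    FROM branch_dim AS b
    JOIN (SELECT branch_id, SUM(revenue) AS total_revenue
          FROM sale_fact GROUP BY branch_id) AS s ON s.branch_id = b.branch_id
    JOIN (SELECT branch_id, SUM(rent) AS total_rent
          FROM rental_fact GROUP BY branch_id) AS r ON r.branch_id = b.branch_id
    ORDER BY b.city
""").fetchall()
print(combined)
```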