Elsayed Hemayed Data Mining Course
Data Warehouse

Outline
  Introduction
  Operational System (OLTP) vs. Data Warehouse (OLAP)
  Data Warehouse vs. Data Marts
  Data Warehouse Architecture
  Data Warehouse Structure

Data, Data everywhere
  I can't find the data I need
    data is scattered over the network
    many versions, subtle differences
  I can't get the data I need
    need an expert to get the data
  I can't understand the data I found
    available data poorly documented
  I can't use the data I found
    results are unexpected
    data needs to be transformed from one form to other

What is a Data Warehouse?
  A single, complete and consistent store of data obtained from a variety of different sources made available to end users in what they can understand and use in a business context. [Barry Devlin]

What are the users saying...
  Data should be integrated across the enterprise
  Summary data has a real value to the organization
  Historical data holds the key to understanding data over time
  What-if capabilities are required

What is Data Warehousing?
  A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996]
  [Figure: Data -> Information]

Warehouses are Very Large Databases
  Terabytes -- 10^12 bytes: Walmart (terabytes)
  Petabytes -- 10^15 bytes: Geographic Information Systems
  Exabytes -- 10^18 bytes: National Medical Records
  Zettabytes -- 10^21 bytes: Weather images
  Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing -- It is a process
  Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
  A decision support database maintained separately from the organization's operational database

Why Separate Data Warehouse?
  Performance
    Operational dbs designed & tuned for known transactions & workloads.
    Complex OLAP queries would degrade performance for operational transactions.
    Special data organization, access & implementation methods needed for multidimensional views & queries.
  Function
    Missing data: Decision support requires historical data, which operational dbs do not typically maintain.
    Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational dbs, external sources.
    Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

Key Definition
  OLTP: On Line Transaction Processing
    Describes processing at operational sites
  OLAP: On Line Analytical Processing
    Describes processing at warehouse
  Business Intelligence refers to reporting and analysis of data stored in the warehouse
  Data warehouse is the foundation for business intelligence.
  Data warehouse/business intelligence (DW/BI) refers to the complete end-to-end system.

Explorers, Farmers and Tourists
  Tourists: Browse information harvested by farmers
  Farmers: Harvest information from known access paths
  Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Data Mining works with Warehouse Data
  Data Warehousing provides the Enterprise with a memory
  Data Mining provides the Enterprise with intelligence

To summarize ...
  Operational (OLTP) Systems are used to run a business
  The Data Warehouse (OLAP) helps to optimize the business

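As a rough illustration of the OLTP/OLAP contrast, the sketch below runs one transaction-style statement and one analytical query against an in-memory SQLite table. The table `orders`, its columns and its rows are hypothetical, and a single table is used only to keep the sketch short; the slides argue for keeping the analytical copy in a separate warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_no INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 100.0), (2, "north", 50.0), (3, "south", 70.0)],
)

# OLTP-style work: touch one known row, located by its key.
conn.execute("UPDATE orders SET amount = 120.0 WHERE order_no = 1")

# OLAP-style work: scan and aggregate many rows to answer a business question.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The update touches a single row, while the aggregate has to read every row; on large tables this second kind of query is what would compete with operational transactions.
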
Data Warehouse vs. Data Marts
  What comes first

Data Mart Vs Data Warehouse
  Data mart is a specific, subject-oriented repository of data that was designed to answer specific questions
  Usually, multiple data marts exist to serve the needs of multiple business units (sales, marketing, operations, collections, accounting, etc.)
  Data warehouse is a single organizational repository of enterprise wide data across many or all subject areas.
  Data warehouse is an enterprise wide collection of data marts

From the Data Warehouse to Data Marts
  [Figure labels: Information; Individually Structured; Departmentally Structured; Organizationally Structured (Data Warehouse); Less / More; History; Normalized; Detailed]

Data Warehouse and Data Marts
  [Figure labels: OLAP; Data Mart: lightly summarized, departmentally structured (Sales, Mktg., Finance); Data Warehouse data: organizationally structured, atomic, detailed]

Characteristics of the Departmental Data Mart
  OLAP
  Small
  Flexible
  Customized by Department
  Source is departmentally structured data warehouse
  [Figure labels: Sales, Mktg., Finance]

Data Mart Centric
  [Figure labels: Data Sources, Data Marts, Data Warehouse]

Problems with Data Mart Centric Solution
  If you end up creating multiple warehouses, integrating them is a problem

True Warehouse
  [Figure labels: Data Sources, Data Warehouse, Data Marts]

Data Warehouse Architecture
  [Figure labels: sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems); Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]

Implementing a Warehouse
  Monitoring: Getting the data from the sources
  Data Integration
  Cleansing
  Loading
  Processing: Query processing, indexing, ...
  Managing: Metadata, Design, ...

Monitoring
  Source Types: relational, flat file, IMS, WWW, news-wire, ...
  Incremental vs. Refresh

Monitoring Techniques
  Periodic snapshots
  Database triggers
  Log shipping
  Data shipping (replication service)
  Transaction shipping
  Polling (queries to source)
  Application level monitoring

Monitoring Issues
  Frequency
    periodic: daily, weekly, ...
    triggered: on big change, lots of changes, ...
  Data transformation
    convert data to uniform format
    remove & add fields (e.g., add date to get history)
  Standards (e.g., ODBC)
  Gateways

Refresh
  Propagate updates on source data to the warehouse
  Issues:
    when to refresh
    how to refresh -- refresh techniques

When to Refresh?
  periodically (e.g., every night, every week) or after significant events
  on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
  refresh policy set by administrator based on user needs and traffic
  possibly different policies for different sources

How To Detect Changes
  Create a snapshot log table to record ids of updated rows of source data and timestamp
  Detect changes by:
    Defining after row triggers to update snapshot log when source table changes
    Using regular transaction log to detect changes to source data

Data Integration Across Sources
  [Figure labels: Trust, Savings, Loans, Credit card; Same data, different name; Different data, same name; Data found here, nowhere else; Different keys, same data]

Data Transformation Example
  Application data transformed on the way into the Data Warehouse:
  encoding
    appl A - m, f
    appl B - 1, 0
    appl C - x, y
    appl D - male, female
  unit
    appl A - pipeline - cm
    appl B - pipeline - in
    appl C - pipeline - feet
    appl D - pipeline - yds
  field
    appl A - balance
    appl B - bal
    appl C - currbal
    appl D - balcurr

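The three transformations in the example above (encoding, unit, field name) can be sketched as a small mapping step. This is only an illustration under stated assumptions: the slide does not say which source codes correspond to which gender, so pairing them by position (m, 1, x, male) is an assumption, as is the choice of centimetres and `balance` as the warehouse representation. The function name `to_warehouse_record` is hypothetical.

```python
# Assumed code pairing by position: m / 1 / x / male and f / 0 / y / female.
GENDER_CODES = {
    "A": {"m": "male", "f": "female"},
    "B": {"1": "male", "0": "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}

# Factors converting each application's pipeline unit to centimetres
# (cm, inches, feet, yards).
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}

# Name of the balance field in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse_record(app, record):
    """Map one source record to a uniform warehouse representation."""
    return {
        "gender": GENDER_CODES[app][str(record["gender"]).lower()],
        "pipeline_cm": round(record["pipeline"] * TO_CM[app], 2),
        "balance": record[BALANCE_FIELD[app]],
    }


example = to_warehouse_record("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
print(example)
```

Keeping the per-application rules in lookup tables means a new source only needs new table entries, not new transformation code.
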
Data Integrity Problems
  Same person, different spellings
    Ahmed, Ahmad, Ahmaad etc...
  Multiple ways to denote company name
    Persistent Systems, PSPL, Persistent Pvt. LTD.
  Use of different names
    Oct 6, 6 Oct
  Different account numbers generated by different applications for the same customer
  Required fields left blank
  Invalid product codes collected at point of sale
    manual entry leads to mistakes
    in case of a problem use ...

Data Extraction and Cleansing
  Extract data from existing operational and legacy data
  Issues:
    Sources of data for the warehouse
    Data quality at the sources
    Merging different data sources
    Data Transformation
    How to propagate updates (on the sources) to the warehouse
    Terabytes of data to be loaded

Scrubbing Data
  Sophisticated transformation tools.
  Used for improving the quality of data
  Clean data is vital for the success of the warehouse
  Example
    Ahmed Aly, Ahmad Ali, Ahmaad Aly, Ahmad Aly, etc. are the same person
  Scrubbing Tools
    Apertus -- Enterprise/Integrator
    Vality -- IPE
    Postal Soft

Data Loading
  After extracting, cleaning, validating etc. need to load the data into the warehouse
  Issues
    huge volumes of data to be loaded
    small time window available when warehouse can be taken off line (usually nights)
    when to build index and summary tables
    allow system administrators to monitor, cancel, resume, change load rates
    Recover gracefully -- restart after failure from where you were and without loss of data integrity

Load Techniques
  Use SQL to append or insert new data
    record at a time interface
    will lead to random disk I/Os
  Use batch load utility
  Incremental versus Full loads
  Online versus Offline loads

Data Warehouse Structure
  Subject Orientation -- customer, product, policy, account etc...
  A subject may be implemented as a set of related tables. E.g., customer may be five tables

Data Warehouse Structure
  base customer ( ): custid, from date, to date, name, phone, dob
  base customer ( ): custid, from date, to date, name, credit rating, employer
  customer activity ( ) -- monthly summary
  customer activity detail ( ): custid, activity date, amount, clerk id, order no
  customer activity detail ( ): custid, activity date, amount, line item no, order no
  Time is part of key of each table

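To make "time is part of key" concrete, here is a minimal sketch using SQLite from Python. The table follows the `base customer` column list above (dob omitted for brevity); the sample rows, the use of ISO date strings, the `9999-12-31` marker for "still valid", and the helper `phone_as_of` are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE base_customer (
        custid    INTEGER,
        from_date TEXT,   -- ISO dates, so string comparison orders correctly
        to_date   TEXT,
        name      TEXT,
        phone     TEXT,
        PRIMARY KEY (custid, from_date)  -- time is part of the key
    )
    """
)
conn.executemany(
    "INSERT INTO base_customer VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01", "2021-06-30", "Ahmed Aly", "555-0100"),
        (1, "2021-07-01", "9999-12-31", "Ahmed Aly", "555-0199"),
    ],
)


def phone_as_of(custid, day):
    """Return the phone number that was valid for a customer on a given day."""
    row = conn.execute(
        "SELECT phone FROM base_customer "
        "WHERE custid = ? AND from_date <= ? AND to_date >= ?",
        (custid, day, day),
    ).fetchone()
    return row[0] if row else None


print(phone_as_of(1, "2020-05-15"))
print(phone_as_of(1, "2022-01-01"))
```

Because the key includes `from_date`, the same customer can appear once per validity period, which is what lets the warehouse keep history that an operational table would overwrite.
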
Data Granularity in Warehouse
  Summarized data stored
    reduce storage costs
    reduce cpu usage
    increases performance since smaller number of records to be processed
    design around traditional high level reporting needs
    tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse
  Cannot answer some questions with summarized data
    Did Ahmed call Aly last month? Not possible to answer if only the total duration of calls by Ahmed over a month is maintained and individual call details are not.
  Detailed data too voluminous

Granularity in Warehouse
  Tradeoff is to have dual level of granularity
    Store summary data on disks
      95% of DSS processing done against this data
    Store detail on tapes
      5% of DSS processing against this data

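The granularity tradeoff can be seen in a few lines of Python. The call records below are invented for illustration; the point is only that the monthly summary answers "how many minutes did Ahmed talk?" but the question "did Ahmed call Aly?" needs the detail level.

```python
from collections import defaultdict

# Hypothetical call detail records for one month: (caller, callee, minutes).
call_details = [
    ("Ahmed", "Aly", 5),
    ("Ahmed", "Mona", 12),
    ("Sara", "Aly", 3),
]

# Summary level: total minutes per caller. The callee is no longer visible.
monthly_summary = defaultdict(int)
for caller, _callee, minutes in call_details:
    monthly_summary[caller] += minutes


def did_call(caller, callee):
    """Answerable only from the detail level, not from monthly_summary."""
    return any(c == caller and d == callee for c, d, _ in call_details)


print(dict(monthly_summary))
print(did_call("Ahmed", "Aly"))
```

With dual granularity, most queries would read something like `monthly_summary`, and only the occasional detailed question would go to the larger detail store.
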
Vertical Partitioning
  Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
  Frequently accessed: Acct. No, Balance
    Smaller table and so less I/O
  Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address

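A minimal sketch of this split, using SQLite from Python. The table names `account_hot` and `account_cold` and the sample row are hypothetical; the column split follows the frequently/rarely accessed grouping above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Frequently accessed columns: small rows, so less I/O per lookup.
    CREATE TABLE account_hot (
        acct_no INTEGER PRIMARY KEY,
        balance REAL
    );
    -- Rarely accessed columns, keyed by the same account number.
    CREATE TABLE account_cold (
        acct_no       INTEGER PRIMARY KEY,
        name          TEXT,
        date_opened   TEXT,
        interest_rate REAL,
        address       TEXT
    );
    INSERT INTO account_hot VALUES (1001, 2500.0);
    INSERT INTO account_cold VALUES (1001, 'Ahmed Aly', '2019-03-01', 0.02, 'Cairo');
    """
)

# The common query touches only the narrow table.
balance = conn.execute(
    "SELECT balance FROM account_hot WHERE acct_no = ?", (1001,)
).fetchone()[0]

# The full row is still available through a join on the shared key.
full_row = conn.execute(
    "SELECT h.acct_no, c.name, h.balance, c.date_opened, c.interest_rate, c.address "
    "FROM account_hot h JOIN account_cold c ON c.acct_no = h.acct_no "
    "WHERE h.acct_no = ?",
    (1001,),
).fetchone()
print(balance, full_row)
```

The account number is repeated in both tables so that the original row can be reassembled when the rarely used columns are needed.
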
Schema Design
  Database organization
    must look like business
    must be recognizable by business user
    approachable by business user
    Must be simple
  Schema Types
    Star Schema
    Fact Constellation Schema
    Snowflake schema

Dimensional Modeling
  [Figure labels: Fact Table, Dimension Table, Dimension Table]

Fact Tables
  Contain the metrics resulting from a business process or measurement event, such as the sales ordering process or service call event
  Dimensional models should be structured around business processes and their associated data sources.
  This results in ability to design identical, consistent views of data for all observers, regardless of which business unit they belong to, which goes a long way toward eliminating misunderstandings at business meetings
  Fact table granularity should be set at the lowest, most atomic level captured by the business process
  This allows for maximum flexibility and extensibility. Business users will be able to ask constantly changing, free-ranging, and very precise questions.

Fact Table
  Central table
    mostly raw numeric items
    narrow rows, a few columns at most
    large number of rows (millions to a billion)
    Access via dimensions

Dimension Tables
  Contain the descriptive attributes and characteristics associated with specific, tangible measurement events, such as the customer, product, or sales representative associated with an order being placed.
  Dimension attributes are used for constraining, grouping, or labeling in a query.
  Hierarchical many-to-one relationships are denormalized into single dimension tables.

Dimension Table
  Define business in terms already familiar to users
    Wide rows with lots of descriptive text
    Small tables (about a million rows)
    Joined to fact table by a foreign key
    heavily indexed
    typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Star Schema
  A single fact table and multiple dimension tables
  [Figure: fact table (date, custno, prodno, cityname, ...) with dimension tables time, prod, cust, city]

Star Schema Example
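Since the example slide itself did not survive as text, the following is a stand-in sketch of a star schema in SQLite, not the original example. The fact columns `date`, `custno`, `prodno` and `cityname` come from the Star Schema slide; the `sales` measure, the dimension attributes and all rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes.
    CREATE TABLE cust (custno INTEGER PRIMARY KEY, custname TEXT);
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY, region TEXT);

    -- Fact table: one narrow row per sale, foreign keys into each dimension.
    CREATE TABLE fact (
        date     TEXT,
        custno   INTEGER REFERENCES cust(custno),
        prodno   INTEGER REFERENCES prod(prodno),
        cityname TEXT REFERENCES city(cityname),
        sales    REAL
    );

    INSERT INTO cust VALUES (1, 'Ahmed'), (2, 'Mona');
    INSERT INTO prod VALUES (10, 'widget'), (20, 'gadget');
    INSERT INTO city VALUES ('Cairo', 'north'), ('Aswan', 'south');
    INSERT INTO fact VALUES
        ('2024-01-05', 1, 10, 'Cairo', 100.0),
        ('2024-01-06', 2, 10, 'Aswan', 40.0),
        ('2024-01-07', 1, 20, 'Cairo', 60.0);
    """
)

# Access via dimensions: constrain and group by a dimension attribute.
sales_by_region = conn.execute(
    "SELECT c.region, SUM(f.sales) FROM fact f "
    "JOIN city c ON c.cityname = f.cityname "
    "GROUP BY c.region ORDER BY c.region"
).fetchall()
print(sales_by_region)
```

Every analytical query has the same shape: join the fact table to one or more dimensions, filter or group on dimension attributes, and aggregate the numeric facts.
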
Snowflake schema
  The tables which describe the dimensions are normalized. Easy to maintain and saves storage
  [Figure labels: prod, time, fact (date, custno, prodno, cityname, ...), cust, reg, city]

Snowflake Schema Example
  [Figure labels: sType, store, city, region]

Fact Constellation
  Multiple fact tables that share many dimension tables
  Booking and Checkout may share many dimension tables in the hotel industry
  [Figure labels: Hotels, Travel Agents, Promotion, Room Type, Customer, Booking, Checkout]

Hybrid Approach
  If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate
  In practice, many data warehouses will normalize some dimensions and not others, and hence use a combination of snowflake and classic star schema.

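What snowflaking a dimension means in practice can be sketched with the store, city and region labels from the Snowflake Schema Example. Everything else here (table and column names, the `sales_fact` table, the rows) is hypothetical. The region name is stored once per region rather than once per store, at the cost of extra joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized (snowflaked) store dimension: store -> city -> region.
    CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE city (
        city_id   INTEGER PRIMARY KEY,
        city_name TEXT,
        region_id INTEGER REFERENCES region(region_id)
    );
    CREATE TABLE store (
        store_id INTEGER PRIMARY KEY,
        city_id  INTEGER REFERENCES city(city_id)
    );
    CREATE TABLE sales_fact (
        store_id INTEGER REFERENCES store(store_id),
        amount   REAL
    );

    INSERT INTO region VALUES (1, 'north'), (2, 'south');
    INSERT INTO city VALUES (1, 'Cairo', 1), (2, 'Aswan', 2);
    INSERT INTO store VALUES (100, 1), (200, 1), (300, 2);
    INSERT INTO sales_fact VALUES (100, 10.0), (200, 20.0), (300, 5.0);
    """
)

# Rolling up to region now walks the normalized chain of dimension tables.
sales_by_region = conn.execute(
    "SELECT r.region_name, SUM(f.amount) FROM sales_fact f "
    "JOIN store s ON s.store_id = f.store_id "
    "JOIN city c ON c.city_id = s.city_id "
    "JOIN region r ON r.region_id = c.region_id "
    "GROUP BY r.region_name ORDER BY r.region_name"
).fetchall()
print(sales_by_region)
```

In a hybrid design, only dimensions like this one would be normalized, while small dimensions would stay as single denormalized tables.
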
Partitioning
  Breaking data into several physical units that can be handled separately
  Not a question of whether to do it in data warehouses but how to do it
  Granularity and partitioning are key to effective implementation of a warehouse

Why Partition?
  Flexibility in managing data
  Smaller physical units allow
    easy restructuring
    free indexing
    sequential scans if needed
    easy reorganization
    easy recovery
    easy monitoring

Criterion for Partitioning
  Typically partitioned by
    date
    line of business
    geography
    organizational unit
    any combination of above

Query Processing
  Indexing
  Parallel Query Processing
  Pre computed views/aggregates
  SQL extensions
    Extended family of aggregate functions
      rank (top 10 customers)
      percentile (top 30% of customers)
      median, mode
    Reporting features
      running total, cumulative totals

Metadata Repository
  Administrative metadata
    source databases and their contents
    gateway descriptions
    warehouse schema, view & derived data definitions
    dimensions, hierarchies
    pre-defined queries and reports
    data mart locations and contents
    data partitions
    data extraction, cleansing, transformation rules, defaults
    data refresh and purging rules
    user profiles, user groups
    security: user authorization, access control

Metadata Repository .. 2
  Business data
    business terms and definitions
    ownership of data
    charging policies
  operational metadata
    data lineage: history of migrated data and sequence of transformations applied
    currency of data: active, archived, purged
    monitoring information: warehouse usage statistics, error reports, audit trails.

Data Warehouse References
  W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
  W.H. Inmon, J. D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
  Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc., 1997

Summary
  Introduction
  Operational System (OLTP) vs. Data Warehouse (OLAP)
  Data Warehouse vs. Data Marts
  Data Warehouse Architecture
  Data Warehouse Structure