data warhouse introduction
TRANSCRIPT
8/10/2019 data warhouse introduction
http://slidepdf.com/reader/full/data-warhouse-introduction 1/93
1Copyright © 2013 Tech Mahindra. All rights reserved.
Sreenivas_Ram+91 99 499 10792
DWH Concepts
Module Breakup
The course is covered in the following modules:
Module 1 : DW – Overview & Data Warehouse vs OLTP
Module 2 : Architecture of Data Warehouse
Module 3 : ETL Process
Module 4 : Data Warehouse vs Data Mart & Conceptual DW Models
Module - 1
Data Warehouse
Concepts
What is BI?
Business intelligence (BI) is a broad category of application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.
• BI applications include the activities of
  • decision support,
  • query and reporting,
  • online analytical processing (OLAP),
  • statistical analysis,
  • forecasting, and
  • data mining.
Examples : Business Objects : www.businessobjects.com
BI- Nutshell
Raw Data
A producer wants to know…
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?
Data, Data everywhere yet …
• I can’t find the data I need
  – data is scattered over the network
  – many versions, subtle differences
• I can’t get the data I need
  – need an expert to get the data
• I can’t understand the data I found
  – available data poorly documented
• I can’t use the data I found
  – results are unexpected
  – data needs to be transformed from one form to other
What is a Data Warehouse?
“A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.”
[Barry Devlin]
What is Data Warehousing?
The aim of data warehousing is to make more effective use of the data available in an organization and to aid decision-making processes.
A data warehouse is a collection of data gathered and organized so that it can easily be analyzed, extracted, synthesized, and otherwise used for the purposes of further understanding the data. It may be contrasted with data that is gathered to meet immediate business objectives such as order and payment transactions, although this data would also usually become part of a data warehouse.
A data warehouse is a central integrated database containing data from all the operational sources and archive systems in an organization. It contains a copy of transaction data specifically structured for query analysis.
What are the users saying...
• Data should be integrated across the enterprise
• Summary data has a real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Figure: Data → Information]
Evolution
• 60’s: Batch reports
  – hard to find and analyze information
  – inflexible and expensive, reprogram every new request
• 70’s: Terminal-based DSS and EIS (executive information systems)
  – still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
  – query tools, spreadsheets, GUIs
  – easier to use, but only access operational databases
• 90’s till now: Data warehousing with integrated OLAP engines and tools, real-time DW
Data Warehouse
A data warehouse is a
subject-oriented
integrated
time-varying
non-volatile
accessible
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
Data Warehouse Architecture
[Figure: architecture diagram. Labels: source systems (Relational Databases, Legacy Data, Purchased Data, ERP Systems), Extraction / Cleansing, Optimized Loader, Data Warehouse Engine, Analyze / Query, Metadata Repository]
Data Mining works with Warehouse Data
• Data Warehousing provides the Enterprise with a Memory
• Data Mining provides the Enterprise with Intelligence
Why Separate Data Warehouse?
Performance
• Operational databases are designed and tuned for known transactions and workloads.
• Complex OLAP queries would degrade performance for operational transactions.
• Special data organization, access and implementation methods are needed for multidimensional views and queries.
Function
• Missing data: Decision support requires historical data, which operational databases do not typically maintain.
• Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
• Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
So, what’s different?
Application-Orientation vs Subject-Orientation
• Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
• Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
OLTP vs Data Warehouse
OLTP                     WAREHOUSE (DSS)
Application oriented     Subject oriented
Used to run business     Used to analyze business
Detailed data            Summarized and refined
Current, up-to-date      Snapshot data
Isolated data            Integrated data
Repetitive access        Ad-hoc access
Clerical user            Knowledge user (manager)
OLTP Vs Data Warehouse
OLTP                                     DATA WAREHOUSE
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read / update access                     Mostly read (batch update)
No data redundancy                       Redundancy present
DB size : 100 MB – 100 GB                DB size : 100 GB – few TBs
Thousands of users                       Hundreds of users
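The difference in access pattern can be sketched in a few lines. This is only an illustration: the `orders` table, its columns and its rows are hypothetical, and an in-memory SQLite database stands in for both kinds of system.

```python
import sqlite3

# Hypothetical orders table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
    [("Asha", "North", 120.0), ("Ravi", "South", 80.0),
     ("Asha", "North", 40.0), ("Meena", "South", 200.0)],
)

# OLTP-style access: read and update a single record by its key.
conn.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 1")
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = 1"
).fetchone()

# Warehouse-style access: a read-only scan that summarizes many records.
totals_by_region = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
).fetchall())

print(one_order)
print(totals_by_region)
```

The first query touches one row and changes it; the second reads every row and returns a summary, which is the kind of workload the warehouse column of the table describes.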
To summarize ...
• OLTP systems are used to “run” a business
• The data warehouse helps to “optimize” the business
Quick Recap
• What is BI?
  – BI incorporates the ability for mining the data, analyzing and reporting
• What is Data Warehouse?
  – A single, complete and consistent store of data obtained from various sources
• Architecture of Data Warehouse
• How Data Mining works with Data Warehouse
  – Data mining provides the enterprise warehouse with intelligence
• Benefits of Data Warehouse
  – Reliable reporting
  – Rapid access to data
  – Integrated data
  – Better decision making
• Differences between Data Warehouse and OLTP
  – Data warehouse: used to analyze the business
  – OLTP: used to run the business
• Need for a Separate Warehouse
QUIZ
1. _____ is a subject-oriented view of a data warehouse
   OLTP system / Data Staging Area / Data Mart / None
   Answer: Data Mart
2. Data Mining implies _____
   Modeling / Forecasting / Explanatory Analysis
   Answer: Forecasting
3. An order entry system is an example of OLTP system
   True / False
   Answer: True
4. The number of concurrent users of a data warehouse are not more
   False / True
   Answer: True
5. Data Extraction is the process of _____________
   A. Taking information/data from the source and making it available to DWH
   B. Taking extracted data and loading it into DWH
   C. Both
   Answer: Both
Module – 2
Data Warehouse
Architecture
Architecture, Design & Construction
DW Architecture
Loading, Refreshing
Structuring / Modeling
Topics to be covered
This module covers:
• Data Warehousing Architecture
  – Generic Two-Level Architecture
  – Independent Data Mart
  – Dependent Data Mart with Data Store
• ETL Process
• Data Quality Assurance
• Data Quality Tools
• ETVL Tools
• Meta Data & Importance
Data Warehouse Architecture
[Figure: architecture diagram. Labels: Operational Systems, External Systems, Information Transformation/Migration Infrastructure, Enterprise Data Warehouse, Finance Datamart (Independent), Sales Datamart (Dependent), Marketing Datamart (Dependent), Replication Services, Web Server, Light Clients, LAN Clients]
Data Warehouse Architecture
[Figure: extraction/transformation diagram. Labels: Data Stores, Legacy System, Metadata Repository, Staging Area, Extraction/Transformation Server, To Warehouse/Data Mart]
• Metadata Design/Mgmt
• Scrubbing Tool
• Mapping Tool
• Extraction Mgmt Tool
• Transformation Tool
• Migration Mgmt Tool
Data Warehouse Architecture
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
All involve some form of extraction, transformation and loading (ETL)
Generic Two-Level Architecture
[Figure: ETL flow diagram]
• One, company-wide warehouse
• Periodic extraction: data is not completely current in warehouse
Independent Data Mart Architecture
[Figure: ETL flow diagram]
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts
Dependent Data Mart with Operational Data Store
[Figure: ETL flow diagram]
• Single ETL for Enterprise Data Warehouse (EDW)
• Simpler data access
• ODS provides option for obtaining current data
• Dependent data marts loaded from EDW
Quick Recap
• Generic Two-Level Architecture
• Independent Data Mart
  – The data in this data mart is stored independent from other data marts
• Dependent Data Mart
  – Data and dimensions (tables) are being shared across multiple data marts
QUIZ
1. Which of the following statements is FALSE w.r.t. top-down approach?
   a. The data warehouse holds the atomic data extracted from source systems and from there the data is distributed to one or more data marts.
   b. It takes less time and cost to deploy than other approaches.
   c. It enforces consistency and standardization of data across all data marts.
   d. None of the above
   Answer: B
2. Main objective(s) of data warehouse design is/are
   a. Efficient query processing
   b. Efficient transaction processing
   c. None
   Answer: None
3. In independent Data Mart the data and dimensions (tables) are being shared across multiple data marts.
   True / False
   Answer: False
4. ODS provides the provision for current data.
   True / False
   Answer: True
5. The data access complexity is more in Dependent Data Marts
   True / False
   Answer: True
Module – 3
ETL Process
Building a Data Warehouse
Topics to be covered:
• Extracting
  – Extracting Data
  – Extraction Techniques
  – Extraction Tools
• Transforming
  – Importance of Quality Data
  – Characteristics of Quality Data
  – Data Quality Assurance
  – Data Quality Tools
  – Transforming Data: Problems and Solutions
  – Transforming Techniques
  – Transformation Tools
• Transporting Data (Loading)
Extracting Data
• Extraction is the process of getting data from a legacy system or any other data source.
• After extraction, the data is put in a staging area where it can be scrubbed and cleaned.
• The data may come from a single source or from multiple sources.
• If the data comes from a single source, that source can be an OLTP system or a flat file.
• If the data comes from multiple sources, a connector tool is required to connect the sources.
Extracting Data – Methods of Extraction
• The extraction process can be done either by hand coding or by using tools.
• Examine the source data and identify the extraction tool.
• The extracts are typically written in source system code (e.g. PL/SQL, VB Script or COBOL).
• An extraction tool also generates source system code.
• Using a tool for extraction makes the process easier than hand coding.
• Pre- and post-processes exist. E.g. before the extract process there may be a call for sorting the data, or a call to a function that scores a record based on a formula.
Extracting Data – Methods of Extraction
Advantages and disadvantages of custom-programmed extraction (PL/SQL scripts) and tool-based extraction:
• Tools follow a well-defined, disciplined and documented approach.
• Tools provide an easier way to perform the extraction by providing click, drag and drop features.
• Hand-coded extraction techniques allow extraction in a cost-effective manner, since the PL/SQL constructs are available with the RDBMS.
• Hand-coded extraction is used when the programmer has a clear idea of the data structures involved.
Extraction Techniques
Bulk Extraction
• The entire data warehouse is refreshed periodically by extractions from the source systems.
• All applicable data are extracted from the source systems for loading into the warehouse.
• This approach makes heavy use of the network connection for loading data from source to target databases, but the mechanism is easy to set up and maintain.
Extraction Techniques
Change-Based Replication
• Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse.
• This approach puts less load on the network connection because a smaller volume of data is transported.
• This mechanism involves complex programming to determine when a new warehouse record must be inserted and when an existing warehouse record must be updated.
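The two extraction techniques can be contrasted with a minimal sketch. It assumes each source row carries an `updated_at` timestamp that can be compared against the time of the previous extraction; that column, the sample rows and the function names are illustrative assumptions, not part of any particular tool.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
source_rows = [
    {"id": 1, "name": "Widget", "updated_at": datetime(2019, 8, 1)},
    {"id": 2, "name": "Gadget", "updated_at": datetime(2019, 8, 9)},
    {"id": 3, "name": "Gizmo", "updated_at": datetime(2019, 8, 10)},
]

def bulk_extract(rows):
    """Bulk extraction: return every applicable row on each run."""
    return list(rows)

def change_based_extract(rows, last_extracted_at):
    """Change-based extraction: return only rows inserted or updated
    after the previous extraction time."""
    return [r for r in rows if r["updated_at"] > last_extracted_at]

def apply_changes(warehouse, changed_rows):
    """Insert new warehouse records and update existing ones, keyed by id."""
    inserted, updated = 0, 0
    for row in changed_rows:
        if row["id"] in warehouse:
            updated += 1
        else:
            inserted += 1
        warehouse[row["id"]] = row
    return inserted, updated

# Warehouse state left by an earlier load (row 2 is stale, row 3 is missing).
warehouse = {
    1: source_rows[0],
    2: {"id": 2, "name": "Gadgit", "updated_at": datetime(2019, 8, 2)},
}
changes = change_based_extract(source_rows, datetime(2019, 8, 5))
counts = apply_changes(warehouse, changes)
print(len(bulk_extract(source_rows)), [r["id"] for r in changes], counts)
```

The insert-versus-update decision in `apply_changes` is the "complex programming" the slide refers to; real sources without a reliable timestamp need other change-detection methods.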
Extraction Tools
Criteria for Identifying Extraction Tool
• The Source System Platform and Database
  – Tools cannot access all types of data sources on all types of computing platforms.
• Built-in Extraction or Duplication Functionality
  – The availability of built-in extraction or duplication reduces the technical difficulties inherent in the data extraction process.
Extraction Tools
Extraction tools include:
• Apertus Carleton. Passport
  – Users enter extraction and transformation parameters, and data is filtered against domains and ranges of legal values.
• Evolutionary Technologies. ETI Extract.
  – Users write transformation rules. Data is filtered against domains and ranges of legal values and compared to other data structures.
• Platinum. InfoPump
  – A data pump product designed to extract data from several mainframe and client/server platforms, perform some filtering and transformation, and distribute and load to another mainframe platform database.
  – Requires InfoHub for most services.
  – The extraction uses custom code modules. This is a client/server based tool.
Transformation Phase
Importance of Quality Data
Creating Business Rules
Tools are available to create Reusable Transformation Modules or objects
Simple Data Transformation which includes Date, Number and Character
Conversion
Assigning Surrogate Keys
Combining from Separate Sources
Validating one to one and one to many Relationships
Transforming Data
Importance of Quality data.
Transformation
Transforming data : Problems and Solutions
Transformation Techniques
Transformation Tools
Importance of Quality Data
Quality Data:
• Before the extracted data is transformed, the quality of the data has to be checked.
• Once quality data is transformed, there is minimal need to change the data at the target, which reduces inconsistencies between source and target.
• Data quality management is an approach to enterprise information that ensures your data is consistent, accurate and reliable.
Data Quality Assurance
Characteristics of Quality Data
Accurate
Complete
Consistent
Unique
Timely
Data Quality Assurance
• Data Quality Tools assist warehousing teams with the task of locating and correcting data errors.
• Corrections can be made to the source or to the target. When corrections are made only to the target, they cause inconsistencies between the source and target data, which create synchronization problems.
Data Quality Tools
• Though dirty data continues to be one of the biggest issues for data warehousing initiatives, research indicates that data quality investments are a small percentage of total warehouse spending.
• These are some of the Data Quality Tools available:
  – DataFlux. Data Quality Workbench.
  – Pine Cone Systems. Content Tracker.
  – Prism. Quality Manager.
  – Vality Technology. Integrity Data Reengineering.
Transformation
Transformation :
• Transformation is the process by which extracted data are transformed into an appropriate format.
• The extracted data is put into the staging area, where cleaning and scrubbing take place, and is stored there so that the clean data can then be transformed.
• For the transformation phase, data can come from a cleansing tool.
• After transformation, data goes to the transportation stage.
Transforming Data – Problems
The Common Problems of Data that come out of a Legacy System are :
Inconsistent or Incorrect use of codes and special characters.
A Single Field is Used for unofficial or undocumented purposes.
Overloaded Codes.
Evolving Data.
Missing, Incorrect or Duplicate values.
Transforming Data – some Solutions
There are different solutions available to check whether the data to be loaded is correct or not:
• Cross-Footing
  – A template for the quality data norms can be used to identify the erroneous data by comparing it with the norms in the template.
• Manual Examination
  – A sampling methodology can be selected and a manual examination can be made on the sampled data.
• Process Validation
  – Scripts can be written that identify erroneous records and segregate them.
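A process-validation script of the kind described above might look like the following minimal sketch. The field names, the allowed gender codes and the rules themselves are hypothetical; a real warehouse would take them from its own data quality norms.

```python
# Hypothetical validation rules; field names and allowed codes are assumptions.
VALID_GENDER_CODES = {"M", "F"}

def find_errors(record):
    """Return a list of rule violations for one record (empty if clean)."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if record.get("gender") not in VALID_GENDER_CODES:
        errors.append("invalid gender code")
    if record.get("amount") is None or record["amount"] < 0:
        errors.append("missing or negative amount")
    return errors

def segregate(records):
    """Split records into clean rows and rejected rows with their reasons."""
    clean, rejected = [], []
    seen_ids = set()
    for record in records:
        errors = find_errors(record)
        if record.get("customer_id") in seen_ids:
            errors.append("duplicate customer_id")
        if errors:
            rejected.append((record, errors))
        else:
            seen_ids.add(record["customer_id"])
            clean.append(record)
    return clean, rejected

records = [
    {"customer_id": "C1", "gender": "M", "amount": 10.0},
    {"customer_id": "C1", "gender": "M", "amount": 10.0},  # duplicate value
    {"customer_id": "C2", "gender": "X", "amount": 5.0},   # incorrect code
    {"customer_id": None, "gender": "F", "amount": 7.5},   # missing value
]
clean, rejected = segregate(records)
for record, reasons in rejected:
    print(record, reasons)
```

Rejected rows are kept together with their reasons so that they can be corrected at the source rather than silently dropped.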
Transformation Techniques
Field Splitting and Consolidation :
• A single physical field in the source system needs to be split up into more than one target warehouse field.
• Several source system fields must be consolidated and stored in one single warehouse field.
Example (splitting an address field):
Source address field:
  # 123 ABC Street,
  DEF City,
  Republic of GH
Target warehouse fields:
  No : 123
  Street : ABC STREET
  City : DEF
  Country : GH
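The address example can be reproduced with a short sketch. The parsing pattern is an assumption tailored to this one sample format (street line, city line, country line, separated by commas); real address parsing needs far more rules.

```python
import re

def split_address(address):
    """Split a three-part address (street line, city, country) into fields.
    The pattern is tailored to the sample format and is an assumption."""
    street_line, city_line, country_line = [p.strip() for p in address.split(",")]
    match = re.match(r"#?\s*(\d+)\s+(.+)", street_line)
    if match is None:
        raise ValueError(f"unrecognized street line: {street_line!r}")
    return {
        "No": match.group(1),
        "Street": match.group(2).upper(),
        "City": re.sub(r"\s+City$", "", city_line),
        "Country": re.sub(r"^Republic of\s+", "", country_line),
    }

def consolidate_address(fields):
    """Consolidation: combine several fields back into one field."""
    return f"{fields['No']} {fields['Street']}, {fields['City']}, {fields['Country']}"

parts = split_address("# 123 ABC Street,\nDEF City,\nRepublic of GH")
print(parts)
print(consolidate_address(parts))
```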
Transformation Techniques
Standardization :
• Standards and conventions for abbreviations are applied to individual data items to improve uniformity in both source and target objects.
Example (Order Date):
Before:
  System A : 05 August 2007
  System B : 08-08-07
After:
  System A : August 05 2007
  System B : August 08 2007
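The date example can be sketched with the standard `datetime` module. The per-system source formats are assumptions inferred from the two sample values; in particular "08-08-07" is read here as day-month-year.

```python
from datetime import datetime

# Assumed source formats, inferred from the two sample values.
SOURCE_FORMATS = {
    "System A": "%d %B %Y",  # e.g. "05 August 2007"
    "System B": "%d-%m-%y",  # e.g. "08-08-07", read as day-month-year
}
# Month names from %B depend on the locale; English is assumed here.
TARGET_FORMAT = "%B %d %Y"

def standardize_date(system, raw_value):
    """Parse a date in the source system's format and emit the target format."""
    parsed = datetime.strptime(raw_value, SOURCE_FORMATS[system])
    return parsed.strftime(TARGET_FORMAT)

date_a = standardize_date("System A", "05 August 2007")
date_b = standardize_date("System B", "08-08-07")
print(date_a)
print(date_b)
```

Keeping the source formats in one mapping means a new source system only adds an entry instead of new parsing code.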
Transformation Techniques
De-duplication :
• Rules are defined to identify duplicate records of customers or products. In case of two or more repeated records, they are merged to form one warehouse record.
Example (Customer Name):
  System A : John W Istin
  System B : John William Istin
  Warehouse record : John William Istin
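One possible matching rule for the customer-name example is sketched below. The rule (same first name, same last name, middle names agreeing on their first letter) is an assumption chosen for illustration; production de-duplication usually combines several attributes and fuzzy matching.

```python
def name_key(full_name):
    """Matching rule (an assumption): same first name, same last name,
    and middle names that agree on their first letter."""
    parts = full_name.split()
    first, last = parts[0].lower(), parts[-1].lower()
    middle_initials = "".join(p[0].lower() for p in parts[1:-1])
    return (first, middle_initials, last)

def deduplicate(names):
    """Merge records that share a key, keeping the most complete spelling."""
    merged = {}
    for name in names:
        key = name_key(name)
        if key not in merged or len(name) > len(merged[key]):
            merged[key] = name
    return list(merged.values())

customers = ["John W Istin", "John William Istin", "Mary Ann Lee"]
warehouse_names = deduplicate(customers)
print(warehouse_names)
```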
Transformation Tools
Some of the transformation tools include:
• Apertus Carleton. Enterprise/Integrator.
• Data Mirror. Transformation Server.
• Informatica. PowerMart Designer.
Building DWH in Refresh Phase
Process Slowly Changing Dimensions
Automate the Extract-Transform-Load Cycle.
Incremental Fact Table Extracts.
Purging and Archiving Data.
Steps in data reconciliation at building DWH – EXTRACT
• Capture = extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract
Steps in Data Reconciliation – TRANSFORM
• Scrub = cleanse…uses pattern recognition and AI techniques to upgrade data quality
• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Steps in Data Reconciliation – TRANSFORM (continued)
• Transform = convert data from format of operational system to format of data warehouse
• Record-level:
  – Selection – data partitioning
  – Joining – data combining
  – Aggregation – data summarization
• Field-level:
  – single-field – from one field to one field
  – multi-field – from many fields to one, or one field to many
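The record-level and field-level transformations named on this slide can each be shown in a line or two. The `sales` and `customers` records are hypothetical and exist only to make the sketch runnable.

```python
from collections import defaultdict

# Hypothetical operational records, for illustration only.
sales = [
    {"cust_id": 1, "region": "North", "amount": 100.0},
    {"cust_id": 2, "region": "South", "amount": 50.0},
    {"cust_id": 1, "region": "North", "amount": 25.0},
]
customers = {
    1: {"first": "Asha", "last": "Rao"},
    2: {"first": "Ravi", "last": "Iyer"},
}

# Selection: partition the data, keeping only the rows of interest.
north_sales = [s for s in sales if s["region"] == "North"]

# Joining: combine each sale with its customer record.
joined = [{**s, **customers[s["cust_id"]]} for s in sales]

# Multi-field transformation (many fields to one): build a full name.
for row in joined:
    row["customer_name"] = f"{row['first']} {row['last']}"

# Aggregation: summarize amounts per customer.
totals = defaultdict(float)
for row in joined:
    totals[row["customer_name"]] += row["amount"]

print(len(north_sales), dict(totals))
```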
Steps in Data Reconciliation – TRANSPORT (LOAD)
• Load/Index = place transformed data into the warehouse and create indexes
• Refresh mode: bulk rewriting of target data at periodic intervals
• Update mode: only changes in source data are written to data warehouse
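Refresh mode and update mode can be contrasted with a small sketch. It uses an in-memory SQLite table named `dim_product`, which is a hypothetical stand-in for a warehouse table; `INSERT OR REPLACE` is SQLite syntax, and other databases express the same idea with `MERGE` or similar statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")

def refresh_load(conn, rows):
    """Refresh mode: bulk rewrite of the whole target table."""
    with conn:
        conn.execute("DELETE FROM dim_product")
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)

def update_load(conn, changed_rows):
    """Update mode: write only the changed rows (insert or overwrite by key)."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO dim_product VALUES (?, ?)", changed_rows
        )

refresh_load(conn, [(1, "Widget"), (2, "Gadget")])
update_load(conn, [(2, "Gadget v2"), (3, "Gizmo")])
rows = conn.execute("SELECT * FROM dim_product ORDER BY product_id").fetchall()
print(rows)
```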
Transporting Phase
Insert statements create Logs
Bulk Loader is advisable
Truncate target tables before full refresh
Index Management
Drop and re-index.
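The full-refresh sequence listed above (empty the target, drop the index, bulk load, re-index) can be sketched as follows. The `fact_sales` table and index name are hypothetical, and SQLite is used only because it ships with Python; it has no TRUNCATE statement, so `DELETE` without a `WHERE` clause stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, amount REAL)"
)
conn.execute("CREATE INDEX idx_sales_product ON fact_sales (product_id)")

def full_refresh(conn, rows):
    with conn:
        # SQLite has no TRUNCATE; DELETE without WHERE empties the table.
        conn.execute("DELETE FROM fact_sales")
        # Drop the index so the bulk insert does not maintain it row by row.
        conn.execute("DROP INDEX IF EXISTS idx_sales_product")
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
        # Rebuild the index once, after all rows are in place.
        conn.execute("CREATE INDEX idx_sales_product ON fact_sales (product_id)")

full_refresh(conn, [(i, i % 10, float(i)) for i in range(1000)])
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('fact_sales')")]
print(row_count, index_names)
```

On a production RDBMS the `executemany` call would typically be replaced by the vendor's bulk loader, as the slide advises.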
Transporting Data
Transporting data into Warehouse
Building the Transportation Process
Transporting the data
Post processing of loaded data
Transporting Data into Warehouse
• The transformed data is transported into the data warehouse.
• The load images are transported through the loaders into the warehouse.
Data Loaders :
• Data loaders load transformed data into the data warehouse.
• Stored procedures can be used to handle the warehouse loading if the load images are available in the same RDBMS engine.
Transporting Data into Warehouse
[Figure: Source Data → (extract) → Staging Area → (load) → Warehouse Schema]
Transporting Data into Warehouse
Warehouse Schema :
• This is the dimensional model (dimensions and facts).
Staging Area :
• This is the workspace where data is held ready after cleaning. It minimizes the time required to prepare the data.
Source Data :
• This can be a flat file, an Oracle table or some other form.
Building the Transporting Process
For transporting data we can use:
• PL/SQL scripts
• SQL*Loader routines for flat files
• ETL tool
SQL*Loader can put the data from flat files directly into the tables, so it is used for bulk loading. It can load both variable-length and fixed-format files.
Datawarehouse Building
Abstract View of a Data Warehouse Building
[Figure: Sources A, B and C (Parts A, B and C) pass through Extraction, Transformation and Categorization of transaction data; labels also include Operational, Analytical and User’s View]
ETVL Tools
The following are popular ETL tools:
Oracle Warehouse Builder
Informatica
Sagent
SAS Warehouse Administrator
LEAVING A METADATA TRAIL
Defining Warehouse Metadata
Developing a Metadata Strategy
Examining types of Metadata
Metadata Management Tools
Common Warehouse Metadata
Metadata
What is Metadata?
Traditionally defined as data about data
Form of abstraction that describes the structure and contents of the data
warehouse
Importance of Metadata
Metadata establish the context of the Warehouse data
Metadata facilitate the Analysis Process
Metadata are a form of Audit Trail for Data Transformation
Metadata Improve or Maintain Data Quality
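One way to picture metadata acting as an audit trail for data transformation is to record, for every transformation step, what was changed and how many rows survived. The sketch below is only an illustration under stated assumptions: the TransformationRecord fields, the apply_with_metadata helper, and the sample rule are hypothetical and do not come from any particular metadata management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransformationRecord:
    """Hypothetical metadata entry describing one transformation step."""
    source_column: str
    target_column: str
    rule: str
    rows_in: int
    rows_out: int
    run_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def apply_with_metadata(rows, source_column, target_column, rule, transform,
                        audit_trail):
    """Apply `transform` to one column and log what happened as metadata."""
    out = []
    for row in rows:
        value = transform(row[source_column])
        if value is not None:  # rows that fail the rule are dropped
            new_row = dict(row)
            new_row[target_column] = value
            out.append(new_row)
    audit_trail.append(
        TransformationRecord(source_column, target_column, rule,
                             len(rows), len(out))
    )
    return out


def parse_amount(text):
    """Return the amount as a float, or None when it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None


audit_trail = []
rows = [{"amount": "10.5"}, {"amount": "n/a"}, {"amount": "4"}]
clean = apply_with_metadata(
    rows, "amount", "dollars_sold",
    "parse as float; drop unparseable values", parse_amount, audit_trail,
)

for record in audit_trail:
    print(record.rule, record.rows_in, "->", record.rows_out)
```

The difference between rows_in and rows_out is the kind of fact an analyst needs in order to judge data quality, which is why the trail supports both context and quality.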
Quick Recap
ETL Process
• Extracting Data: the process of getting data from a legacy system or any data source.
• Transforming Data: the process in which extracted data are transformed into the appropriate format.
• Loading Data: the transformed data is loaded into the warehouse.
Building the Data Warehouse using
• Extracting techniques
• Transforming techniques
• Transporting techniques
ETVL Tools
Meta Data & Importance
• Metadata is data about data and is important for data transformation.
QUIZ
1. Which of the following is not an OLAP tool?
Oracle Express / OWB / Cognos / MicroStrategy
Answer: OWB
2. Which of the following should be the goals of the ETL application development process?
• Modular and re-usable code
• Self documenting process flows
• Fully metadata aware process
• All of the above
Answer: All of the above
3. The information about the data, i.e. data about the data, is kept in:
RDBMS / DBMS / Metadata
Answer: Metadata
4. Hand coded extraction techniques allow extraction in a cost effective manner.
True / False
Answer: True
5. How do you handle slowly changing dimensions?
• Manually handle them
• Using a data staging tool
• Both of the above
Answer: Both of the above
Module – 4
Data Warehouse vs
Data Mart
Topics to be covered
This Module provides
What is a Data Mart
Data Mart-Approaches
Top-Down Approach
Bottom-Up Approach
Hybrid Approach
Conceptual Modeling of Data Warehouse with Examples
Star Schema
Snow Flake Schema
Fact Constellations
Data Mart
Data mart is:
A functional segment of an enterprise restricted for
purposes of security, locality, performance, or business
necessity using modeling and information delivery
techniques identical to data warehousing.
Data Mart Approaches
Physical data warehouse:
• Data warehouse --> data marts (top-down)
• Data marts --> data warehouse (bottom-up)
• Parallel data warehouse and data marts (hybrid)
Top-down
Physical Data Warehouse: Data Warehouse --> Data Marts
[Diagram: Source Data (External Data, Operational Data) --> Staging Area --> Data Warehouse --> Data Marts]
Bottom-up approach
Physical Data Warehouse: Data Marts --> Data Warehouse
[Diagram: Source Data (External Data, Operational Data) --> Staging Area --> Data Marts --> Data Warehouse]
Hybrid
Physical Data Warehouse: Parallel Data Warehouse & Data Marts
[Diagram: Source Data (External Data, Operational Data) --> Staging Area --> Data Warehouse and Data Marts, built in parallel]
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema
Snowflake schema
Fact constellations
Example of Star Schema
Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
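The star schema above can be tried out directly. The following is a minimal sketch using SQLite from Python: the column names follow the slide, while the dim_ and fact_ table prefixes, the column types, and all sample rows are assumptions. The avg_sales measure is left out here on the assumption that it can be derived at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one denormalized table per dimension.
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
    month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE dim_branch (
    branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    province_or_state TEXT, country TEXT);

-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    branch_key INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold INTEGER,
    dollars_sold REAL);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 5, "Saturday", 1, "Q1", 2019),
    (2, 9, "Tuesday", 4, "Q2", 2019),
])
conn.executemany("INSERT INTO dim_item VALUES (?, ?, ?, ?, ?)", [
    (1, "laptop", "Acme", "computer", "wholesale"),
    (2, "phone", "Acme", "mobile", "retail"),
])
conn.execute("INSERT INTO dim_branch VALUES (1, 'Main', 'store')")
conn.execute(
    "INSERT INTO dim_location VALUES (1, '1 High St', 'Pune', 'Maharashtra', 'India')"
)
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 2, 2000.0),
    (1, 2, 1, 1, 3, 900.0),
    (2, 1, 1, 1, 1, 1000.0),
])

# A typical star query: join the fact table to the dimensions it needs,
# then group by dimension attributes and aggregate the measures.
sales_by_quarter_and_item = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_time t ON t.time_key = f.time_key
    JOIN dim_item i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(sales_by_quarter_and_item)
```

Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.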
Example of Snowflake Schema
Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
• supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
• city: city_key, city, province_or_state, country
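The effect of normalizing a dimension can be seen on the location and city tables alone. This is a small sketch under assumptions: SQLite is used for convenience, the dim_ and fact_ prefixes and the sample rows are hypothetical, and the other dimensions are omitted to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- City attributes are stored once, in their own table.
CREATE TABLE dim_city (
    city_key INTEGER PRIMARY KEY, city TEXT,
    province_or_state TEXT, country TEXT);

-- The location dimension keeps only a key pointing at the city.
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES dim_city(city_key));

CREATE TABLE fact_sales (
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold INTEGER, dollars_sold REAL);

-- Hypothetical sample rows: two locations share the same city.
INSERT INTO dim_city VALUES
    (1, 'Pune', 'Maharashtra', 'India'),
    (2, 'Toronto', 'Ontario', 'Canada');
INSERT INTO dim_location VALUES
    (10, '1 High St', 1), (11, '2 Low Rd', 1), (12, '3 King St', 2);
INSERT INTO fact_sales VALUES
    (10, 2, 200.0), (11, 1, 100.0), (12, 4, 400.0);
""")

# Reaching a city attribute now takes two joins instead of one.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    JOIN dim_city c ON c.city_key = l.city_key
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(sales_by_city)
```

The trade-off is visible in the data: three locations need only two city rows, so the redundancy is gone, but the query pays for it with an extra join.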
Example of Fact Constellation
Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
• Keys: time_key, item_key, shipper_key, from_location, to_location
• Measures: dollars_cost, units_shipped
Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type
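A fact constellation lets two fact tables be compared through a dimension they share. The sketch below keeps only the shared item dimension and is an illustration under assumptions: SQLite is used for convenience, and the dim_ and fact_ prefixes and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared dimension.
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);

-- Two fact tables that both reference it.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    units_sold INTEGER, dollars_sold REAL);
CREATE TABLE fact_shipping (
    item_key INTEGER REFERENCES dim_item(item_key),
    units_shipped INTEGER, dollars_cost REAL);

-- Hypothetical sample rows.
INSERT INTO dim_item VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO fact_sales VALUES (1, 2, 2000.0), (1, 1, 1000.0), (2, 3, 900.0);
INSERT INTO fact_shipping VALUES (1, 3, 30.0), (2, 2, 10.0), (2, 1, 5.0);
""")

# Aggregate each fact table separately, then join the results on the shared
# dimension. Joining the two fact tables row by row would multiply rows and
# inflate the sums.
sold_vs_shipped = conn.execute("""
    SELECT i.item_name, s.units_sold, sh.units_shipped
    FROM dim_item i
    JOIN (SELECT item_key, SUM(units_sold) AS units_sold
          FROM fact_sales GROUP BY item_key) s
      ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS units_shipped
          FROM fact_shipping GROUP BY item_key) sh
      ON sh.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
print(sold_vs_shipped)
```

Because both fact tables use the same item_key values, results from the two processes line up on the same item without any mapping step.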
Quick Recap
In this module we have seen the following topics.
• What is a Data Mart? The subset of the data warehouse related to a single subject area.
• Various approaches to build the Data Mart
  • Top-Down Approach: Data Warehouse to Data Marts
  • Bottom-Up Approach: Data Marts to Data Warehouse
  • Hybrid Approach: Data Warehouse and Data Marts are built in parallel
• Conceptual Modeling Using
  • Star Schema: single fact table surrounded by multiple dimensions
  • Snow Flake Schema: single fact table surrounded by normalized dimensions
  • Fact Constellations: one or more fact tables surrounded by dimensions
• Examples of the Modeling Techniques
QUIZ
1. Which type of Data Warehouse schema normalizes dimensions to eliminate redundancy?
Star Schema / Snowflake Schema
Answer: Snowflake Schema
2. Data marts always have multiple subject areas.
True / False
Answer: False
3. In a fact constellation, there are many fact tables sharing the same dimension tables.
True / False
Answer: True
4. An Enterprise Warehouse can be built by combining the Data Marts.
True / False
Answer: True
5. Which of the following are the entry points in the Warehouse?
Fact tables / Dimension tables
Answer: Dimension tables
mahindrasatyam.com
Safe Harbor
This document contains forward-looking statements within the meaning of section 27A of Securities Act of 1933, as amended, and section 21E of the Securities Exchange Act of 1934, as amended. The forward-looking statements contained herein are subject to certain risks and uncertainties that could cause actual results to differ materially from those reflected in the forward-looking statements. We undertake no duty to update any forward-looking statements. For a discussion of the risks associated with our business, please see the discussions under the heading “Risk Factors” in our report on Form 6-K concerning the quarter ended September 30, 2008, furnished to the Securities and Exchange Commission on 07 November, 2008, and the other reports filed with the Securities and Exchange Commission from time to time. These filings are available at http://www.sec.gov
Thank you