ch 1 intro_dw

SUSHIL KULKARNI DATA WAREHOUSING

Upload: sushil-kulkarni

Post on 20-May-2015


Category: Technology



DESCRIPTION

This gives an idea about Data Warehouse

TRANSCRIPT

Page 1: Ch 1 intro_dw

SUSHIL KULKARNI

DATA WAREHOUSING

Page 2: Ch 1 intro_dw

A producer wants to know...

• Which are our lowest/highest margin customers?

• Who are my customers and what products are they buying?

• Which customers are most likely to go to the competition?

• What impact will new products/services have on revenue and margins?

• What product promotions have the biggest impact on revenue?

• What is the most effective distribution channel?

Page 3: Ch 1 intro_dw

Lots of data everywhere, yet ...

• I can’t find the data I need

– data is scattered over the network

– many versions, subtle differences

• I can’t get the data I need

– need an expert to get the data

• I can’t understand the data I found

– available data poorly documented

• I can’t use the data I found

– results are unexpected

– data needs to be transformed from one form to another

Page 4: Ch 1 intro_dw

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

Page 5: Ch 1 intro_dw

What users say...

• Data should be integrated across the enterprise

• Summary data has a real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required

Page 6: Ch 1 intro_dw

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]


Page 7: Ch 1 intro_dw

Evolution

• 60’s: Batch reports

– hard to find and analyze information

– inflexible and expensive, reprogram for every new request

• 70’s: Terminal-based DSS and EIS (executive information systems)

– still inflexible, not integrated with desktop tools

• 80’s: Desktop data access and analysis tools

– query tools, spreadsheets, GUIs

– easier to use, but only access operational databases

• 90’s: Data warehousing with integrated OLAP engines and tools

Page 8: Ch 1 intro_dw

Warehouses are Very Large Databases

[Bar chart: share of respondents (0% to 35%) by warehouse size (5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB), comparing initial size with projected size for 2Q96. Source: META Group, Inc.]

Page 9: Ch 1 intro_dw

Very Large Data Bases

• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes

• Petabytes -- 10^15 bytes: Geographic Information Systems

• Exabytes -- 10^18 bytes: National Medical Records

• Zettabytes -- 10^21 bytes: Weather images

• Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Page 10: Ch 1 intro_dw

Data Warehousing: It is a process

• A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making possible decisions that were not previously possible

• A decision support database maintained separately from the organization’s operational database

Page 11: Ch 1 intro_dw

Data Warehouse

• A data warehouse is a

– subject-oriented

– integrated

– time-varying

– non-volatile

collection of data that is used primarily

in organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

Page 12: Ch 1 intro_dw

Data Warehouse: Subject-oriented

Customers: Get information on the different prices of a beer

Farmers: Harvest information from known access paths

Page 13: Ch 1 intro_dw

Data Warehouse: Subject-oriented

Students: Get information about various universities in the U.K.

Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Page 14: Ch 1 intro_dw

Data Warehouse: Subject-oriented

• Focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing

• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Page 15: Ch 1 intro_dw

Data Warehouse: Subject-oriented

[Diagram: the enterprise “database” (customers, vendors, orders, transactions, etc.) is copied, organized and summarized into the data warehouse, which feeds data mining.]

Data Miners:

• “Farmers”: they know

• “Explorers”: unpredictable

Page 16: Ch 1 intro_dw

Data Warehouse: Time-variant

Used to study trends and changes

Page 17: Ch 1 intro_dw

Data Warehouse: Time-variant

• The time horizon for the data warehouse is significantly longer than that of operational systems

– Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)

– Operational database: current value data

• Every key structure in the data warehouse

– Contains an element of time, explicitly or implicitly, while the key of operational data may or may not contain a “time element”
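The difference between a current-value key and a key that carries a time element can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (account_current, account_history), the columns and the sample balances are assumptions made for the sketch, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational style: one row per account, holding only the current value.
conn.execute(
    "CREATE TABLE account_current (account_id INTEGER PRIMARY KEY, balance REAL)"
)

# Warehouse style: the key includes a time element, so history accumulates.
conn.execute(
    """
    CREATE TABLE account_history (
        account_id INTEGER,
        snapshot_date TEXT,
        balance REAL,
        PRIMARY KEY (account_id, snapshot_date)
    )
    """
)

# The operational update overwrites the old value.
conn.execute("INSERT INTO account_current VALUES (1, 500.0)")
conn.execute("UPDATE account_current SET balance = 650.0 WHERE account_id = 1")

# The warehouse keeps one row per snapshot date (hypothetical dates).
conn.executemany(
    "INSERT INTO account_history VALUES (?, ?, ?)",
    [(1, "2015-01-31", 500.0), (1, "2015-02-28", 650.0)],
)

current_rows = conn.execute("SELECT * FROM account_current").fetchall()
history_rows = conn.execute(
    "SELECT snapshot_date, balance FROM account_history "
    "WHERE account_id = 1 ORDER BY snapshot_date"
).fetchall()
print(current_rows)
print(history_rows)
```

Only the history table can answer a question such as "how did the balance change between January and February", which is the kind of trend analysis the time-variant property is meant to support.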

Page 18: Ch 1 intro_dw

Data Warehouse: Non-volatile

Cannot be updated by end users

Page 19: Ch 1 intro_dw

Data Warehouse Architecture

[Diagram: source data (relational databases, legacy data, purchased data, ERP systems) passes through extraction and cleansing and an optimized loader into the data warehouse engine, supported by a metadata repository; users analyze and query the warehouse.]

Page 20: Ch 1 intro_dw

Data Mart

• A Data Mart is a smaller, more focused Data Warehouse: a mini-warehouse.

• A Data Mart typically reflects the business rules of a specific business unit within an enterprise.

Page 21: Ch 1 intro_dw

Data Warehouse to Data Mart

[Diagram: the data warehouse feeds three data marts, each of which provides decision support information.]

Page 22: Ch 1 intro_dw

DATA MARTS

• Create many DMs

• Limited scope

Examples:

1. Financial DM

2. Marketing DM

3. Supply chain DM

Page 23: Ch 1 intro_dw

Generic Architecture of Data

(synonym) Transaction data

Page 24: Ch 1 intro_dw

Transaction (Operational) Data

• Operational (production) systems create (massive numbers of) transactions, such as sales, purchases, deposits, withdrawals, returns, refunds, phone calls, toll roads, web site “hits”, etc.

• Transactions are the base level of data, the raw material for understanding customer behavior

• Unfortunately, operational systems change due to changing business needs

• Fortunately, operational systems can usually be changed to support changing business needs

• Data warehousing strategies need to be aware of operational system changes

Page 25: Ch 1 intro_dw

Operational Summary Data

Summaries are for a specific time period and utilize the transaction data for that time period

Other Examples???

Page 26: Ch 1 intro_dw

Decision Support Summary Data

• The data that are used to help make decisions about the business

– Financial data, such as:

• Income Statements (Profit & Loss)

• Balance Sheets (Assets – Liabilities = Net Worth)

– Sales summaries

– Other examples???

• Data warehouses maintain this type of data; however, financial data “of record” (for audit purposes) usually comes from databases and not the data warehouse (confusing???)

• Generally, it is a bad idea to use the same system for analytic and operational purposes

Page 27: Ch 1 intro_dw

Data Warehouse for Decision Support

• Putting information technology to work to help the knowledge worker make faster and better decisions

– Which of my customers are most likely to go to the competition?

– What product promotions have the biggest impact on revenue?

– How did the share price of software companies correlate with profits over the last 10 years?

Page 28: Ch 1 intro_dw

Decision Support

• Used to manage and control business

• Data is historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can be ad-hoc

• Used by managers and end-users to understand the business and make judgements

Page 29: Ch 1 intro_dw

Database Schema

• Database schema defines the structure of data, not the values of the data (e.g., first name, last name = structure; Ron Norman = values of the data)

• In RDBMS:

– Columns = fields = attributes (A,B,C)

– Rows = records = tuples (1-7)
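The structure/values distinction can be seen directly in a small SQLite session driven from Python. The table name person and its two columns are assumptions chosen to match the slide's "first name, last name" example; this is a sketch, not a prescribed layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO person VALUES ('Ron', 'Norman')")

# Structure (schema): the column names, as reported by SQLite's table_info pragma.
columns = [row[1] for row in conn.execute("PRAGMA table_info(person)")]

# Values (data): the rows, returned as tuples.
rows = conn.execute("SELECT * FROM person").fetchall()

print("columns:", columns)
print("rows:", rows)
```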

Page 30: Ch 1 intro_dw

Logical Database Schema

• Describes data in a way that is familiar to business users

Page 31: Ch 1 intro_dw

Physical Database Schema

• Describes the data the way it will be stored in an RDBMS, which might be different from the way the logical schema shows it

Page 32: Ch 1 intro_dw

Metadata

• General definition: Data about data !!!

– Examples:

• A library’s card catalog (metadata) describes publications (data)

• A file system maintains permissions (metadata) about files (data)

• A form of system documentation including:

– Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.)

– Description of the contents of each field (e.g., start date)

– Date when data were loaded

– Indication of currency of the data (last updated)

– Mappings between systems (e.g., A.this = B.that)

• Invaluable; otherwise you have to research to find it
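One way to picture the documentation items listed on this slide is as a small record attached to each field. The sketch below is an assumption-laden illustration: the class name FieldMetadata, its attributes, the dates and the description text are all invented here, and the allowed-value set holds only the example state codes the slide mentions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FieldMetadata:
    """Hypothetical metadata record for one warehouse field."""

    name: str
    description: str
    allowed_values: Optional[frozenset] = None  # None means "no restriction"
    loaded_on: Optional[date] = None            # date when data were loaded
    last_updated: Optional[date] = None         # currency of the data
    source_mapping: Optional[str] = None        # mapping between systems

    def allows(self, value) -> bool:
        """Return True if the value is legal for this field."""
        return self.allowed_values is None or value in self.allowed_values


state_meta = FieldMetadata(
    name="state",
    description="Two-letter state code",
    allowed_values=frozenset({"AZ", "CA", "OR", "UT", "WA"}),
    loaded_on=date(2015, 5, 1),      # hypothetical date
    last_updated=date(2015, 5, 20),  # hypothetical date
    source_mapping="A.this = B.that",
)

print(state_meta.allows("CA"))
print(state_meta.allows("XX"))
```

Keeping such records next to the data means a user can check what a field holds and how fresh it is without asking an expert.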

Page 33: Ch 1 intro_dw

Business Rules

• Highest level of abstraction from operational (transaction) data

• Describes why relationships exist and how they are applied

• Examples:

– Need to have 3 forms of ID for credit

– Only allow a maximum daily withdrawal of $200

– After the 3rd log-in attempt, lock the log-in screen

– Accept no bills larger than $20

– Others???
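Some of the example rules above can be written down as small predicates, which makes it clear that a business rule sits above the data it constrains. The function and constant names below are hypothetical, and the lock-out rule is interpreted here as "three failed attempts", which is an assumption about what the slide intends.

```python
MAX_DAILY_WITHDRAWAL = 200   # dollars, from the slide's example
MAX_LOGIN_ATTEMPTS = 3       # from the slide's example
LARGEST_ACCEPTED_BILL = 20   # dollars, from the slide's example


def can_withdraw(withdrawn_today: float, amount: float) -> bool:
    """Allow a withdrawal only if the daily total stays within the limit."""
    return amount > 0 and withdrawn_today + amount <= MAX_DAILY_WITHDRAWAL


def is_locked(failed_attempts: int) -> bool:
    """Lock the log-in screen once the third attempt has failed (assumed reading)."""
    return failed_attempts >= MAX_LOGIN_ATTEMPTS


def accepts_bill(denomination: int) -> bool:
    """Accept no bills larger than the configured denomination."""
    return denomination <= LARGEST_ACCEPTED_BILL


print(can_withdraw(150, 50))   # exactly reaches the limit
print(can_withdraw(150, 60))   # would exceed the limit
print(is_locked(3))
print(accepts_bill(50))
```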

Page 34: Ch 1 intro_dw

General Architecture for Data Warehousing

• Source systems

• Extraction, (Clean), Transformation, & Load (ETL)

• Central repository

• Metadata repository

• Data marts

• Operational feedback

• End users (business)
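The extract, clean, transform and load steps in the list above can be sketched as a tiny pipeline. Everything concrete here is an assumption made for illustration: the two in-memory "source systems" (crm_rows, billing_rows), their field names, the cleaning rule (drop records without a customer id) and the single sales table standing in for the central repository.

```python
import sqlite3

# Hypothetical source systems with inconsistent field names and formats.
crm_rows = [
    {"cust": "C1", "state": "ca", "amount": "100.50"},
    {"cust": "C2", "state": "WA ", "amount": "20"},
]
billing_rows = [
    {"customer_id": "C1", "st": "CA", "total": 30.0},
    {"customer_id": None, "st": "OR", "total": 5.0},  # unusable record
]


def extract():
    """Pull records from each source into one common record shape."""
    for r in crm_rows:
        yield {"customer_id": r["cust"], "state": r["state"],
               "amount": r["amount"], "source": "crm"}
    for r in billing_rows:
        yield {"customer_id": r["customer_id"], "state": r["st"],
               "amount": r["total"], "source": "billing"}


def clean(records):
    """Drop records that cannot be attributed to a customer."""
    for rec in records:
        if rec["customer_id"] is not None:
            yield rec


def transform(records):
    """Reconcile representations: upper-case state codes, numeric amounts."""
    for rec in records:
        yield (rec["customer_id"], rec["state"].strip().upper(),
               float(rec["amount"]), rec["source"])


def load(conn, rows):
    """Write the reconciled rows into the central repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id TEXT, state TEXT, amount REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(clean(extract())))

totals = warehouse.execute(
    "SELECT customer_id, SUM(amount) FROM sales "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
states = warehouse.execute(
    "SELECT DISTINCT state FROM sales ORDER BY state"
).fetchall()
print(totals)
print(states)
```

A data mart could then be derived from the sales table as a filtered or summarized subset for one business unit.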

Page 35: Ch 1 intro_dw

DATA WAREHOUSE SCOPE

Broad:

Required for companies; very costly; may be divided according to departments.

Narrow:

Required for personal information

Page 36: Ch 1 intro_dw

Design of a Data Warehouse: A Business Analysis Framework

• Four views regarding the design of a data warehouse

– Top-down view

• allows selection of the relevant information necessary for the data warehouse

– Data source view

• exposes the information being captured, stored, and managed by operational systems

– Data warehouse view

• consists of fact tables and dimension tables

– Business query view

• sees the perspectives of data in the warehouse from the view of the end-user

Page 37: Ch 1 intro_dw

Data Warehouse Design Process

• Top-down, bottom-up approaches or a combination of both

– Top-down: Starts with overall design and planning

– Bottom-up: Starts with experiments and prototypes (rapid)

• From software engineering point of view

– Waterfall: structured and systematic analysis at each step before proceeding to the next

– Spiral: rapid generation of increasingly functional systems, with short turnaround time

• Typical data warehouse design process

– Choose a business process to model, e.g., orders, invoices, etc.

– Choose the grain (atomic level of data) of the business process

– Choose the dimensions that will apply to each fact table record

– Choose the measure that will populate each fact table record
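The four design steps can be followed on a toy example. In the sketch below the business process is assumed to be orders, the grain one row per order line, the dimensions date and product, and the measures quantity and amount; all table names, column names and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions that apply to each fact table record.
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, full_date TEXT, month INTEGER, year INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT
    );

    -- Fact table: grain is one row per order line; measures are quantity and amount.
    CREATE TABLE fact_order_line (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        amount REAL
    );

    INSERT INTO dim_date VALUES (20150101, '2015-01-01', 1, 2015);
    INSERT INTO dim_date VALUES (20150102, '2015-01-02', 1, 2015);
    INSERT INTO dim_product VALUES (1, 'Tea', 'Beverages');
    INSERT INTO dim_product VALUES (2, 'Coffee', 'Beverages');
    INSERT INTO dim_product VALUES (3, 'Bread', 'Bakery');
    INSERT INTO fact_order_line VALUES (20150101, 1, 2, 10.0);
    INSERT INTO fact_order_line VALUES (20150101, 3, 1, 4.0);
    INSERT INTO fact_order_line VALUES (20150102, 2, 1, 6.0);
    """
)

# A typical analytical query: aggregate a measure along one dimension.
revenue_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.amount)
    FROM fact_order_line AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
print(revenue_by_category)
```

Choosing a finer grain (order line rather than whole order) keeps more detail at the cost of more fact rows; coarser summaries can always be derived from it, but not the other way round.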

Page 38: Ch 1 intro_dw

Multi-Tiered Architecture

[Diagram: data sources (operational DBs, other sources) pass through a monitor and integrator (extract, transform, load, refresh) into data storage (the data warehouse, data marts, and metadata); an OLAP server (OLAP engine) serves front-end tools for analysis, query, reports, and data mining.]

Page 39: Ch 1 intro_dw

Three Data Warehouse Models

• Enterprise warehouse

– collects all of the information about subjects spanning the entire organization

• Data Mart

– a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart

• Independent vs. dependent (directly from warehouse) data mart

• Virtual warehouse

– A set of views over operational databases

– Only some of the possible summary views may be materialized
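The virtual warehouse idea, a set of views over operational tables with only some summaries materialized, can be sketched as follows. The orders table and view names are assumptions for the example, and because SQLite has no materialized views, a snapshot table created with CREATE TABLE ... AS SELECT stands in for one here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'North', 100.0);
    INSERT INTO orders VALUES (2, 'North', 50.0);
    INSERT INTO orders VALUES (3, 'South', 75.0);

    -- Virtual summary: recomputed from the operational table on every query.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Stand-in for a materialized summary: a stored snapshot of the view.
    CREATE TABLE sales_by_region_snapshot AS SELECT * FROM sales_by_region;

    -- A new operational transaction arrives after the snapshot was taken.
    INSERT INTO orders VALUES (4, 'South', 25.0);
    """
)

virtual = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
snapshot = conn.execute(
    "SELECT region, total FROM sales_by_region_snapshot ORDER BY region"
).fetchall()
print("view:    ", virtual)
print("snapshot:", snapshot)
```

The view always reflects current operational data but puts query load on the operational database; the snapshot is cheap to read but goes stale until it is refreshed.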

Page 40: Ch 1 intro_dw

Data Mining works with Warehouse Data

• Data Warehousing provides the Enterprise with a memory

• Data Mining provides the Enterprise with intelligence

Page 41: Ch 1 intro_dw

We want to know ...

• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

• If I raise the price of my product by Rs. 2, what is the effect on my ROI?

• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

• Which of my customers are likely to be the most loyal?

Data Mining helps extract such information

Page 42: Ch 1 intro_dw

Application Areas

Industry | Application

Finance | Credit card analysis

Insurance | Claims, fraud analysis

Telecommunication | Call record analysis

Transport | Logistics management

Consumer goods | Promotion analysis

Data service providers | Value added data

Utilities | Power usage analysis

Page 43: Ch 1 intro_dw

Data Mining in Use

• Data Mining can be used to track fraud

• A Supermarket becomes an information broker

• Basketball teams use it to track game strategy

• Cross Selling

• Warranty Claims Routing

• Holding on to Good Customers

• Weeding out Bad Customers

Page 44: Ch 1 intro_dw

Two Systems

• Operational System

• Information System

Page 45: Ch 1 intro_dw

Operational Systems

• Run the business in real time

• Based on up-to-the-second data

• Optimized to handle large numbers of simple read/write transactions

• Optimized for fast response to predefined transactions

• Used by people who deal with customers and products: clerks, salespeople, etc.

• They are increasingly used by customers

Page 46: Ch 1 intro_dw

Online Transaction Processing (OLTP)

OLTP refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing

Page 47: Ch 1 intro_dw

Online Transaction Processing (OLTP)

OLTP technology is used in a number of industries, including banking, airlines, mail order, supermarkets, and manufacturing. Applications include electronic banking, order processing, employee time clock systems, e-commerce, and eTrading. The most widely used OLTP system is probably IBM's CICS.

Page 48: Ch 1 intro_dw

What are Operational Systems?

• They are OLTP systems

• Run mission critical applications

• Need to work with stringent performance requirements for routine tasks

• Used to run a business!

Page 49: Ch 1 intro_dw

RDBMS used for OLTP

• Database systems have been used traditionally for OLTP

– clerical data processing tasks

– detailed, up to date data

– structured repetitive tasks

– read/update a few records

– isolation, recovery and integrity are critical
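A typical OLTP unit of work reads and updates a few records and must either complete fully or leave no trace. The sketch below uses Python's sqlite3 connection as a context manager, which commits on success and rolls back on an exception; the account table, the CHECK constraint and the transfer function are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one atomic transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint if funds are insufficient,
            # in which case the credit above is rolled back as well.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30.0)
rejected = transfer(conn, 1, 2, 500.0)
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(ok, rejected, balances)
```

The failed transfer leaves both balances untouched, which is the integrity and recovery behaviour the slide calls critical for OLTP.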

Page 50: Ch 1 intro_dw

Operational Summary Data

Summaries are for a specific time period and utilize the transaction data for that time period

Other Examples???

Page 51: Ch 1 intro_dw

Examples of Operational Data

Data | Industry | Usage | Technology | Volumes

Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium

Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large

Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large

Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large

Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium

Page 52: Ch 1 intro_dw

So, what’s different?

Page 53: Ch 1 intro_dw

Application-Orientation vs. Subject-Orientation

Application-orientation (operational database): loans, credit card, trust, savings

Subject-orientation (data warehouse): customer, vendor, product, activity

Page 54: Ch 1 intro_dw

OLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse

• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)

– e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
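The example query above restricts three dimensions at once (time of day, city, month) and aggregates a measure. A minimal sketch in SQLite follows; the calls table, its columns and the sample rows are assumptions, and "9AM-5PM" is read here as calls starting at 09:00 or later and before 17:00.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2014-12-01 10:00:00", 10.0),    # matches all three conditions
        ("Pune", "2014-12-15 16:30:00", 20.0),    # matches all three conditions
        ("Pune", "2014-12-15 20:00:00", 99.0),    # outside 9AM-5PM
        ("Pune", "2014-11-15 10:00:00", 50.0),    # not December
        ("Mumbai", "2014-12-01 10:00:00", 70.0),  # not Pune
    ],
)

(average_amount,) = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND CAST(strftime('%H', started_at) AS INTEGER) BETWEEN 9 AND 16
    """
).fetchone()
print(average_amount)
```

On a large call-record table such a query scans many rows, which is why the slide notes that warehouses need data organization and access methods different from those of an OLTP system.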

Page 55: Ch 1 intro_dw

OLTP vs Data Warehouse

• OLTP

– Application Oriented

– Used to run business

– Detailed data

– Current up to date

– Isolated Data

– Repetitive access

– Clerical User

• Warehouse (DSS)

– Subject Oriented

– Used to analyze business

– Summarized and refined

– Snapshot data

– Integrated Data

– Ad-hoc access

– Knowledge User (Manager)

Page 56: Ch 1 intro_dw

OLTP vs Data Warehouse

• OLTP

– Performance Sensitive

– Few Records accessed at a time (tens)

– Read/Update Access

– No data redundancy

– Database Size 100 MB - 100 GB

• Data Warehouse

– Performance relaxed

– Large volumes accessed at a time (millions)

– Mostly Read (Batch Update)

– Redundancy present

– Database Size 100 GB - few terabytes

Page 57: Ch 1 intro_dw

OLTP vs Data Warehouse

• OLTP

– Transaction throughput is the performance metric

– Thousands of users

– Managed in entirety

• Data Warehouse

– Query throughput is the performance metric

– Hundreds of users

– Managed by subsets

Page 58: Ch 1 intro_dw

To summarize ...

• OLTP Systems are used to “run” a business

• The Data Warehouse helps to “optimize” the business

Page 59: Ch 1 intro_dw

Why a Separate Data Warehouse?

• Performance

– Operational databases are designed and tuned for known transactions and workloads.

– Complex OLAP queries would degrade performance for operational transactions.

– Special data organization, access and implementation methods are needed for multidimensional views and queries.

• Function

– Missing data: Decision support requires historical data, which operational databases do not typically maintain.

– Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.

– Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

Page 60: Ch 1 intro_dw

INFORMATION SYSTEMS

• Designed to support decision-making based on

1. Historical data

2. Prediction data

• Designed for complex queries or data-mining applications.

Examples:

1. Sales trend analysis

2. Customer segmentation

3. Human resources planning

Page 61: Ch 1 intro_dw

INFORMATION SYSTEMS

Page 62: Ch 1 intro_dw

DIFFERENCE

Characteristics | Operational Systems | Informational Systems

Purpose | Real time data entry | Read and analyze historical data

Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers

Scope of usage | Narrow, planned, and simple updates and queries | Broad, ad hoc, complex queries and analysis

Design goal | Performance throughput, availability | Ease of flexible access and use

Volume | Many, constant updates and queries on one or a few table rows | Periodical batch updates and queries requiring many or all rows

Page 63: Ch 1 intro_dw

T H A N K S !