Data Warehousing

DATA WAREHOUSING AND DATA MINING, M. Mageshwari, Lecturer, Department of CE, M.S.P.V.L Polytechnic College


Post on 06-May-2015


TRANSCRIPT

Page 1: Dataware housing

DATA WAREHOUSING AND

DATA MINING

M. Mageshwari, Lecturer, Department of CE

M.S.P.V.L Polytechnic College

Page 2: Dataware housing

Course Overview

• The course: what and how

• 0. Introduction
• I. Data Warehousing
• II. Decision Support and OLAP
• III. Data Mining
• IV. Looking Ahead
• Demos and Labs

Page 3: Dataware housing

A producer wants to know….

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?

Page 4: Dataware housing

Data, data everywhere, yet ...

• I can’t find the data I need
  • data is scattered over the network
  • many versions, subtle differences
• I can’t get the data I need
  • need an expert to get the data
• I can’t understand the data I found
  • available data poorly documented
• I can’t use the data I found
  • results are unexpected
  • data needs to be transformed from one form to another

Page 5: Dataware housing

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.

Page 6: Dataware housing

What are the users saying...

• Data should be integrated across the enterprise

• Summary data has a real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required


Page 7: Dataware housing

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Diagram: Data -> Information]

Page 8: Dataware housing

Evolution

• 60’s: Batch reports
  • hard to find and analyze information
  • inflexible and expensive, reprogram every new request
• 70’s: Terminal-based DSS (decision support systems) and EIS (executive information systems)
  • still inflexible, not integrated with desktop tools

Page 9: Dataware housing

Data Warehouse Structure

• base customer (1985-87)
  • custid, from date, to date, name, phone, dob
• base customer (1988-90)
  • custid, from date, to date, name, credit rating, employer
• customer activity (1986-89) -- monthly summary
• customer activity detail (1987-89)
  • custid, activity date, amount, clerk id, order no
• customer activity detail (1990-91)
  • custid, activity date, amount, line item no, order no

Page 10: Dataware housing

Definition of DSS

• A decision support system (DSS) is defined as a system that helps decision makers at various levels to take decisions.

• This system uses data, analytical models and user-friendly software for taking decisions.

Page 11: Dataware housing

Definition of EIS

• An executive information system (EIS) is defined as a system that helps high-level executives to take policy decisions.

• This system uses higher-level data, analytical models and user-friendly software for taking decisions.

Page 12: Dataware housing

Evolution

• 80’s: Desktop data access and analysis tools
  • query tools, spreadsheets, GUIs
  • easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP (online analytical processing) engines and tools

Page 13: Dataware housing

Data Warehousing -- It is a process

• A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making possible decisions that were not previously possible

• A decision support database maintained separately from the organization’s operational database

Page 14: Dataware housing

Characteristics of Data Warehouse

• A data warehouse is a
  • subject-oriented
  • integrated
  • time-variant
  • non-volatile
  collection of data that is used primarily in organizational decision making.

Page 15: Dataware housing


Subject-Oriented

• A data warehouse is organized around the major subjects of the organization, such as customer, supplier, product and sales.

• A data warehouse provides a simple and concise view of a particular subject by excluding data that are not useful to the decision support process.

Page 16: Dataware housing

Integrated

• A data warehouse is constructed by integrating multiple sources of data such as relational databases, flat files and on-line transaction records.

• Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attributes, etc.

Page 17: Dataware housing

Time Variant

• Data warehouse maintains records of both historical and current data.

• So it can provide information in a historical perspective


Page 18: Dataware housing

Non Volatile

• Once the data warehouse is loaded with data, the stored data is not modified: data is loaded and read, but not updated in place.

Page 19: Dataware housing

Explorers, Farmers and Tourists

Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Farmers: Harvest information from known access paths

Tourists: Browse information harvested by farmers

Page 20: Dataware housing

Application-Orientation vs. Subject-Orientation

• Application-orientation (operational database): Loans, Credit Card, Trust, Savings
• Subject-orientation (data warehouse): Customer, Vendor, Product, Activity

Page 21: Dataware housing

Functioning of Data warehousing

[Diagram: Data Source -> Cleaning -> Transformation -> Data Warehouse (new data and updates)]

Page 22: Dataware housing

Data Collection

• Data warehousing collects data from various data sources such as relational databases, flat files and on-line records.

• The collected data are stored in a database inside the warehouse.

• The type of data collection used depends on the architecture of the warehouse.

Page 23: Dataware housing

Integration

• Each data source uses a different schema.

• The data warehouse gets data from different sources with different schemas and converts the data into a common integrated schema.

Page 24: Dataware housing

Star Schema

• A single fact table and for each dimension one dimension table

• Does not capture hierarchies directly

[Diagram: a central fact table (date, custno, prodno, cityname, ...) linked to four dimension tables: time, prod, cust, city]
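The star layout can be sketched with an in-memory SQLite database. This is only an illustration under assumptions: the table and key names follow the slide, while the extra columns (month, category, region, amount) and all sample rows are invented for the example.

```python
import sqlite3

# Table and key names follow the slide; other columns and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE prod (prodno INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE cust (custno INTEGER PRIMARY KEY, name TEXT);
-- In a star schema the city dimension keeps its region inline (denormalized).
CREATE TABLE city (cityname TEXT PRIMARY KEY, region TEXT);
CREATE TABLE fact (
    date TEXT REFERENCES time_dim(date),
    custno INTEGER REFERENCES cust(custno),
    prodno INTEGER REFERENCES prod(prodno),
    cityname TEXT REFERENCES city(cityname),
    amount REAL
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [("2015-03-01", "2015-03", 2015), ("2015-04-01", "2015-04", 2015)])
conn.executemany("INSERT INTO prod VALUES (?, ?, ?)",
                 [(1, "Cola", "Beverage"), (2, "Soap", "Toiletries")])
conn.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Anita"), (2, "Ravi")])
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("Chennai", "South"), ("Madurai", "South")])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)", [
    ("2015-03-01", 1, 1, "Chennai", 10.0),
    ("2015-03-01", 2, 2, "Madurai", 5.0),
    ("2015-04-01", 1, 1, "Chennai", 7.5),
    ("2015-04-01", 2, 1, "Madurai", 2.5),
])

# One join per dimension, then aggregate the measure held in the fact table.
sales_by_category_month = conn.execute("""
    SELECT p.category, t.month, SUM(f.amount)
    FROM fact f
    JOIN prod p ON f.prodno = p.prodno
    JOIN time_dim t ON f.date = t.date
    GROUP BY p.category, t.month
    ORDER BY p.category, t.month
""").fetchall()
print(sales_by_category_month)
```

Every query touches the fact table plus one table per dimension of interest, which is what keeps star-schema queries simple.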

Page 25: Dataware housing

Snowflake schema

• Represent dimensional hierarchy directly by normalizing tables.

• Easy to maintain and saves storage

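[Diagram: a central fact table (date, custno, prodno, cityname, ...) linked to dimension tables time, prod, cust and city, with city further linked to a separate region table]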
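A minimal sketch of the snowflake variant, again with SQLite and invented sample rows: the region attribute is moved out of the city table into its own table, so the city-to-region hierarchy is represented directly. The column names regionname, country and amount are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The hierarchy city -> region is normalized into two tables.
CREATE TABLE region (regionname TEXT PRIMARY KEY, country TEXT);
CREATE TABLE city (
    cityname TEXT PRIMARY KEY,
    regionname TEXT REFERENCES region(regionname)
);
CREATE TABLE fact (
    date TEXT, custno INTEGER, prodno INTEGER,
    cityname TEXT REFERENCES city(cityname),
    amount REAL
);
INSERT INTO region VALUES ('South', 'India'), ('North', 'India');
INSERT INTO city VALUES ('Chennai', 'South'), ('Madurai', 'South'), ('Delhi', 'North');
INSERT INTO fact VALUES
    ('2015-03-01', 1, 1, 'Chennai', 10.0),
    ('2015-03-01', 2, 2, 'Madurai', 5.0),
    ('2015-03-02', 3, 1, 'Delhi', 8.0),
    ('2015-03-02', 1, 2, 'Chennai', 2.0);
""")

# Summarizing by region now needs one extra join through the city table.
sales_by_region = conn.execute("""
    SELECT r.regionname, SUM(f.amount)
    FROM fact f
    JOIN city c ON f.cityname = c.cityname
    JOIN region r ON c.regionname = r.regionname
    GROUP BY r.regionname
    ORDER BY r.regionname
""").fetchall()
print(sales_by_region)
```

The region name is stored once per region rather than once per city, which is where the storage saving comes from; the price is the additional join.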

Page 26: Dataware housing

Data Warehouse for Decision Support & OLAP

• Putting information technology to help the knowledge worker make faster and better decisions
  • Which of my customers are most likely to go to the competition?
  • What product promotions have the biggest impact on revenue?
  • How did the share price of software companies correlate with profits over the last 10 years?

Page 27: Dataware housing

Decision Support

• Used to manage and control business

• Data is historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can be ad-hoc

• Used by managers and end-users to understand the business and make judgments

Page 28: Dataware housing

OLAP(Online analytical processing)

• A data warehouse stores data, but OLAP transforms the data warehouse data into specific, meaningful information.

• Therefore OLAP provides a user-friendly environment for interactive data analysis.

Page 29: Dataware housing

OLAP

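[Diagram: the user sends a request to the front-end tool, which passes it to the OLAP server; the OLAP server issues SQL against the data warehouse, receives a result set, and the result is returned to the user]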

Page 30: Dataware housing

OLAP OPERATION on the Multidimensional data

• Roll-up (group data to a higher level of aggregation)

• Drill-down (move to less aggregated, more detailed data)

• Slice and dice (pick a piece of the cube)

• Pivot (rotate)

Page 31: Dataware housing

TYPES OF OLAP

• MOLAP (MULTIDIMENSIONAL OLAP)

• ROLAP (RELATIONAL OLAP)

Page 32: Dataware housing

Multi-dimensional Data

• “Hey…I sold $100M worth of goods”

[Figure: a data cube of sales with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]

Dimensions: Product, Region, Time

Hierarchical summarization paths:
• Product: Industry > Category > Product
• Region: Country > Region > City > Office
• Time: Year > Quarter > Month or Week > Day

Page 33: Dataware housing

Data Warehouse Architecture

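[Diagram: source data (relational databases, legacy data, purchased data, ERP systems) passes through extraction and cleansing and an optimized loader into the data warehouse engine, which serves analysis and query tools; a metadata repository supports the process]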

Page 34: Dataware housing

Architecture of data warehousing

[Diagram: components of the data warehouse architecture: external data, data acquisition, data manager, warehouse data, data dictionary, information directory, middleware, design, management, data access]

Page 35: Dataware housing

Architecture of

35

Page 36: Dataware housing

Design Component

• The data warehouse designer designs the database of the data warehouse, and the warehouse administrator manages the data warehouse.

• The designer and administrator use the design component to design and store data.

Page 37: Dataware housing

Types of design

• Bottom-up design

Business value can be returned as quickly as the first data marts can be created

• Top-down design

Atomic data, that is, data at the lowest level of detail, are stored in the data warehouse.

• Hybrid design


Page 38: Dataware housing

Data Manager Component

• The database in the data warehouse uses the data manager component for managing and accessing the data stored in the data warehouse.

• RDBMS (relational DBMS)

• MDBMS (multidimensional DBMS)

Page 39: Dataware housing

Management Component

• Administering data acquisition operation

• Managing backup copies of the data

• Recovering the lost data

• Providing security to the data stored in the data warehouse.

• Authorizing access to the data stored in the data warehouse.


Page 40: Dataware housing

Data Acquisition Component

• This component acquires data from various sources by using the data acquisition applications

• The data acquisition applications are based on rules that are defined by the data warehouse developers.


Page 41: Dataware housing

Operations performed during data clean-up

• Restructuring the records and fields of the database tables.

• Removing irrelevant and redundant data.

• Obtaining and adding missing data.

• Verifying the integrity and consistency of the data.
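The four clean-up operations can be illustrated on a handful of records. This is a hedged sketch, not a production routine: the record layout, the `city_lookup` secondary source and the integrity rule (a customer id must be present and numeric) are all assumptions made for the example.

```python
# Hypothetical raw records as they might arrive from a source system.
raw_records = [
    {"custid": "101", "name": " Anita ", "city": "Chennai"},
    {"custid": "101", "name": "Anita", "city": "Chennai"},   # redundant copy
    {"custid": "102", "name": "Ravi", "city": None},          # missing city
    {"custid": "", "name": "Unknown", "city": "Madurai"},     # no key
]
city_lookup = {"102": "Madurai"}  # hypothetical second source for missing values


def clean(records, lookup):
    seen = set()
    cleaned = []
    for rec in records:
        custid = rec["custid"].strip()
        # Verify integrity: the key must be present and numeric.
        if not custid.isdigit():
            continue
        # Restructure: typed key, trimmed name; fill a missing city from the lookup.
        row = {
            "custid": int(custid),
            "name": rec["name"].strip(),
            "city": rec["city"] or lookup.get(custid),
        }
        # Remove redundant data: skip rows already seen after normalization.
        key = (row["custid"], row["name"], row["city"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned_records = clean(raw_records, city_lookup)
print(cleaned_records)
```

Duplicates are detected after normalization, so " Anita " and "Anita" count as the same customer.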


Page 42: Dataware housing

Operations performed on the data for enhancement

• Decoding and translating the values in fields.

• Summarizing data.

• Calculating derived values.
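A small sketch of the three enhancement operations. The status code table, the order rows and the derived `amount` field are assumptions chosen only to make the steps concrete.

```python
from collections import defaultdict

STATUS_CODES = {"A": "active", "C": "closed"}  # hypothetical code table

orders = [
    {"custid": 1, "status": "A", "qty": 2, "unit_price": 5.0},
    {"custid": 1, "status": "A", "qty": 1, "unit_price": 20.0},
    {"custid": 2, "status": "C", "qty": 3, "unit_price": 4.0},
]


def enhance(rows):
    enhanced_rows = []
    for row in rows:
        new_row = dict(row)
        # Decode: translate the stored code into a readable value.
        new_row["status"] = STATUS_CODES[row["status"]]
        # Derive: compute a value that is not stored in the source.
        new_row["amount"] = row["qty"] * row["unit_price"]
        enhanced_rows.append(new_row)
    return enhanced_rows


def summarize(rows):
    # Summarize: total amount per customer.
    totals = defaultdict(float)
    for row in rows:
        totals[row["custid"]] += row["amount"]
    return dict(totals)


enhanced = enhance(orders)
totals_by_customer = summarize(enhanced)
print(totals_by_customer)
```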


Page 43: Dataware housing

Information directory Component

• This component helps the end users to know the details of the data stored in the data warehouse.

• This is done with the help of metadata, that is, data about the data.
  • Technical data
  • Business data

Page 44: Dataware housing

Middleware Component

• This component connects to the local databases.

• An analytical server is used to analyze multidimensional data.

• Intelligent data warehousing middleware controls access to the warehouse database.

Page 45: Dataware housing

Data Mart

• A data mart is a database that contains the data needed by a small group of users for their own department’s needs.
• Dependent data mart
• Independent data mart

Page 46: Dataware housing

Difference between data warehouse and data mart

• Data warehouse: suitable to support an entire corporate environment.
  Data mart: useful for small organizations with very few departments.
• Data warehouse: supports the entire information requirement of an organization.
  Data mart: supports the information requirement of a department in an organization.
• Data warehouse: has a large data model, wider implementation, large data and a larger number of users.
  Data mart: has a small data model, shorter implementation, less data and fewer users.

If you listen to some vendors, you may be left thinking that building data warehouses is a waste of time. Data mart vendors that tell you this are looking out for their own best interests.

Page 47: Dataware housing

Advantages of data mart

• Since each department has its own data mart, the departments can summarize, sort, select, structure, etc. their own department’s data. This data will not be confused with that of any other department.

• The department can do whatever DSS processing it wants.

• The processing cost and storage are less than for the data warehouse.

• The department can select software for its data mart that is powerful enough to fit its needs.

Page 48: Dataware housing

Data warehousing life cycle

[Diagram: a cycle of life-cycle stages: Design, Prototype, Deploy, Operate, Enhance]

Page 49: Dataware housing

Data Modeling(Multi-dimensional Database)

• “Hey…I sold $100M worth of goods”

[Figure: a data cube of sales with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]

Dimensions: Product, Region, Period

Hierarchical summarization paths:
• Product: Industry > Category > Product
• Region: Country > Region > City > Office
• Period: Year > Quarter > Month or Week > Day

Page 50: Dataware housing

Building of data warehouse

• The builder must forecast the usage of the warehouse by the users.
• The design should support accessing data with any meaningful values of the attributes.
• To build a good data warehouse, the data acquisition process must follow the steps given below:
  • Extract the data from multiple heterogeneous sources.
  • Format the data for consistency within the warehouse.
  • Clean the data to ensure validity.
  • Convert the data from the relational, object-oriented or hierarchical model to a multidimensional model.
  • Load the data into the warehouse. Good monitoring tools are necessary to recover from an incorrect load.

Page 51: Dataware housing

Data warehouse and views

• A data warehouse is a permanent store of data in multidimensional tables.

• Views are created temporarily, when needed, from the data warehouse data.

• Views are used for decision support systems.

Page 52: Dataware housing

Difference between data warehouse and views

• Data warehouse: a permanent store of data.
  Views: created from warehouse data when needed and not permanent.
• Data warehouse: multidimensional.
  Views: relational.
• Data warehouse: can be indexed to maximize performance.
  Views: cannot be indexed.
• Data warehouse: provides specific support to a functionality.
  Views: cannot give specific support to a functionality.
• Data warehouse: provides a large amount of data.
  Views: created by extracting the minimum data from the data warehouse.

Page 53: Dataware housing

Data warehouse Future

• New techniques must be introduced in data cleaning, indexing and partitioning.

• The manual operations involved in data acquisition, management, data quality and performance maximization must be automated.

• Proper business rules must be developed and incorporated in the warehouse creation and maintenance process.

Page 54: Dataware housing

Data Mining

• Data mining is sorting through data to identify patterns and establish relationships.


Page 55: Dataware housing

Data Mining (cont.)


Page 56: Dataware housing

Data Mining works with Warehouse Data

• Data Warehousing provides the Enterprise with a memory

• Data Mining provides the Enterprise with intelligence

Page 57: Dataware housing

Data Mining Motivation

“The key in business is to know something that nobody else knows.”

— Aristotle Onassis

“To understand is to perceive patterns.”

— Sir Isaiah Berlin

PHOTO: LUCINDA DOUGLAS-MENZIES

PHOTO: HULTON-DEUTSCH COLL

Page 58: Dataware housing

Application Areas

Industry: Application
• Finance: Credit card analysis
• Insurance: Claims, fraud analysis
• Telecommunication: Call record analysis
• Consumer goods: Promotion analysis
• Data service providers: Value-added data
• Utilities: Power usage analysis

Page 59: Dataware housing

Data Mining in Use

• The US Government uses Data Mining to track fraud
• A Supermarket becomes an information broker
• Basketball teams use it to track game strategy
• Cross Selling
• Warranty Claims Routing
• Holding on to Good Customers
• Weeding out Bad Customers

Page 60: Dataware housing

What is data mining technology?

The process of extracting or finding hidden knowledge from a large database is called data mining.

Example: from the data item "age 21" we can infer that the person is an adult (a major).

[Diagram: data -> information]

Page 61: Dataware housing

Data Mining Technology

[Diagram: databases and flat files -> cleaning and integration -> data warehouse -> selection and transformation -> data mining -> patterns -> knowledge]

Page 62: Dataware housing

Steps in data mining technology

• Data cleaning: remove noise and inconsistent data
• Data integration: data from multiple sources are combined
• Data selection: relevant data are retrieved from the database for analysis
• Data transformation: the selected data are prepared for mining by performing aggregation operations
• Data mining: intelligent methods are applied to extract data patterns
• Pattern evaluation: identify the needed patterns
• Knowledge presentation: present the mined knowledge to the user
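The sequence of steps can be sketched end to end on toy data. As an assumption for illustration, the "intelligent method" used here is plain counting of item pairs that occur together in shopping baskets; the two sources, the basket contents and the `min_support` threshold are all invented.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources of shopping baskets; None marks a noisy entry.
source_a = [{"basket": ["milk", "cola", None]}, {"basket": ["milk", "soap"]}]
source_b = [
    {"basket": ["cola", "milk"]},
    {"basket": ["juice"]},
    {"basket": ["milk", "cola", "soap"]},
]


def mine_frequent_pairs(sources, min_support):
    # Cleaning and integration: drop missing items, combine all sources.
    baskets = [
        {item for item in rec["basket"] if item}
        for source in sources
        for rec in source
    ]
    # Selection: only baskets with at least two items can contain a pair.
    selected = [b for b in baskets if len(b) >= 2]
    # Transformation: a sorted tuple gives each pair one canonical form.
    transformed = [tuple(sorted(b)) for b in selected]
    # Mining: count how often each pair of items occurs together.
    counts = Counter(pair for b in transformed for pair in combinations(b, 2))
    # Pattern evaluation: keep only pairs that meet the support threshold.
    return {pair: n for pair, n in counts.items() if n >= min_support}


patterns = mine_frequent_pairs([source_a, source_b], min_support=2)

# Knowledge presentation: report the patterns to the user.
for pair, n in sorted(patterns.items()):
    print(f"{pair[0]} and {pair[1]} were bought together {n} times")
```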

Page 63: Dataware housing

Loading the Warehouse

Cleaning the data before it is loaded

Page 64: Dataware housing

Data Integration Across Sources

[Diagram: four source systems (Trust, Credit card, Savings, Loans) and the integration problems between them]

• Same data, different name
• Different data, same name
• Data found here, nowhere else
• Different keys, same data

Page 65: Dataware housing

Data Transformation Example

[Diagram: four applications feed the data warehouse, which must reconcile their conventions]

• Field names: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
• Units: appl A - pipeline - cm; appl B - pipeline - in; appl C - pipeline - feet; appl D - pipeline - yds
• Encoding: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female
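The three kinds of mismatch in this example (field name, unit, encoding) can be reconciled with simple lookup tables. The sketch below assumes the warehouse standard is `balance`, centimetres and `m`/`f`; it also assumes that 1 and x stand for male in applications B and C, which the slide does not state.

```python
# Source field that holds the balance in each application (from the slide).
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

# Conversion factor from each application's pipeline unit to centimetres.
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds

# Assumed meaning of each application's gender codes.
GENDER = {
    "A": {"m": "m", "f": "f"},
    "B": {1: "m", 0: "f"},
    "C": {"x": "m", "y": "f"},
    "D": {"male": "m", "female": "f"},
}


def to_warehouse(app, record):
    """Convert one source record into the assumed warehouse format."""
    return {
        "balance": record[BALANCE_FIELD[app]],
        "pipeline_cm": round(record["pipeline"] * TO_CM[app], 2),
        "gender": GENDER[app][record["gender"]],
    }


row_b = to_warehouse("B", {"bal": 250.0, "pipeline": 10, "gender": 1})
row_d = to_warehouse("D", {"balcurr": 90.0, "pipeline": 2, "gender": "female"})
print(row_b)
print(row_d)
```

Keeping the rules in tables rather than in branching code means a fifth application can be added by extending the three dictionaries.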

Page 66: Dataware housing

Structuring/Modeling Issues

Page 67: Dataware housing

Data Warehouse vs. Data Marts

Page 68: Dataware housing

From the Data Warehouse to Data Marts

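[Diagram: three levels on the way from data to information: the data warehouse (organizationally structured), departmentally structured data, and individually structured data; an axis from More to Less is labelled history, normalized, detailed]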

Page 69: Dataware housing

Data Warehouse and Data Marts

• Data mart: OLAP, lightly summarized, departmentally structured
• Data warehouse: organizationally structured, atomic, detailed data

Page 70: Dataware housing

Characteristics of the Departmental Data Mart

• OLAP
• Small
• Flexible
• Customized by Department
• Source is departmentally structured data warehouse

Page 71: Dataware housing

Techniques for Creating Departmental Data Mart

• OLAP

• Subset

• Summarized

• Superset

• Indexed

• Arrayed

[Diagram: departmental data marts for Sales, Mktg. and Finance]

Page 72: Dataware housing

Data Mart Centric

[Diagram: data sources feed the data marts, which in turn feed the data warehouse]

Page 73: Dataware housing

True Warehouse

[Diagram: data sources feed the data warehouse, which in turn feeds the data marts]

Page 74: Dataware housing

II. On-Line Analytical Processing (OLAP)

Making Decision Support Possible

Page 75: Dataware housing

What Is OLAP?

• Online Analytical Processing - coined by E. F. Codd in a 1993 paper contracted by Arbor Software
• Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
• OLAP = Multidimensional Database
• MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
• ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

Page 76: Dataware housing

The OLAP Market

• Rapid growth in the enterprise market
  • 1995: $700 Million
  • 1997: $2.1 Billion
• Significant consolidation activity among major DBMS vendors
  • 10/94: Sybase acquires ExpressWay
  • 7/95: Oracle acquires Express
  • 11/95: Informix acquires Metacube
  • 1/97: Arbor partners up with IBM
  • 10/96: Microsoft acquires Panorama
• Result: OLAP shifted from small vertical niche to mainstream DBMS category

Page 77: Dataware housing

Strengths of OLAP

• It is a powerful visualization paradigm

• It provides fast, interactive response times

• It is good for analyzing time series

• It can be useful to find some clusters and outliers

• Many vendors offer OLAP tools

Page 78: Dataware housing

OLAP Is FASMI

• Fast
• Analysis
• Shared
• Multidimensional
• Information

Page 79: Dataware housing

Data Cube Lattice

• Cube lattice
  • ABC
  • AB AC BC
  • A B C
  • none
• Can materialize some group-bys, compute others on demand
• Question: which group-bys to materialize?
• Question: what indices to create?
• Question: how to organize data (chunks, etc.)?
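The lattice is simply the set of all subsets of the dimensions, so a cube with n dimensions has 2^n possible group-bys. A short sketch that enumerates them level by level (the function name `cube_lattice` is an illustrative choice):

```python
from itertools import combinations


def cube_lattice(dimensions):
    """Return the group-bys of a cube, from the finest level down to 'none'."""
    levels = []
    for size in range(len(dimensions), -1, -1):
        level = [
            "".join(combo) if combo else "none"
            for combo in combinations(dimensions, size)
        ]
        levels.append(level)
    return levels


lattice = cube_lattice(["A", "B", "C"])
for level in lattice:
    print(" ".join(level))

group_by_count = sum(len(level) for level in lattice)
```

With three dimensions there are 8 group-bys; the count doubles with every added dimension, which is why the question of which ones to materialize matters.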


Page 80: Dataware housing

Visualizing Neighbors is simpler

[Figure: the same sales data shown two ways: as a flat table with columns Month, Store and Sales (rows Apr 1, Apr 2, ..., Apr 8, May 1, ...), and as a grid of stores 1 to 8 against the months Apr to Mar]

Page 81: Dataware housing

A Visual Operation: Pivot (Rotate)

[Figure: a data cube with axes Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF) and Date (3/1, 3/2, 3/3, 3/4), with sample cell values 10, 47, 30 and 12; pivoting rotates the cube to show a different pair of axes, such as Month against Region]

Page 82: Dataware housing

“Slicing and Dicing”

[Figure: a cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and Regions (India, Far East, Europe); the Telecomm slice is highlighted]
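Slicing and dicing can be sketched on a cube stored as a dictionary keyed by (product, sales channel, region). The dimension values come from the slide; the sales figures and the helper names `slice_cube` and `dice_cube` are assumptions.

```python
PRODUCT, CHANNEL, REGION = 0, 1, 2  # positions of the dimensions in each key

# Hypothetical sales figures for a few cells of the cube.
cube = {
    ("Telecomm", "Retail", "India"): 120,
    ("Telecomm", "Direct", "Europe"): 80,
    ("Video", "Retail", "India"): 60,
    ("Audio", "Special", "Far East"): 40,
    ("Household", "Retail", "Europe"): 30,
}


def slice_cube(cells, axis, value):
    """Slice: fix one dimension to a single value."""
    return {key: v for key, v in cells.items() if key[axis] == value}


def dice_cube(cells, allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {
        key: v
        for key, v in cells.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }


telecomm_slice = slice_cube(cube, PRODUCT, "Telecomm")
india_sub_cube = dice_cube(cube, {PRODUCT: {"Telecomm", "Video"}, REGION: {"India"}})
print(telecomm_slice)
print(india_sub_cube)
```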

Page 83: Dataware housing

Roll-up and Drill Down

• Sales Channel
• Region
• Country
• State
• Location Address
• Sales Representative

Roll up: move toward a higher level of aggregation.
Drill down: move toward low-level details.
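Roll-up and drill-down are the same aggregation run at different levels of a hierarchy. The sketch below uses three of the levels listed above (region, country, state); the sales rows and the function name `aggregate` are invented for illustration.

```python
from collections import defaultdict

LEVELS = ["region", "country", "state"]  # coarsest to finest

sales = [
    {"region": "Asia", "country": "India", "state": "Tamil Nadu", "amount": 50},
    {"region": "Asia", "country": "India", "state": "Kerala", "amount": 30},
    {"region": "Asia", "country": "Japan", "state": "Tokyo", "amount": 20},
    {"region": "Europe", "country": "France", "state": "Ile-de-France", "amount": 40},
]


def aggregate(rows, level):
    """Total the amount for every group at the given hierarchy level."""
    keys = LEVELS[: LEVELS.index(level) + 1]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)


by_country = aggregate(sales, "country")
by_region = aggregate(sales, "region")   # roll up from country to region
by_state = aggregate(sales, "state")     # drill down from country to state
print(by_region)
```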

Page 84: Dataware housing

Nature of OLAP Analysis

• Aggregation -- (total sales, percent-to-total)

• Comparison -- Budget vs. Expenses

• Ranking -- Top 10, quartile analysis

• Access to detailed and aggregate data

• Complex criteria specification

• Visualization

Page 85: Dataware housing

Organizationally Structured Data

• Different Departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data

[Diagram: four departments (marketing, manufacturing, sales, finance) each viewing the same detailed data]

Page 86: Dataware housing

Multidimensional Spreadsheets

• Analysts need spreadsheets that support
  • pivot tables (cross-tabs)
  • drill-down and roll-up
  • slice and dice
  • sort
  • selections
  • derived attributes
• Popular in retail domain

Page 87: Dataware housing

OLAP Operations

[Figure (© Prentice Hall): OLAP operations on a cube: single cell, multiple cells, slice, dice, roll up, drill down]

Page 88: Dataware housing

Relational OLAP: 3 Tier DSS

• Database layer (data warehouse): store atomic data in an industry-standard RDBMS.
• Application logic layer (ROLAP engine): generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
• Presentation layer (decision support client): obtain multi-dimensional reports from the DSS client.
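The middle tier's job, turning a multidimensional request into SQL for the relational store, can be sketched as follows. This is a simplified illustration rather than how any particular ROLAP product works: the `sales` table, its columns, the sample rows and the `build_rollup_sql` helper are all assumptions, and SQLite stands in for the RDBMS.

```python
import sqlite3

ALLOWED_DIMENSIONS = {"product", "region", "month"}


def build_rollup_sql(dimensions):
    """Generate a GROUP BY query for the requested dimensions."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    for dim in dimensions:
        # Only known column names are ever placed into the SQL text.
        if dim not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"


# Database layer: atomic data in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Cola", "North", "Apr", 10),
    ("Cola", "South", "Apr", 5),
    ("Milk", "North", "May", 7),
    ("Cola", "North", "May", 3),
])

# Application logic layer: translate the request, then run it.
sql = build_rollup_sql(["product"])
report = conn.execute(sql).fetchall()

# Presentation layer: show the report to the user.
print(sql)
print(report)
```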

Page 89: Dataware housing

MD-OLAP: 2 Tier DSS

• Database layer and application logic layer (MDDB engine): store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
• Presentation layer (decision support client): obtain multi-dimensional reports from the DSS client.

Page 90: Dataware housing

MSPVL Polytechnic College, Pavoorchatram