data warehousing

Post on 20-Jun-2015

Data Warehousing & Data Mining

By Mandar Kulkarni, PRN 10030141129

MBA-IT, SICSR

Contents

• Data warehousing
• Understanding data warehousing
• Data warehouse architecture
• Data Mining
• Data mining techniques

Warehouse?

Real time example?

Data Warehousing

[Figure: Samsung branch databases in Mumbai, Delhi, Chennai and Bangalore, and a Sales Manager who needs the sales per item type per branch for the first quarter.]

• Now the sales manager wants to know the sales for the first quarter.

• Solution: extract the information from each database, store it in a single place, and process it using operational systems.

Solution

[Figure: the Mumbai, Delhi, Chennai and Bangalore databases, a Data Warehouse, Query & Analysis tools, and the Report for the Sales Manager.]
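The consolidation idea can be sketched in a few lines. This is only an illustration under stated assumptions: the branch rows, the `sales` table layout, and the function names `build_warehouse` and `first_quarter_report` are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical per-branch operational data: (item_type, sale_date, amount).
BRANCH_SALES = {
    "Mumbai": [("TV", "2015-01-15", 500.0), ("Phone", "2015-02-10", 300.0)],
    "Delhi": [("TV", "2015-03-05", 450.0), ("TV", "2015-04-02", 400.0)],
    "Chennai": [("Phone", "2015-01-20", 250.0)],
    "Bangalore": [("Phone", "2015-02-28", 350.0), ("Phone", "2015-03-30", 150.0)],
}


def build_warehouse(branch_sales):
    """Copy every branch's rows into one in-memory table, tagged with the branch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (branch TEXT, item_type TEXT, sale_date TEXT, amount REAL)"
    )
    for branch, rows in branch_sales.items():
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?, ?, ?)",
            [(branch, item, day, amount) for item, day, amount in rows],
        )
    return conn


def first_quarter_report(conn, year):
    """Sales per item type per branch for January through March of `year`."""
    return conn.execute(
        """
        SELECT branch, item_type, SUM(amount)
        FROM sales
        WHERE sale_date >= ? AND sale_date < ?
        GROUP BY branch, item_type
        ORDER BY branch, item_type
        """,
        (f"{year}-01-01", f"{year}-04-01"),
    ).fetchall()


warehouse = build_warehouse(BRANCH_SALES)
report = first_quarter_report(warehouse, 2015)
for branch, item_type, total in report:
    print(f"{branch:10} {item_type:6} {total:8.2f}")
```

Once the rows sit in one place, the sales manager's question becomes a single query instead of four separate lookups.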

Operational Systems

• Running the business in real time
• Routine tasks
• Decision Support Systems (DSS)
– Help in taking actions

• Used by people who deal with customers, products

• They are increasingly used by customers

Data Warehouse

• A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

• A process of transforming data into information and making it available to users in a timely enough manner to make a difference

Definition

• Integrated, Subject-Oriented, Time-Variant, Nonvolatile database that provides support for decision making

Data warehouse architecture

[Figure: architecture diagram with the blocks Source Data (External, Production, Internal, Archived), Data Staging, Metadata, Data Warehouse DBMS, Data Marts, MDDB, Information Delivery (OLAP, Report/Query, Data Mining), and Management & Control.]

Components

• Source Data
• Data Staging (data extraction, cleaning and loading)
– Example: Talend, an open source ETL tool
• Data Storage
• Information Delivery (EIS)
• Management and Control
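The data staging component (extraction, cleaning, loading) might look roughly like the sketch below. The raw rows, the `BRANCH_ALIASES` table and the rejection rules are assumptions made for illustration; a real ETL tool would apply far richer rules.

```python
# Hypothetical rows as extracted from source systems, still untidy.
RAW_ROWS = [
    {"branch": " mumbai ", "item_type": "TV", "amount": "500"},
    {"branch": "Banglore", "item_type": "phone", "amount": "350.5"},
    {"branch": "Delhi", "item_type": "TV", "amount": "n/a"},  # unparseable amount
    {"branch": "", "item_type": "Phone", "amount": "120"},    # missing branch
]

# Hypothetical spelling fixes applied during cleaning.
BRANCH_ALIASES = {"Banglore": "Bangalore"}


def clean(row):
    """Return a normalized copy of `row`, or None if it cannot be repaired."""
    branch = row["branch"].strip().title()
    branch = BRANCH_ALIASES.get(branch, branch)
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    if not branch:
        return None
    return {
        "branch": branch,
        "item_type": row["item_type"].strip().upper(),
        "amount": amount,
    }


def stage(raw_rows):
    """Split extracted rows into rows ready to load and rejected rows."""
    loaded, rejected = [], []
    for row in raw_rows:
        cleaned = clean(row)
        if cleaned is None:
            rejected.append(row)
        else:
            loaded.append(cleaned)
    return loaded, rejected


loaded, rejected = stage(RAW_ROWS)
print(len(loaded), "rows ready to load,", len(rejected), "rejected")
```

Keeping the rejected rows, rather than silently dropping them, lets someone inspect and fix the source data later.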

OLAP

• Online Analytical Processing tools
• DSS tools that use multidimensional data analysis techniques
– Support for a DSS data store
– Data extraction and integration filter
– Specialized presentation interface
• Oracle OLAP 11g
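Multidimensional analysis can be illustrated with two common cube operations, roll-up and slice. The sketch below is not how any OLAP product works internally; the fact rows, the three dimensions and the helper names `roll_up` and `slice_cube` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (branch, item_type, quarter, amount).
FACTS = [
    ("Mumbai", "TV", "Q1", 500.0),
    ("Mumbai", "Phone", "Q1", 300.0),
    ("Delhi", "TV", "Q1", 450.0),
    ("Delhi", "TV", "Q2", 400.0),
    ("Chennai", "Phone", "Q2", 250.0),
]
DIMENSIONS = ("branch", "item_type", "quarter")


def roll_up(facts, keep):
    """Aggregate the amount over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the facts where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in facts if row[position] == value]


by_branch = roll_up(FACTS, ["branch"])
q1_by_item = roll_up(slice_cube(FACTS, "quarter", "Q1"), ["item_type"])
print(by_branch)
print(q1_by_item)
```

The same facts answer different questions depending on which dimensions are kept, which is the point of storing data multidimensionally.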

Multidimensional analysis

OLAP architecture

12 Rules of Data Warehouse

1. Data Warehouse and Operational Environments are separated
2. Data is integrated
3. Contains historical data over a long period of time
4. Data is snapshot data captured at a given point in time
5. Data is subject-oriented
6. Mainly read-only with periodic batch updates
7. Development life cycle has a data-driven approach versus the traditional process-driven approach
8. Data contains several levels of detail: Current, Old, Lightly Summarized, Highly Summarized
9. Environment is characterized by read-only transactions to very large data sets
10. System that traces data sources, transformations, and storage
11. Metadata is a critical component
– Source, transformation, integration, storage, relationships, history, etc.
12. Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users
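Rules 3, 4 and 6 (historical data, point-in-time snapshots, read-only with periodic batch updates) can be pictured as an append-only store. The class below is a minimal sketch under that reading; `SnapshotStore`, its methods and the stock figures are hypothetical names and data, not part of any warehouse product.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: each batch load adds a dated snapshot, nothing is updated."""

    def __init__(self):
        self._snapshots = {}  # load date -> tuple of rows

    def load_batch(self, load_date, rows):
        """Add one snapshot; loading the same date twice is refused."""
        if load_date in self._snapshots:
            raise ValueError(f"snapshot for {load_date} already loaded")
        # Store copies so later changes to the caller's rows do not leak in.
        self._snapshots[load_date] = tuple(dict(row) for row in rows)

    def as_of(self, when):
        """Return the latest snapshot loaded on or before `when`, or None."""
        dates = [d for d in self._snapshots if d <= when]
        return self._snapshots[max(dates)] if dates else None


store = SnapshotStore()
store.load_batch(date(2015, 3, 31), [{"branch": "Mumbai", "stock": 40}])
store.load_batch(date(2015, 6, 30), [{"branch": "Mumbai", "stock": 25}])

april_view = store.as_of(date(2015, 4, 15))
july_view = store.as_of(date(2015, 7, 1))
print(april_view, july_view)
```

Because old snapshots are never overwritten, a question about April still returns the March figures after the June load.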

OLTP v/s Data warehousing

OLTP
• Application oriented
• Used to run the business
• Detailed data
• Current, up to date
• Isolated data
• Repetitive access
• Performance sensitive
• Few records accessed
• Read/update access

Data Warehousing
• Subject oriented
• Used to analyze the business
• Summarized and refined
• Snapshot data
• Integrated data
• Ad hoc access
• Performance relaxed
• Large volume accessed at a time
• Mostly read
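The difference in access pattern can be shown with two statements against a toy table. This is a simplified sketch: the `orders` table and its rows are hypothetical, and both statements run against one in-memory SQLite table only for brevity, whereas the two workloads would normally live in separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, branch TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (branch, amount, status) VALUES (?, ?, ?)",
    [("Mumbai", 500.0, "open"), ("Mumbai", 300.0, "open"), ("Delhi", 450.0, "open")],
)

# OLTP-style access: update one record, identified by its key.
cursor = conn.execute(
    "UPDATE orders SET status = 'shipped' WHERE order_id = ?", (2,)
)
rows_touched = cursor.rowcount

# Warehouse-style access: a read-only scan that summarizes many records.
summary = conn.execute(
    "SELECT branch, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY branch ORDER BY branch"
).fetchall()

print(rows_touched, summary)
```

The first statement touches a single row and changes it; the second reads every row and changes nothing.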

Data Warehouse summary

• Integrated platform for OLAP and DSS

• Helps optimize business operations

• Easy access to multidimensional data

Data Mining

Why Data Mining?

Strategic decision making

Wealth generation

Analyzing trends

Security

Data Mining

• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data

• No Query…

• …But an “Interestingness criteria”

Data Mining

Data + Interestingness criteria = Hidden patterns

Type of data + Type of interestingness criteria = Type of patterns

Type of Data

• Tabular (Ex: Transaction data)
– Relational
– Multi-dimensional

• Tree (Ex: XML data)

• Graphs

• Sequence (Ex: DNA, activity logs)

• Text, Multimedia …

Type of Interestingness

• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
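Two of these criteria, frequency and rarity, are easy to make concrete. The sketch below assumes that "interesting" means an event's share of all entries crosses a chosen threshold; the event log, the thresholds and the function name `by_interestingness` are hypothetical.

```python
from collections import Counter

# Hypothetical activity log: one event name per entry.
EVENTS = ["login", "view", "view", "login", "view", "refund", "view", "login"]


def by_interestingness(events, min_share=None, max_share=None):
    """Return events whose share of all entries falls within the given bounds."""
    counts = Counter(events)
    total = len(events)
    selected = {}
    for event, count in counts.items():
        share = count / total
        if min_share is not None and share < min_share:
            continue
        if max_share is not None and share > max_share:
            continue
        selected[event] = share
    return selected


frequent = by_interestingness(EVENTS, min_share=0.3)  # frequency criterion
rare = by_interestingness(EVENTS, max_share=0.2)      # rarity criterion
print(frequent)
print(rare)
```

The data stays the same in both calls; only the criterion changes, and with it the pattern that comes out.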

Data Mining vs Statistical Inference

Statistics: Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)

Data mining: Data → Mining Algorithm Based on Interestingness → Pattern (model, rule, hypothesis) discovery

Used for

• Data mining is used for
– Frequent item-sets
– Associations
– Classifications
– Clustering

Techniques

• Algorithms
– Apriori algorithm
– Decision tree
  • SLIQ (Supervised Learning in QUEST), IBM
• “GROUP BY”
mysql> select sum(sal), deptno from emp group by deptno;
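To give a feel for the Apriori algorithm, here is a compact sketch of its level-wise search for frequent item-sets. It is a teaching version under simple assumptions: support is an absolute count, the transactions are hypothetical, and no attempt is made at the optimizations a production implementation would need.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def apriori(transactions, min_support):
    """Return {frozenset(items): support count} for every frequent item-set."""

    def count(candidates):
        # Support count: number of transactions containing the candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    singles = count({frozenset([item]) for item in items})
    current = {c: n for c, n in singles.items() if n >= min_support}
    frequent = dict(current)
    size = 2
    while current:
        # Join step: merge frequent (size-1)-item-sets into size-item-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        size += 1
    return frequent


frequent_itemsets = apriori(TRANSACTIONS, min_support=3)
for itemset, support in sorted(
    frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(itemset), support)
```

The prune step is what keeps the search manageable: an item-set is only counted if all of its smaller subsets were already found to be frequent.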

Data Mining Summary

• Helps in analyzing patterns and thus in taking action, both in real time and for the future.

• Analyzing trends and clusters in business operations.

References

• http://www.datawarehousing.com/
• http://www.dw-institute.com/
• http://www.almaden.ibm.com/cs/quest/index.html

Thank you

Any Questions?
