Northern Region IT Professional Development Program 2010 Data Warehousing (DAY 1) Siwawong W. Project Manager 2010.05.24

Upload: siwawong-wuttipongprasert

Post on 20-May-2015

Page 1: IT Ready - DW: 1st Day

Northern Region IT Professional Development Program 2010

Data Warehousing (DAY 1)

Siwawong W., Project Manager

2010.05.24

Page 2: IT Ready - DW: 1st Day

Agenda

09:00 – 09:15 Registration

09:15 – 09:30 Self-Introduction

09:30 – 10:30 Data Warehouse: Introduction

10:30 – 10:45 Break & Morning Refreshment

10:45 – 12:00 Data Warehouse: Introduction (Cont'd)

12:00 – 13:00 Lunch Break

13:00 – 15:00 Review of RDBMS & SQL Commands

15:00 – 15:15 Break

15:15 – 16:00 Case Study ~ Q/A

Page 3: IT Ready - DW: 1st Day

SELF-INTRODUCTION

Page 4: IT Ready - DW: 1st Day

About Me

• My Name: Siwawong Wuttipongprasert
– Nickname: Tae (you can call me this; it's easier)

• My Background:
– B.Eng. (Computer Engineering), Chiang Mai University

• My Career Profile:
– 10+ years in the IT business
– 5+ years with Blue Ball Co., Ltd.
– Roles: Programmer, System Analyst, Consultant & Project Manager
– Working areas: ERP, MRP, retailing, banking, finance, e-commerce, etc.
– Working with multiple cultures: Japanese, German and Vietnamese

• Know Me More..

Page 5: IT Ready - DW: 1st Day

My Company: Blue Ball

Blue Ball Group is an offshoring company that focuses entirely on customer satisfaction. It combines Western management with Asian human resources to provide high-quality services.

Thailand (Head Office)

Mexico (Special Developments)

Vietnam (Offshoring Center)

Page 6: IT Ready - DW: 1st Day

Services from My Company

Offshoring Programmers & Testers: Blue Ball will get you ready to offshore successfully. We will not rush you into offshoring before you feel confident about how to send, organize, receive, test and accept jobs.

System Development & Embedded Solutions: Solutions that combine technological expertise and deep business understanding. We only start coding once every detail, such as milestones, scheduling, contact points, communication, issue management and critical protocols, is in place.

Web Design and E-commerce: Premium web design, CMS, e-commerce solutions and SEO services. Website maintenance and copy content creation to develop marketing campaigns that SELL, for discerning companies that want to increase the quality and reach of their marketing campaigns.

Page 7: IT Ready - DW: 1st Day

My Clients

Page 8: IT Ready - DW: 1st Day

Data Warehouse: Introduction

Page 9: IT Ready - DW: 1st Day

Data Warehouse: Introduction

• Data Warehousing, OLAP and data mining: – what and why (now)?

• Relation to OLTP
• Review of RDBMS & SQL commands
• A case study

Page 10: IT Ready - DW: 1st Day

Data Warehouse: What & Why?

Problem Statements

Page 11: IT Ready - DW: 1st Day

A producer wants to know….

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?

Page 12: IT Ready - DW: 1st Day

Data, Data everywhere, yet ...

• I can't find the data I need
– data is scattered over the network
– many versions, subtle differences

• I can't get the data I need
– need an expert to get the data

• I can't understand the data I found
– available data is poorly documented

• I can't use the data I found
– results are unexpected
– data needs to be transformed from one form to another

Page 13: IT Ready - DW: 1st Day

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

Page 14: IT Ready - DW: 1st Day

What are the users saying...

• Data should be integrated across the enterprise

• Summary data has a real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required

Page 15: IT Ready - DW: 1st Day

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

Data

Information

Page 16: IT Ready - DW: 1st Day

Evolution

• 60's: Batch reports
– hard to find and analyze information
– inflexible and expensive; every new request requires reprogramming

• 70's: Terminal-based DSS and EIS (executive information systems)
– still inflexible, not integrated with desktop tools

• 80's: Desktop data access and analysis tools
– query tools, spreadsheets, GUIs
– easier to use, but only access operational databases

• 90's: Data warehousing with integrated OLAP engines and tools

Page 17: IT Ready - DW: 1st Day

Very Large Data Bases

Terabytes -- 10^12 bytes:

Petabytes -- 10^15 bytes:

Exabytes -- 10^18 bytes:

Zettabytes -- 10^21 bytes:

Yottabytes -- 10^24 bytes:

Walmart -- 24 Terabytes

Intelligence Agency Videos

Geographic Information Systems

National Medical Records

Weather images

Page 18: IT Ready - DW: 1st Day

Data Warehousing -- It is a process

• A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible

• A decision support database maintained separately from the organization’s operational database

Page 19: IT Ready - DW: 1st Day

Data Warehouse

• A data warehouse is a
– subject-oriented: organized based on use
– integrated: inconsistencies removed
– time-varying: data are normally time-series
– non-volatile: stored in read-only format
collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse, 1996

Page 20: IT Ready - DW: 1st Day

Data Warehouse: Subject-Oriented

The warehouse (WH) is organized around the major subjects of the enterprise rather than the major application areas.

This is reflected in the need to store decision-support data rather than application-oriented data.

[Diagram: the operational DB is application-oriented (e.g., Order Processing); the data warehouse is subject-oriented (e.g., Sales).]

Page 21: IT Ready - DW: 1st Day

Data Warehouse: Integrated

The source data come together from different enterprise-wide application systems and are often inconsistent.

The integrated data source must therefore be made consistent to present a unified view of the data to the users.

Page 22: IT Ready - DW: 1st Day

Data Warehouse: Time-Varying

The source data in the WH is only accurate and valid at some point in time or over some time interval.

The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots

Historical data is recorded

Page 23: IT Ready - DW: 1st Day

Data Warehouse: Non-volatile

Data is NOT updated in real time but is refreshed from the operational systems (OS) on a regular basis. New data is always added as a supplement to the DB rather than as a replacement. The DB continually absorbs this new data, incrementally integrating it with previous data.

Anyone who is using the database has confidence that a query will always produce the same result no matter how often it is run

Page 24: IT Ready - DW: 1st Day

Explorers, Farmers and Tourists

Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Farmers: Harvest information from known access paths

Tourists: Browse information harvested by farmers

Page 25: IT Ready - DW: 1st Day

Data Warehouse Architecture

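[Diagram: source data (relational databases, legacy data, purchased data, ERP systems) passes through extraction and cleansing and an optimized loader into the data warehouse engine, which is supported by a metadata repository and serves analysis and query tools.]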

Page 26: IT Ready - DW: 1st Day

OLAP & Data Mining

Page 27: IT Ready - DW: 1st Day

Data Warehouse for DS & OLAP

• Using information technology to help the knowledge worker make faster and better decisions
– Which of my customers are most likely to go to the competition?
– What product promotions have the biggest impact on revenue?
– How did the share price of software companies correlate with profits over the last 10 years?

Page 28: IT Ready - DW: 1st Day

Decision Support (DS)

• Used to manage and control business

• Data is historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can be ad-hoc

• Used by managers and end-users to understand the business and make judgements

Page 29: IT Ready - DW: 1st Day

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

Page 30: IT Ready - DW: 1st Day

We want to know ...

• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

• If I raise the price of my product by Rs. 2, what is the effect on my ROI?

• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

• Which of my customers are likely to be the most loyal?

Data Mining helps extract such information

Page 31: IT Ready - DW: 1st Day

Application Areas

Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value-added data
Utilities                Power usage analysis

Page 32: IT Ready - DW: 1st Day

Data Mining in Use

• The US Government uses Data Mining to track fraud
• A Supermarket becomes an information broker
• Basketball teams use it to track game strategy
• Cross Selling
• Warranty Claims Routing
• Holding on to Good Customers
• Weeding out Bad Customers

Page 33: IT Ready - DW: 1st Day

What makes data mining possible?

• Advances in the following areas are making data mining deployable:
– data warehousing
– better and more data (i.e., operational, behavioral, and demographic)
– the emergence of easily deployed data mining tools
– the advent of new data mining techniques

-- Gartner Group

Page 34: IT Ready - DW: 1st Day

Why Separate Data Warehouse?

• Performance
– Operational DBs are designed and tuned for known transactions and workloads.
– Complex OLAP queries would degrade performance for operational transactions.
– Special data organization, access and implementation methods are needed for multidimensional views and queries.

• Function
– Missing data: decision support requires historical data, which operational DBs do not typically maintain.
– Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational DBs, external sources.
– Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
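The data-quality point can be illustrated with a minimal sketch of reconciling one attribute that two source systems encode differently. The source names (`crm_rows`, `billing_rows`), the code mapping and the helper `reconcile_gender` are all hypothetical and only serve the illustration.

```python
# Hypothetical mapping from source-specific codes to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}


def reconcile_gender(raw):
    """Map a source-specific gender code to 'M', 'F', or None if unknown."""
    if raw is None:
        return None
    return GENDER_CODES.get(str(raw).strip().lower())


# Two assumed source systems with inconsistent representations.
crm_rows = [("C001", "male"), ("C002", "F")]
billing_rows = [("C003", 1), ("C004", 0)]

# The warehouse load applies the same reconciliation to every source.
warehouse_rows = [(cid, reconcile_gender(g)) for cid, g in crm_rows + billing_rows]
print(warehouse_rows)
```

In a real load this kind of mapping would usually live in the extraction and cleansing step, before the data reaches the warehouse tables.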

Page 35: IT Ready - DW: 1st Day

What’s OLTP?

• DBMS built for OnLine Transaction Processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind

• Example: OLTP systems are designed to maximize transaction processing capacity, while data warehouses are designed to support ad hoc query processing

Page 36: IT Ready - DW: 1st Day

What are Operational Systems?

• They are OLTP systems

• Run mission critical applications

• Need to work with stringent performance requirements for routine tasks

• Used to run a business!

Page 37: IT Ready - DW: 1st Day

RDBMS used for OLTP

• Database Systems have been used traditionally for OLTP

– clerical data processing tasks
– detailed, up-to-date data
– structured, repetitive tasks
– read/update a few records
– isolation, recovery and integrity are critical

Page 38: IT Ready - DW: 1st Day

Operational Systems

• Run the business in real time

• Based on up-to-the-second data

• Optimized to handle large numbers of simple read/write transactions

• Optimized for fast response to predefined transactions

• Used by people who deal with customers, products -- clerks, salespeople etc.

• They are increasingly used by customers

Page 39: IT Ready - DW: 1st Day

Examples of Operational Data

Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy applications, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale Data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call Record | Telecomm. | Billing | Legacy applications, hierarchical databases, mainframe | Very large
Production Record | Mfg. | Control production | ERP, RDBMS, AS/400 | Medium

Page 40: IT Ready - DW: 1st Day

Related to OLTP

Page 41: IT Ready - DW: 1st Day

Application-Orientation vs. Subject-Orientation

Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings

Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity

Page 42: IT Ready - DW: 1st Day

OLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse

• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)

– e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
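The example query above can be sketched against a small in-memory table. This is only an illustration: the table `calls`, its columns and the sample rows are assumptions, and SQLite (via Python's `sqlite3` module) stands in for a real warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: one row per phone call.
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2009-12-01 09:30:00", 10.0),
        ("Pune", "2009-12-15 16:45:00", 30.0),
        ("Pune", "2009-12-15 20:00:00", 99.0),    # outside 9AM-5PM
        ("Pune", "2009-11-30 10:00:00", 50.0),    # not December
        ("Mumbai", "2009-12-02 11:00:00", 70.0),  # other city
    ],
)

# Three dimensions are restricted at once: place, month and time of day.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H', started_at) >= '09'
      AND strftime('%H', started_at) < '17'
    """
).fetchone()[0]
print(avg_amount)  # 20.0
```

On a table with millions of rows, a query like this scans and aggregates large volumes of data, which is the kind of workload OLTP tuning does not anticipate.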

Page 43: IT Ready - DW: 1st Day

OLTP vs. Data Warehouse

OLTP                      Warehouse (DSS)
Application oriented      Subject oriented
Used to run business      Used to analyze business
Detailed data             Summarized and refined
Current, up to date       Snapshot data
Isolated data             Integrated data
Repetitive access         Ad-hoc access
Clerical user             Knowledge user (manager)

Page 44: IT Ready - DW: 1st Day

OLTP vs. Data Warehouse

OLTP                                     Data Warehouse
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read/update access                       Mostly read (batch update)
No data redundancy                       Redundancy present
Database size 100 MB - 100 GB            Database size 100 GB - few terabytes

Page 45: IT Ready - DW: 1st Day

OLTP vs. Data Warehouse

OLTP                                               Data Warehouse
Transaction throughput is the performance metric   Query throughput is the performance metric
Thousands of users                                 Hundreds of users
Managed in entirety                                Managed by subsets

Page 46: IT Ready - DW: 1st Day

To summarize ...

OLTP Systems are used to “run” a business

The Data Warehouse helps to “optimize” the business

Page 47: IT Ready - DW: 1st Day

Why Now?

• Data is being produced
• ERP provides clean data
• The computing power is available
• The computing power is affordable
• The competitive pressures are strong
• Commercial products are available

Page 48: IT Ready - DW: 1st Day

Myths surrounding OLAP Servers and Data Marts

• Data marts and OLAP servers are departmental solutions supporting a handful of users

• Million-dollar massively parallel hardware is needed to deliver fast response times for complex queries

• OLAP servers require massive and unwieldy indices

• Complex OLAP queries clog the network with data

• Data warehouses must be at least 100 GB to be effective

» Source -- Arbor Software Home Page

Page 49: IT Ready - DW: 1st Day

Advantages Of Data Warehouse

• High query performance

– But not necessarily most current information

• Doesn’t interfere with local processing at sources

– Complex queries at warehouse

– OLTP at information sources

• Information copied at warehouse

– Can modify, annotate, summarize,  restructure, etc.

– Can store historical information

– Security

Page 50: IT Ready - DW: 1st Day

Data Warehouse: Pain Beware

• Data Integration

• Data Quality

• Data Availability

• End user education

• Proper sizing – HW and Database environment

• Lack of off-the-shelf product (mature Packaged Analytics)

• Post-implementation challenges
– Ensure usage
– Identify new areas of intelligence
– Measure business benefits: productivity enhancements, savings, increase in revenue, etc.

Page 51: IT Ready - DW: 1st Day

Review RDBMS & SQL statement

Page 52: IT Ready - DW: 1st Day

Relational DBMS: Properties

• Each relation (or table) in a database has a unique name

• An entry at the intersection of each row and column is atomic (or single-valued)
– there can be no multi-valued attributes in a relation

• Each row is unique
– no two rows in a relation are identical

• Each attribute (or column) within a table has a unique name

• The sequence of columns (left to right) is insignificant
– the columns of a relation can be interchanged without changing the meaning or use of the relation

• The sequence of rows (top to bottom) is insignificant
– rows of a relation may be interchanged or stored in any sequence

Page 53: IT Ready - DW: 1st Day

The Relational Model...

• The relational model of data has three major components:

– Relational database objects allow you to define data structures

– Relational operators allow manipulation of stored data

– Relational integrity constraints allow you to define business rules and ensure data integrity

Page 54: IT Ready - DW: 1st Day

The Relational Objects

• Location
– Most RDBMSs can have multiple locations, all managed by the same database engine

[Diagram: a corporate database with the locations Accounting (Accounts Receivable, Accounts Payable), Purchasing, and Marketing (Sales, Advertising).]

Page 55: IT Ready - DW: 1st Day

The Relational Objects

• Location

[Diagram: multiple users run client applications that connect to one database server (DB).]

Page 56: IT Ready - DW: 1st Day

The Relational Objects...

• Database
– A set of SQL objects

[Diagram: a client application sends UPDATE T SET ..., INSERT INTO T ..., DELETE FROM T ... and CALL STPROG to the database server. The server holds Table A, Table B and Table T, a stored procedure (BEGIN ...), and update, insert and delete triggers (each BEGIN ...).]
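As a rough sketch of how triggers behave as database objects, the following uses SQLite through Python's `sqlite3` module. The trigger names, the `audit_log` table and the assumption that all three triggers are attached to table `T` are illustrative choices, not taken from the slide. SQLite has no stored procedures, so the `CALL STPROG` part of the diagram is not reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE T (id INTEGER PRIMARY KEY, qty INTEGER);
    CREATE TABLE audit_log (action TEXT, row_id INTEGER);

    -- One trigger per statement type; each body runs inside the server.
    CREATE TRIGGER t_insert AFTER INSERT ON T
    BEGIN
        INSERT INTO audit_log VALUES ('INSERT', NEW.id);
    END;

    CREATE TRIGGER t_update AFTER UPDATE ON T
    BEGIN
        INSERT INTO audit_log VALUES ('UPDATE', NEW.id);
    END;

    CREATE TRIGGER t_delete AFTER DELETE ON T
    BEGIN
        INSERT INTO audit_log VALUES ('DELETE', OLD.id);
    END;
    """
)

# The client only issues plain DML; the triggers fire automatically.
conn.execute("INSERT INTO T VALUES (1, 10)")
conn.execute("UPDATE T SET qty = 20 WHERE id = 1")
conn.execute("DELETE FROM T WHERE id = 1")

log = conn.execute("SELECT action, row_id FROM audit_log ORDER BY rowid").fetchall()
print(log)
```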

Page 57: IT Ready - DW: 1st Day

The Relational Objects...

• Database

– A collection of tables and associated indexes

[Diagram: a database containing the tables Department, Product, Customer and Employee, together with index files.]

Page 58: IT Ready - DW: 1st Day

The Relational Objects...

• Relation
– A named, two-dimensional table of data

• Database
– A collection of databases, tables and related objects organised in a structured fashion
– Several database vendors use schema interchangeably with database

Page 59: IT Ready - DW: 1st Day

Relational Objects...

Tables are comprised of rows and a fixed number of named columns.

Data is presented to the user as tables:

Column 1 Column 2 Column 3 Column 4

Row

Row

Row

Table

Page 60: IT Ready - DW: 1st Day

Relational Objects...

Columns are attributes describing an entity. Each column must have a unique name and a data type.

Data is presented to the user as tables:

Name Designation Department

Row

Row

Row

Employee

Structure of a relation (e.g. Employee):

Employee(Name, Designation, Department)

Page 61: IT Ready - DW: 1st Day

Relational Objects...

Rows are records that present information about a particular entity occurrence

Data is presented to the user as tables:

Employee
Name       Designation   Department
De Silva   Manager       Personnel
Perera     Secretary     Personnel
Dias       Manager       Sales

Page 62: IT Ready - DW: 1st Day

Relational Objects: Keys

• Key constraints
– If a relation has more than one key, they are called candidate keys
– One of them is chosen as the primary key

Primary Key: An attribute (or combination of attributes) that uniquely identifies each row in a relation.

Employee(Emp_No, Emp_Name, Department)

Composite Key: A primary key that consists of more than one attribute.

Salary(Emp_No, Eff_Date, Amount)
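A minimal sketch of the two relations above, using SQLite through Python's `sqlite3` module. The column types and the ISO date format are assumptions made for the example; the point is that the composite key (Emp_No, Eff_Date) allows several salary rows per employee but rejects a repeated pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee (
        Emp_No     INTEGER PRIMARY KEY,      -- single-column primary key
        Emp_Name   TEXT,
        Department TEXT
    );
    CREATE TABLE Salary (
        Emp_No   INTEGER,
        Eff_Date TEXT,
        Amount   INTEGER,
        PRIMARY KEY (Emp_No, Eff_Date)       -- composite key
    );
    """
)

conn.execute("INSERT INTO Salary VALUES (179, '1997-06-01', 7000)")
# Same employee with a new effective date: a different key, so it is accepted.
conn.execute("INSERT INTO Salary VALUES (179, '1998-01-01', 8000)")

try:
    # Same (Emp_No, Eff_Date) pair again: violates the composite key.
    conn.execute("INSERT INTO Salary VALUES (179, '1998-01-01', 8500)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

salary_rows = conn.execute("SELECT COUNT(*) FROM Salary").fetchone()[0]
print(duplicate_rejected, salary_rows)
```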

Page 63: IT Ready - DW: 1st Day

Relational Objects: Keys

Data is presented to the user as tables. Each table has a primary key. The primary key is a column or combination of columns that uniquely identifies each row of the table.

Employee (primary key: E-No)
E-No   E-Name   D-No
179    Silva    7
857    Perera   4
342    Dias     7

Salary (primary key: E-No, Eff-Date)
E-No   Eff-Date   Amt
179    1/1/98     8000
857    3/7/94     9000
179    1/6/97     7000
342    28/1/97    7500

Page 64: IT Ready - DW: 1st Day

Relational Objects: Relationship

Foreign Key: An attribute in a relation of a database that serves as the primary key of another relation in the same database

Employee(Emp_No, Emp_Name, Department)

Department(Dept_No, Dept_Name, M_No)

=== works for ==>

Page 65: IT Ready - DW: 1st Day

Relational Objects: Foreign Key

A foreign key is a set of columns in one table that serves as the primary key in another table.

Data is presented to the user as tables:

Department (primary key: D-No)
D-No   D-Name    M-No
4      Finance   857
7      Sales     179

Employee (primary key: E-No; foreign key: D-No, referencing Department)
E-No   E-Name   D-No
179    Silva    7
857    Perera   4
342    Dias     7

Rows in one or more tables are associated with each other solely through data values in columns (no pointers).

Page 66: IT Ready - DW: 1st Day

SQL

• A relational database language

– It is not a programming language but a comprehensive database sub-language for controlling and interacting with a database management system.

• NOT a DBMS

• A powerful data manipulation language

– It has capabilities for: insertion, update, deletion, query, protection

Page 67: IT Ready - DW: 1st Day

SQL (Cont’)

• Also designed for end users

• Non-procedural
– We state 'what is needed' and not 'how' (as we would in 'relational algebra')
– More similar to 'relational calculus'

• Used in two ways:
– Interactive
– Programmatic: Dynamic / Embedded

Page 68: IT Ready - DW: 1st Day

Role of SQL

• A database programming language

• A database administration language

• A client/server language

• A distributed database language

[Diagram: SQL is the interface to the relational DBMS, which holds the system catalog and the user tables.]

Page 69: IT Ready - DW: 1st Day

Role of SQL

• It is vendor independent.
– A user who is dissatisfied with a particular DBMS can switch products without much overhead, as both follow the same language standard.

• Client applications relatively portable.

• Programmer skills are portable.

• Supports many different client processes -- end-users, applications, developers, etc.

• Database servers use SQL to request services from each other.

Page 70: IT Ready - DW: 1st Day

SQL Basics: DDL

• CREATE TABLE     Adds a new table
• DROP TABLE       Removes existing tables
• ALTER TABLE      Modifies the structure of tables
• CREATE VIEW      Adds a new view
• DROP VIEW        Removes a view
• CREATE INDEX     Builds an index for a column
• DROP INDEX       Removes an index
• CREATE SYNONYM   Defines an alias for a database object
• DROP SYNONYM     Removes an alias
• COMMENTS         Describes a table or column
• LABEL            Defines a title for a table or column

Data Definition Language (DDL)

DDL defines the database: Physical Design

Page 71: IT Ready - DW: 1st Day

SQL Basics: DML

• SELECT Retrieves data

• INSERT Adds new rows of data

• DELETE Removes row of data

• UPDATE Modifies existing data

Data Manipulation Language (DML)

Page 72: IT Ready - DW: 1st Day

SQL Basics: DCL

• GRANT      Gives user access privileges
• REVOKE     Removes privileges
• COMMIT     Ends current transaction
• ROLLBACK   Aborts current transaction

Data Control Language (DCL)
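The transaction commands can be tried with a small sketch in Python's `sqlite3` module, where `commit()` and `rollback()` on the connection correspond to COMMIT and ROLLBACK. The `account` table and its values are hypothetical. SQLite has no user accounts, so GRANT and REVOKE are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()  # COMMIT: the inserted row is now permanent

# This update starts a new transaction, which is then aborted.
conn.execute("UPDATE account SET balance = balance - 500 WHERE id = 1")
conn.rollback()  # ROLLBACK: the update is undone
balance_after_rollback = conn.execute(
    "SELECT balance FROM account WHERE id = 1"
).fetchone()[0]

# This update is kept because the transaction is committed.
conn.execute("UPDATE account SET balance = balance + 50 WHERE id = 1")
conn.commit()
balance_after_commit = conn.execute(
    "SELECT balance FROM account WHERE id = 1"
).fetchone()[0]

print(balance_after_rollback, balance_after_commit)  # 100 150
```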

Page 73: IT Ready - DW: 1st Day

SQL Basics: Data Integrity

• Value of stored data can be lost in many ways:
– Invalid data added to the database
– Existing data modified to an incorrect value
– Changes lost due to system error or power failure
– Changes partially applied

• Types of integrity constraints:
– Required Data (NOT NULL)
– Validity Checking (CHECK)
– Entity Integrity (PRIMARY KEY & NOT NULL)
– Referential Integrity (FOREIGN KEY)
– Business Rules (ASSERTION, TRIGGER)
– Consistency (CASCADE, RESTRICT, SET NULL)
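A minimal sketch of the first three constraint types, using SQLite through Python's `sqlite3` module. The `Salary` column, the rule `Salary > 0` and the helper `try_insert` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employee (
        Emp_No   INTEGER PRIMARY KEY,          -- entity integrity
        Emp_Name TEXT NOT NULL,                -- required data
        Salary   INTEGER CHECK (Salary > 0)    -- validity checking
    )
    """
)


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Employee VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid row": try_insert((179, "Silva", 8000)),
    "missing name": try_insert((857, None, 9000)),
    "negative salary": try_insert((342, "Dias", -1)),
    "duplicate key": try_insert((179, "Perera", 7500)),
}
print(results)
```

Only the first row is stored; the DBMS itself rejects the other three, so no application code has to repeat these checks.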

Page 74: IT Ready - DW: 1st Day

SQL Basics: NULL values

• Null values provide a systematic way of handling missing or inapplicable data in SQL.

• It is inevitable that in the real world some data are missing, not yet known or do not apply.

• A null value is not a real data value.

• Special Handling
– Null values require special handling by SQL and the DBMS. Null values can be handled inconsistently by various SQL products.
– Example: How do we handle null values in summaries like SUM, AVERAGE, etc.?
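One concrete answer to the question above, for SQLite only (other products may differ, as the slide warns): aggregate functions skip NULLs. The table and values below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Salary (Emp_No INTEGER, Amount INTEGER)")
conn.executemany(
    "INSERT INTO Salary VALUES (?, ?)",
    [(179, 8000), (857, None), (342, 7000)],  # one amount is unknown (NULL)
)

summary = conn.execute(
    """
    SELECT COUNT(*),                  -- counts rows, including the NULL one
           COUNT(Amount),             -- counts only non-NULL amounts
           SUM(Amount),               -- NULL is skipped
           AVG(Amount),               -- average over the 2 known values
           AVG(COALESCE(Amount, 0))   -- average if NULL is treated as 0
    FROM Salary
    """
).fetchone()
print(summary)  # (3, 2, 15000, 7500.0, 5000.0)
```

The last two columns show why the choice matters: treating an unknown salary as zero changes the average from 7500 to 5000.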

Page 75: IT Ready - DW: 1st Day

Referential Integrity

Referential integrity constraints define the rules for associating rows with each other, i.e. columns which reference columns in other tables:

Every non-null value in a foreign key must have a corresponding value in the primary key which it references.

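[Diagram: Department (parent table) has Dept-No values D1, D3, D2, D7. Employee (dependent table) has Emp-No and Dept-No columns, with Dept-No values D7, ?, D1, D3, ?, D7, where ? marks a null. INSERT ROW and UPDATE COLUMN operations on Employee (example value D2) are checked against Department.]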

A row can be inserted or a column updated in the dependent table only if (1) there is a corresponding primary key value in the parent table, or (2) the foreign key value is set null.
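The insert and update rule can be sketched with SQLite through Python's `sqlite3` module. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; the table layout and the helper `accepted` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    CREATE TABLE Department (Dept_No TEXT PRIMARY KEY);
    CREATE TABLE Employee (
        Emp_No  INTEGER PRIMARY KEY,
        Dept_No TEXT REFERENCES Department (Dept_No)
    );
    INSERT INTO Department VALUES ('D1'), ('D2'), ('D3'), ('D7');
    """
)


def accepted(sql, params=()):
    """Run one statement; return False if the DBMS rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


insert_existing = accepted("INSERT INTO Employee VALUES (?, ?)", (1, "D7"))
insert_null = accepted("INSERT INTO Employee VALUES (?, ?)", (2, None))
insert_missing = accepted("INSERT INTO Employee VALUES (?, ?)", (3, "D9"))
update_missing = accepted("UPDATE Employee SET Dept_No = 'D9' WHERE Emp_No = 1")
print(insert_existing, insert_null, insert_missing, update_missing)
```

The first two statements satisfy conditions (1) and (2) respectively; the last two reference a department that does not exist and are rejected.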

Page 76: IT Ready - DW: 1st Day

Referential Integrity

Deleting parent rows

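[Diagram: a DELETE ROW on Department (parent table, Dept-No values D1, D3, D2, D7) affects the dependent table (Emp-No and Dept-No columns, Dept-No values D7, ?, D1, D3, ?, D7) according to one of three rules: CASCADE, RESTRICT or SET NULL.]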

Database designers must explicitly declare the effect of a delete from the parent table on the dependent table:

CASCADE deletes associated dependent rows

RESTRICT will not allow delete from the parent table if there are associated dependent rows.

SET NULL sets the value of associated dependent columns to null values.
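The three delete rules can be compared side by side in a small sketch, again using SQLite through Python's `sqlite3` module. The three dependent tables (`Emp_Cascade`, `Emp_SetNull`, `Emp_Restrict`) are hypothetical and exist only so that each rule can be observed separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for FK enforcement in SQLite
conn.executescript(
    """
    CREATE TABLE Department (Dept_No TEXT PRIMARY KEY);
    CREATE TABLE Emp_Cascade (
        Emp_No INTEGER PRIMARY KEY,
        Dept_No TEXT REFERENCES Department (Dept_No) ON DELETE CASCADE
    );
    CREATE TABLE Emp_SetNull (
        Emp_No INTEGER PRIMARY KEY,
        Dept_No TEXT REFERENCES Department (Dept_No) ON DELETE SET NULL
    );
    CREATE TABLE Emp_Restrict (
        Emp_No INTEGER PRIMARY KEY,
        Dept_No TEXT REFERENCES Department (Dept_No) ON DELETE RESTRICT
    );
    INSERT INTO Department VALUES ('D1'), ('D2'), ('D3');
    INSERT INTO Emp_Cascade VALUES (1, 'D1');
    INSERT INTO Emp_SetNull VALUES (2, 'D2');
    INSERT INTO Emp_Restrict VALUES (3, 'D3');
    """
)

# CASCADE: deleting D1 also deletes the dependent row.
conn.execute("DELETE FROM Department WHERE Dept_No = 'D1'")
cascade_rows = conn.execute("SELECT COUNT(*) FROM Emp_Cascade").fetchone()[0]

# SET NULL: deleting D2 keeps the dependent row but nulls its foreign key.
conn.execute("DELETE FROM Department WHERE Dept_No = 'D2'")
setnull_value = conn.execute(
    "SELECT Dept_No FROM Emp_SetNull WHERE Emp_No = 2"
).fetchone()[0]

# RESTRICT: deleting D3 is refused while a dependent row exists.
try:
    conn.execute("DELETE FROM Department WHERE Dept_No = 'D3'")
    restrict_blocked = False
except sqlite3.IntegrityError:
    restrict_blocked = True

print(cascade_rows, setnull_value, restrict_blocked)  # 0 None True
```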

Page 77: IT Ready - DW: 1st Day

SQL for Data Manipulation

• Manipulation
– SQL allows a user or an application program to update the database by adding new data, removing old data, and modifying previously stored data.

• Retrieval
– SQL allows a user or an application program to retrieve stored data from the database and use it.

• Most Commonly Used Commands
– SELECT, INSERT
– UPDATE, DELETE

Page 78: IT Ready - DW: 1st Day

SQL for Data Manipulation

• High-level Language for data manipulation

• It does not require predefined navigation path

• It does not require knowledge of any key items

• It is a uniform language for end-users and programmers

• It operates on one or more tables based on set theory, not on a record at a time.

Page 79: IT Ready - DW: 1st Day

Command: SELECT

• Function:
– Retrieves data from one or more tables. Every SELECT statement produces a table of query results containing one or more columns and zero or more rows.

SELECT [ALL | DISTINCT] select-item, ...
FROM table-specification, ...
[WHERE search-condition]
[GROUP BY group-column, ...]
[HAVING search-condition]
[ORDER BY sort-specification, ...]
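To see all the optional clauses working together, here is a small sketch run through Python's `sqlite3` module. The `Sales` table and its rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Region TEXT, Product TEXT, Amount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [
        ("North", "A", 100), ("North", "B", 300),
        ("South", "A", 50), ("South", "B", 20),
        ("East", "A", 500), ("East", "B", 5),
    ],
)

totals = conn.execute(
    """
    SELECT Region, SUM(Amount) AS Total   -- select-items
    FROM Sales                            -- table specification
    WHERE Amount >= 10                    -- filters rows before grouping
    GROUP BY Region                       -- one result row per region
    HAVING SUM(Amount) > 100              -- filters groups after aggregation
    ORDER BY Total DESC                   -- sort specification
    """
).fetchall()
print(totals)  # [('East', 500), ('North', 400)]
```

WHERE removes the single row with Amount 5 before grouping; HAVING then removes the South group, whose total of 70 is not above 100.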

Page 80: IT Ready - DW: 1st Day

Project Selected Columns

The "Persons" table:

P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger

SELECT LastName, FirstName FROM Persons

LastName   FirstName
Hansen     Ola
Svendson   Tove
Pettersen  Kari

SELECT P_Id, LastName, FirstName FROM Persons ORDER BY LastName

P_Id  LastName   FirstName
1     Hansen     Ola
4     Nilsen     Tom
3     Pettersen  Kari
2     Svendson   Tove

(The second result comes from a Persons table that also contains row 4, Nilsen Tom.)
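Both queries on this slide can be run as written. The sketch below assumes Python's built-in sqlite3 module and loads only the three Persons rows listed in the table, so the sorted result has three rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Persons "
    "(P_Id INTEGER, LastName TEXT, FirstName TEXT, Address TEXT, City TEXT)"
)
conn.executemany(
    "INSERT INTO Persons VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Hansen", "Ola", "Timoteivn 10", "Sandnes"),
        (2, "Svendson", "Tove", "Borgvn 23", "Sandnes"),
        (3, "Pettersen", "Kari", "Storgt 20", "Stavanger"),
    ],
)

# Projection: keep only the named columns (row order is not guaranteed without ORDER BY)
projected = conn.execute("SELECT LastName, FirstName FROM Persons").fetchall()

# Projection plus sorting on LastName
ordered = conn.execute(
    "SELECT P_Id, LastName, FirstName FROM Persons ORDER BY LastName"
).fetchall()

print(projected)
print(ordered)
```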

Page 81: IT Ready - DW: 1st Day

Restrict Rows

The "Persons" table:

P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger

SELECT * FROM Persons WHERE City='Sandnes'

P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
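A sketch of the same row restriction from application code, assuming Python's built-in sqlite3 module. The city is passed through a "?" placeholder rather than pasted into the SQL string, which is the usual way to supply values from a program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Persons "
    "(P_Id INTEGER, LastName TEXT, FirstName TEXT, Address TEXT, City TEXT)"
)
conn.executemany(
    "INSERT INTO Persons VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Hansen", "Ola", "Timoteivn 10", "Sandnes"),
        (2, "Svendson", "Tove", "Borgvn 23", "Sandnes"),
        (3, "Pettersen", "Kari", "Storgt 20", "Stavanger"),
    ],
)

city = "Sandnes"
# WHERE keeps only the rows for which the search condition is true
rows = conn.execute("SELECT * FROM Persons WHERE City = ?", (city,)).fetchall()
print(rows)
```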

Page 82: IT Ready - DW: 1st Day

Equal Join

The "Persons" table:

P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger

The "Orders" table:

O_Id  OrderNo  P_Id
1     77895    3
2     44678    3
3     22456    1
4     24562    1
5     34764    15

SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons, Orders
WHERE Persons.P_Id = Orders.P_Id
ORDER BY Persons.LastName

LastName   FirstName  OrderNo
Hansen     Ola        22456
Hansen     Ola        24562
Pettersen  Kari       77895
Pettersen  Kari       44678
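The equal join can be checked with the slide's data. This sketch assumes Python's built-in sqlite3 module; OrderNo is added as a second sort key (not in the slide's query) so the row order is fully determined, and the explicit JOIN ... ON form is shown as an equivalent spelling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Persons (P_Id INTEGER, LastName TEXT, FirstName TEXT)")
conn.execute("CREATE TABLE Orders (O_Id INTEGER, OrderNo INTEGER, P_Id INTEGER)")
conn.executemany(
    "INSERT INTO Persons VALUES (?, ?, ?)",
    [(1, "Hansen", "Ola"), (2, "Svendson", "Tove"), (3, "Pettersen", "Kari")],
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, 77895, 3), (2, 44678, 3), (3, 22456, 1), (4, 24562, 1), (5, 34764, 15)],
)

# Join written as on the slide: two tables in FROM, equality test in WHERE
implicit = conn.execute(
    "SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo "
    "FROM Persons, Orders "
    "WHERE Persons.P_Id = Orders.P_Id "
    "ORDER BY Persons.LastName, Orders.OrderNo"
).fetchall()

# Same join with explicit JOIN ... ON syntax
explicit = conn.execute(
    "SELECT p.LastName, p.FirstName, o.OrderNo "
    "FROM Persons AS p JOIN Orders AS o ON p.P_Id = o.P_Id "
    "ORDER BY p.LastName, o.OrderNo"
).fetchall()

print(implicit)
```

Svendson (no orders) and order 34764 (P_Id 15 has no matching person) do not appear, because an equal join keeps only matching pairs.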

Page 83: IT Ready - DW: 1st Day

SQL Data Retrieval

Basic Search Conditions:

• Comparison

– Equal to: =

– Not equal to: != or <> or ^= (<> is the standard SQL form; != and ^= are vendor extensions)

– Less than: <

– Less than or equal to: <=

– Greater than: >

– Greater than or equal to: >=

Page 84: IT Ready - DW: 1st Day

SQL Data Retrieval

Basic Search Conditions:

• Range ([NOT] BETWEEN)

– expression-1 [NOT] BETWEEN expression-2 AND expression-3

– Example: WEIGHT BETWEEN 50 AND 60

• Set Membership ([NOT] IN)

– Example 1: WHERE Emp_No IN ('E1', 'E2', 'E3')

– Example 2: WHERE Emp_No IN (SELECT Emp_No FROM Employee WHERE Dept_No = '7')
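A sketch of the range and set-membership tests, assuming Python's built-in sqlite3 module. The Employee table and its four rows are made up for illustration; only the column names Emp_No, Dept_No and Weight echo the slide's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Emp_No TEXT, Dept_No TEXT, Weight INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [("E1", "7", 48), ("E2", "7", 50), ("E3", "3", 60), ("E4", "3", 61)],
)

def emp_nos(condition):
    """Return the Emp_No values that satisfy a search condition (hypothetical helper)."""
    sql = f"SELECT Emp_No FROM Employee WHERE {condition} ORDER BY Emp_No"
    return [row[0] for row in conn.execute(sql)]

between = emp_nos("Weight BETWEEN 50 AND 60")          # both end points are included
not_between = emp_nos("Weight NOT BETWEEN 50 AND 60")
in_list = emp_nos("Emp_No IN ('E1', 'E2', 'E3')")
in_subquery = emp_nos(
    "Emp_No IN (SELECT Emp_No FROM Employee WHERE Dept_No = '7')"
)
print(between, not_between, in_list, in_subquery)
```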

Page 85: IT Ready - DW: 1st Day

SQL Data Retrieval

Basic Search Conditions:

• Pattern Matching ([NOT] LIKE)

– expression-1 [NOT] LIKE {special-register | host-variable | string-constant}

– Example: WHERE Proj_Name LIKE 'INFORM%'

• Null Value (IS [NOT] NULL)

– Example: WHERE Proj_Name IS NOT NULL
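A sketch of pattern matching and null tests, assuming Python's built-in sqlite3 module and a hypothetical Project table (the rows are invented; only the Proj_Name column comes from the slide). It also shows that comparing with "= NULL" matches nothing, which is why IS NULL exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Project (Proj_No TEXT, Proj_Name TEXT)")
conn.executemany(
    "INSERT INTO Project VALUES (?, ?)",
    [
        ("P1", "INFORMATION SYSTEM"),
        ("P2", "INFORMIX UPGRADE"),
        ("P3", "PAYROLL"),
        ("P4", None),  # project without a name yet
    ],
)

def proj_nos(condition):
    """Return the Proj_No values that satisfy a search condition (hypothetical helper)."""
    sql = f"SELECT Proj_No FROM Project WHERE {condition} ORDER BY Proj_No"
    return [row[0] for row in conn.execute(sql)]

like = proj_nos("Proj_Name LIKE 'INFORM%'")          # % matches any run of characters
not_like = proj_nos("Proj_Name NOT LIKE 'INFORM%'")  # the null name is not returned either
not_null = proj_nos("Proj_Name IS NOT NULL")
is_null = proj_nos("Proj_Name IS NULL")
equals_null = proj_nos("Proj_Name = NULL")           # never true, so no rows
print(like, not_like, not_null, is_null, equals_null)
```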

Page 86: IT Ready - DW: 1st Day

SQL Data Retrieval

Compound Search Conditions:

• AND, OR and NOT

– Example: WHERE Proj_Name LIKE 'INFORM%' AND Emp_Name = 'DIAS'
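A sketch of how AND, OR and NOT combine the same two conditions, assuming Python's built-in sqlite3 module. The Assignment table and its rows are hypothetical; the column names follow the slide's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Assignment (Emp_Name TEXT, Proj_Name TEXT)")
conn.executemany(
    "INSERT INTO Assignment VALUES (?, ?)",
    [
        ("DIAS", "INFORMATION SYSTEM"),
        ("DIAS", "PAYROLL"),
        ("SILVA", "INFORMATION SYSTEM"),
        ("PERERA", "PAYROLL"),
    ],
)

def matches(condition):
    """Return the rows that satisfy a search condition (hypothetical helper)."""
    sql = (
        "SELECT Emp_Name, Proj_Name FROM Assignment "
        f"WHERE {condition} ORDER BY Emp_Name, Proj_Name"
    )
    return conn.execute(sql).fetchall()

both = matches("Proj_Name LIKE 'INFORM%' AND Emp_Name = 'DIAS'")       # both must hold
either = matches("Proj_Name LIKE 'INFORM%' OR Emp_Name = 'DIAS'")      # at least one holds
neither = matches("NOT (Proj_Name LIKE 'INFORM%' OR Emp_Name = 'DIAS')")
print(both, either, neither)
```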

Page 87: IT Ready - DW: 1st Day

SQL Query Features

• Summary Queries

– Summarize data from the database. In general, summary queries use SQL aggregate functions (AVG, MIN, MAX, SUM, COUNT, and so on) to collapse a column of data values into a single value that summarizes the column.

• Sub-Queries

– Use the results of one query to help define another query

Page 88: IT Ready - DW: 1st Day

Summarising Data

The "Orders" table:

O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen

SELECT COUNT(Customer) AS CustomerNilsen FROM Orders WHERE Customer='Nilsen'

CustomerNilsen
2

SELECT AVG(OrderPrice) AS OrderAverage FROM Orders

OrderAverage
950
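The two summary queries can be reproduced with the slide's six orders. A minimal sketch, assuming Python's built-in sqlite3 module and dates stored as plain text:

```python
import sqlite3

ORDERS = [
    (1, "2008/11/12", 1000, "Hansen"),
    (2, "2008/10/23", 1600, "Nilsen"),
    (3, "2008/09/02", 700, "Hansen"),
    (4, "2008/09/03", 300, "Hansen"),
    (5, "2008/08/30", 2000, "Jensen"),
    (6, "2008/10/04", 100, "Nilsen"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (O_Id INTEGER, OrderDate TEXT, OrderPrice INTEGER, Customer TEXT)"
)
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?)", ORDERS)

# COUNT collapses the matching rows into a single number
nilsen_orders = conn.execute(
    "SELECT COUNT(Customer) AS CustomerNilsen FROM Orders WHERE Customer = 'Nilsen'"
).fetchone()[0]

# AVG collapses the whole OrderPrice column into its mean: 5700 / 6
order_average = conn.execute(
    "SELECT AVG(OrderPrice) AS OrderAverage FROM Orders"
).fetchone()[0]

print(nilsen_orders, order_average)
```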

Page 89: IT Ready - DW: 1st Day

GROUP BY

The GROUP BY clause groups the rows produced by the preceding clauses (FROM and WHERE), so that an aggregate function returns one value per group.

The "Orders" table:

O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen

SELECT Customer, SUM(OrderPrice) FROM Orders GROUP BY Customer

Customer  SUM(OrderPrice)
Hansen    2000
Nilsen    1700
Jensen    2000
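A sketch of the grouped total per customer, assuming Python's built-in sqlite3 module. An ORDER BY Customer is added (it is not in the slide's query) because GROUP BY alone does not promise any output order.

```python
import sqlite3

ORDERS = [
    (1, "2008/11/12", 1000, "Hansen"),
    (2, "2008/10/23", 1600, "Nilsen"),
    (3, "2008/09/02", 700, "Hansen"),
    (4, "2008/09/03", 300, "Hansen"),
    (5, "2008/08/30", 2000, "Jensen"),
    (6, "2008/10/04", 100, "Nilsen"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (O_Id INTEGER, OrderDate TEXT, OrderPrice INTEGER, Customer TEXT)"
)
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?)", ORDERS)

# One output row per distinct Customer; SUM is computed inside each group
totals = conn.execute(
    "SELECT Customer, SUM(OrderPrice) FROM Orders GROUP BY Customer ORDER BY Customer"
).fetchall()
print(totals)
```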

Page 90: IT Ready - DW: 1st Day

HAVING

Used to select the groups that meet specified conditions. Normally used with the GROUP BY clause.

The "Orders" table:

O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen

SELECT Customer, SUM(OrderPrice) FROM Orders GROUP BY Customer HAVING SUM(OrderPrice)<2000

Customer  SUM(OrderPrice)
Nilsen    1700
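HAVING filters groups after aggregation, while WHERE filters rows before it. The sketch below, assuming Python's built-in sqlite3 module, runs the slide's HAVING query and, for contrast, a WHERE version that is not on the slide.

```python
import sqlite3

ORDERS = [
    (1, "2008/11/12", 1000, "Hansen"),
    (2, "2008/10/23", 1600, "Nilsen"),
    (3, "2008/09/02", 700, "Hansen"),
    (4, "2008/09/03", 300, "Hansen"),
    (5, "2008/08/30", 2000, "Jensen"),
    (6, "2008/10/04", 100, "Nilsen"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (O_Id INTEGER, OrderDate TEXT, OrderPrice INTEGER, Customer TEXT)"
)
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?)", ORDERS)

# HAVING: build every customer's total first, then keep totals below 2000
having_groups = conn.execute(
    "SELECT Customer, SUM(OrderPrice) FROM Orders "
    "GROUP BY Customer HAVING SUM(OrderPrice) < 2000 ORDER BY Customer"
).fetchall()

# WHERE: drop single orders of 2000 or more first, then total what is left
where_rows = conn.execute(
    "SELECT Customer, SUM(OrderPrice) FROM Orders "
    "WHERE OrderPrice < 2000 GROUP BY Customer ORDER BY Customer"
).fetchall()

print(having_groups)
print(where_rows)
```

Hansen's total of 2000 survives the WHERE version because each of Hansen's individual orders is below 2000, but it is removed by HAVING.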

Page 91: IT Ready - DW: 1st Day

Nested Queries

A subquery is a SELECT statement nested inside the WHERE clause of another SELECT statement. Its result is needed to solve the main query.

The "Orders" table:

O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen

SELECT Customer FROM Orders
WHERE OrderPrice > (SELECT AVG(OrderPrice) FROM Orders)

Customer
Hansen
Nilsen
Jensen
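A sketch of the nested query, assuming Python's built-in sqlite3 module. The inner SELECT yields the average price (950), and the outer SELECT keeps the orders priced above it; ORDER BY O_Id is added here only to fix the output order.

```python
import sqlite3

ORDERS = [
    (1, "2008/11/12", 1000, "Hansen"),
    (2, "2008/10/23", 1600, "Nilsen"),
    (3, "2008/09/02", 700, "Hansen"),
    (4, "2008/09/03", 300, "Hansen"),
    (5, "2008/08/30", 2000, "Jensen"),
    (6, "2008/10/04", 100, "Nilsen"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (O_Id INTEGER, OrderDate TEXT, OrderPrice INTEGER, Customer TEXT)"
)
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?)", ORDERS)

# The subquery in parentheses is evaluated to a single value used by the outer WHERE
above_average = [
    row[0]
    for row in conn.execute(
        "SELECT Customer FROM Orders "
        "WHERE OrderPrice > (SELECT AVG(OrderPrice) FROM Orders) "
        "ORDER BY O_Id"
    )
]
print(above_average)
```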

Page 92: IT Ready - DW: 1st Day

Case Study

Page 93: IT Ready - DW: 1st Day

Case Study: Wal*Mart

• Founded by Sam Walton

• One of the largest supermarket chains in the US

• Wal*Mart: 2000+ retail stores

• SAM's Clubs: 100+ wholesale stores

This case study is from Felipe Carino's (NCR Teradata) presentation at the Stanford Database Seminar.

Page 94: IT Ready - DW: 1st Day

Old Retail Paradigm

• Wal*Mart

– Inventory Management

– Merchandise Accounts Payable

– Purchasing

– Supplier Promotions: National, Region, Store Level

• Suppliers

– Accept Orders

– Promote Products

– Provide Special Incentives

– Monitor and Track the Incentives

– Bill and Collect Receivables

– Estimate Retailer Demands

Page 95: IT Ready - DW: 1st Day

New (Just-In-Time) Retail Paradigm

• No more deals

• Shelf-Pass Through (POS Application)

– One Unit Price: suppliers paid once a week on ACTUAL items sold

– Wal*Mart Manager: daily inventory restock; suppliers (sometimes same day) ship to Wal*Mart

• Warehouse-Pass Through

– Stock some large items: delivery may come from supplier

– Distribution Center: supplier's merchandise unloaded directly onto Wal*Mart trucks

Page 96: IT Ready - DW: 1st Day

Wal*Mart System

• NCR 5100M, 96 nodes: 24 TB raw disk; 700 - 1000 Pentium CPUs

• Number of Rows: > 5 billion

• Historical Data: 65 weeks (5 quarters)

• New Daily Volume:

– Current Apps: 75 million

– New Apps: 100 million +

• Number of Users: Thousands

• Number of Queries: 60,000 per week

Page 97: IT Ready - DW: 1st Day

References/External Links

(1) Data Warehousing & Data Mining S. Sudarshan Krithi Ramamritham IIT Bombay

(2) Data Warehousing, Hu Yan

(3) What is a Data Warehouse? http://blog.maia-intelligence.com/2008/04/29/what-is-a-data-warehouse/

(4) Database Management Systems (DBMS) http://www.bit.lk/teachingmaterial/IT2302/index.htm

(5) SQL Tutorial http://www.w3schools.com/sql/default.asp

Page 98: IT Ready - DW: 1st Day

Thank you for your attention!
