Northern Region IT Professional Development Program 2010
Data Warehousing (DAY 1)
Siwawong W., Project Manager
2010.05.24
Agenda
09:00 – 09:15 Registration
09:15 – 09:30 Self-Introduction
09:30 – 10:30 Data Warehouse: Introduction
10:30 – 10:45 Break & Morning Refreshment
10:45 – 12:00 Data Warehouse: Introduction (Cont’d)
12:00 – 13:00 Lunch Break
13:00 – 15:00 Review RDBMS & SQL command
15:00 – 15:15 Break
15:15 – 16:00 Case Study ~ Q/A
SELF-INTRODUCTION
About Me
• My Name: Siwawong Wuttipongprasert
– Nickname: Tae (you can call me this; it’s easier)
• My Background:
– B.Eng (Computer Engineering), Chiang Mai University
• My Career Profile:
– 10+ years in the IT business
– 5+ years with Blue Ball Co., Ltd.
– Roles: Programmer, System Analyst, Consultant & Project Manager
– Working areas: ERP, MRP, Retailing, Banking, Financial, E-Commerce, etc.
– Working with multiple cultures: Japanese, German and Vietnamese
• Know Me More..
My Company: Blue Ball
Blue Ball Group is an offshoring company that focuses entirely on customer satisfaction. It combines Western management with Asian human resources to provide high-quality services.
Thailand (Head Office)
Mexico (Special Developments)
Vietnam (Offshoring Center)
Services from My Company
Offshoring Programmers & Testers: Blue Ball will get you ready to offshore successfully. We will not rush you into offshoring before you feel confident about how to send, organize, receive, test and accept jobs.
System Development & Embedded Solutions: Solutions that combine technological expertise and deep business understanding. We only start coding once every detail, such as milestones, scheduling, contact points, communication, issue management and critical protocols, is in place.
Web Design and E-commerce: Premium web design, CMS, e-commerce solutions and SEO services. Website maintenance and copy content creation to develop marketing campaigns that SELL, for discerning companies that want to increase the quality and reach of their marketing campaigns.
My Clients
Data Warehouse: Introduction
Data Warehouse: Introduction
• Data Warehousing, OLAP and data mining:
– what and why (now)?
• Relation to OLTP
• Review RDBMS & SQL Command
• A case study
Data Warehouse: What & Why?
Problem Statements
A producer wants to know...
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?
Data, Data everywhere, yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
What are the users saying...
• Data should be integrated across the enterprise
• Summary data has a real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
Data -> Information
Evolution
• 60’s: Batch reports
– hard to find and analyze information
– inflexible and expensive, reprogram for every new request
• 70’s: Terminal-based DSS and EIS (executive information systems)
– still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
– query tools, spreadsheets, GUIs
– easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP engines and tools
Very Large Data Bases
Terabytes -- 10^12 bytes
Petabytes -- 10^15 bytes
Exabytes -- 10^18 bytes
Zettabytes -- 10^21 bytes
Yottabytes -- 10^24 bytes
Examples of very large data stores: Walmart (24 Terabytes), Intelligence Agency Videos, Geographic Information Systems, National Medical Records, Weather images
Data Warehousing -- It is a process
• A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database
Data Warehouse
• A data warehouse is a
– subject-oriented: organized based on use
– integrated: inconsistencies removed
– time-varying: data are normally time-series
– non-volatile: stored in read-only format
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
Data Warehouse: Subject-Oriented
The WH is organized around the major subjects of the enterprise rather than the major application areas.
This is reflected in the need to store decision-support data rather than application-oriented data.
[Diagram: DB WH "Sales" (subject-oriented) vs. operational DB "Order Processing" (application-oriented)]
Data Warehouse: Integrated
The WH is integrated because its source data come together from different enterprise-wide application systems.
The source data are often inconsistent, so the integrated data must be made consistent to present a unified view of the data to the users.
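A minimal sketch of what "made consistent" can mean in practice, assuming two hypothetical source applications that encode the same attribute differently. The table name dw_customer, the sample rows and the code mapping are all illustrative assumptions, and SQLite (through Python's sqlite3 module) stands in for a real warehouse engine.

```python
import sqlite3

# Hypothetical extracts: two applications encode gender differently.
ORDER_APP_ROWS = [("C1", "M"), ("C2", "F")]
BILLING_APP_ROWS = [("C3", "1"), ("C4", "0")]

# Assumed mapping of every source code onto one warehouse convention.
GENDER_MAP = {"M": "male", "F": "female", "1": "male", "0": "female"}


def load_integrated():
    """Load both extracts into one table using a single consistent encoding."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dw_customer (cust_id TEXT PRIMARY KEY, gender TEXT, source TEXT)"
    )
    for source, rows in (("orders", ORDER_APP_ROWS), ("billing", BILLING_APP_ROWS)):
        for cust_id, code in rows:
            conn.execute(
                "INSERT INTO dw_customer VALUES (?, ?, ?)",
                (cust_id, GENDER_MAP[code], source),
            )
    result = conn.execute(
        "SELECT cust_id, gender FROM dw_customer ORDER BY cust_id"
    ).fetchall()
    conn.close()
    return result
```

After loading, every row uses the same encoding regardless of which application it came from, which is the unified view the slide describes.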
Data Warehouse: time-varying
The source data in the WH is only accurate and valid at some point in time or over some time interval.
The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots
Historical data is recorded
Data Warehouse: Non-volatile
Data is NOT updated in real time but is refreshed from the operational systems (OS) on a regular basis. New data is always added as a supplement to the DB, rather than as a replacement. The DB continually absorbs this new data, incrementally integrating it with previous data.
Anyone who is using the database has confidence that a query will always produce the same result no matter how often it is run
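The append-only refresh described above can be sketched as follows. This is an illustration under assumptions, not a prescribed design: the stock_snapshot table, its columns and the sample quantities are hypothetical, and SQLite via Python's sqlite3 module is used only because it runs anywhere.

```python
import sqlite3


def build_warehouse():
    conn = sqlite3.connect(":memory:")
    # One row per product per snapshot date: history is kept, never overwritten.
    conn.execute(
        "CREATE TABLE stock_snapshot ("
        " snapshot_date TEXT, product TEXT, qty INTEGER,"
        " PRIMARY KEY (snapshot_date, product))"
    )
    return conn


def refresh(conn, snapshot_date, operational_rows):
    """Periodic refresh: add the new snapshot next to the earlier ones."""
    conn.executemany(
        "INSERT INTO stock_snapshot VALUES (?, ?, ?)",
        [(snapshot_date, product, qty) for product, qty in operational_rows],
    )
    conn.commit()


def qty_as_of(conn, snapshot_date, product):
    row = conn.execute(
        "SELECT qty FROM stock_snapshot WHERE snapshot_date = ? AND product = ?",
        (snapshot_date, product),
    ).fetchone()
    return row[0] if row else None


conn = build_warehouse()
refresh(conn, "2010-05-23", [("pen", 100)])
before = qty_as_of(conn, "2010-05-23", "pen")
refresh(conn, "2010-05-24", [("pen", 80)])  # the next refresh supplements the data
after = qty_as_of(conn, "2010-05-23", "pen")  # same query, run again later
```

Because the second refresh only adds rows, the query for the earlier date returns the same answer before and after it.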
Explorers, Farmers and Tourists
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
Farmers: Harvest information from known access paths
Tourists: Browse information harvested by farmers
Data Warehouse Architecture
[Diagram: Relational Databases, Legacy Data, Purchased Data and ERP Systems feed Extraction / Cleansing and the Optimized Loader, which fill the Data Warehouse Engine and its Metadata Repository; users Analyze / Query the warehouse]
OLAP & Data Mining
Data Warehouse for DS & OLAP
• Putting information technology to work to help the knowledge worker make faster and better decisions
– Which of my customers are most likely to go to the competition?
– What product promotions have the biggest impact on revenue?
– How did the share price of software companies correlate with profits over last 10 years?
Decision Support (DS)
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business and make judgements
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
We want to know ...
• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
• If I raise the price of my product by Rs. 2, what is the effect on my ROI?
• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
• Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Application Areas
Industry                Application
Finance                 Credit card analysis
Insurance               Claims, fraud analysis
Telecommunication       Call record analysis
Transport               Logistics management
Consumer goods          Promotion analysis
Data service providers  Value-added data
Utilities               Power usage analysis
Data Mining in Use
• The US Government uses Data Mining to track fraud
• A Supermarket becomes an information broker
• Basketball teams use it to track game strategy
• Cross Selling
• Warranty Claims Routing
• Holding on to Good Customers
• Weeding out Bad Customers
What makes data mining possible?
• Advances in the following areas are making data mining deployable:
– data warehousing
– better and more data (i.e., operational, behavioral, and demographic)
– the emergence of easily deployed data mining tools and
– the advent of new data mining techniques.
-- Gartner Group
Why Separate Data Warehouse?
• Performance
– Operational DBs are designed & tuned for known transactions & workloads.
– Complex OLAP queries would degrade performance for operational transactions.
– Special data organization, access & implementation methods are needed for multidimensional views & queries.
• Function
– Missing data: Decision support requires historical data, which operational DBs do not typically maintain.
– Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational DBs, external sources.
– Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
What’s OLTP?
• DBMS built for OnLine Transaction Processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind
• Example: OLTP systems are designed to maximize transaction processing capacity, while data warehouses are designed to support ad hoc query processing
What are Operational Systems?
• They are OLTP systems
• Run mission critical applications
• Need to work with stringent performance requirements for routine tasks
• Used to run a business!
RDBMS used for OLTP
• Database Systems have been used traditionally for OLTP
– clerical data processing tasks
– detailed, up-to-date data
– structured repetitive tasks
– read/update a few records
– isolation, recovery and integrity are critical
Operational Systems
• Run the business in real time
• Based on up-to-the-second data
• Optimized to handle large numbers of simple read/write transactions
• Optimized for fast response to predefined transactions
• Used by people who deal with customers, products -- clerks, salespeople etc.
• They are increasingly used by customers
Examples of Operational Data
Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy applications, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call Record | Telecomm. | Billing | Legacy application, hierarchical database, mainframe | Very large
Production Record | Mfg. | Control production | ERP, RDBMS, AS/400 | Medium
Relation to OLTP
Application-Orientation vs. Subject-Orientation
Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
OLTP vs. Data Warehouse
• OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse
• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
– e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
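The example query above could look roughly like this in SQL. The calls table, its columns and the sample rows are assumptions made for illustration; the date functions shown (strftime, time) are SQLite's and other products spell them differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical call-record table; call_start is stored as ISO-8601 text.
conn.execute("CREATE TABLE calls (city TEXT, call_start TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2009-12-01 10:30:00", 20.0),    # counted
        ("Pune", "2009-12-02 16:59:00", 40.0),    # counted
        ("Pune", "2009-12-03 20:00:00", 99.0),    # outside 9AM-5PM
        ("Mumbai", "2009-12-01 11:00:00", 50.0),  # other city
        ("Pune", "2009-11-30 12:00:00", 70.0),    # other month
    ],
)

# Three dimensions restrict the rows: place, month and time of day.
QUERY = """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', call_start) = '12'
      AND time(call_start) >= '09:00:00'
      AND time(call_start) <  '17:00:00'
"""
average_spend = conn.execute(QUERY).fetchone()[0]
```

A query like this scans many rows and aggregates them, which is the access pattern a warehouse is organized for and an OLTP system is not.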
OLTP vs. Data Warehouse
• OLTP
– Application Oriented
– Used to run business
– Detailed data
– Current, up to date
– Isolated Data
– Repetitive access
– Clerical User
• Warehouse (DSS)
– Subject Oriented
– Used to analyze business
– Summarized and refined
– Snapshot data
– Integrated Data
– Ad-hoc access
– Knowledge User (Manager)
OLTP vs. Data Warehouse
• OLTP
– Performance sensitive
– Few records accessed at a time (tens)
– Read/update access
– No data redundancy
– Database size 100 MB - 100 GB
• Data Warehouse
– Performance relaxed
– Large volumes accessed at a time (millions)
– Mostly read (batch update)
– Redundancy present
– Database size 100 GB - a few terabytes
OLTP vs. Data Warehouse
• OLTP
– Transaction throughput is the performance metric
– Thousands of users
– Managed in entirety
• Data Warehouse
– Query throughput is the performance metric
– Hundreds of users
– Managed by subsets
To summarize ...
OLTP Systems are used to “run” a business
The Data Warehouse helps to “optimize” the business
Why Now?
• Data is being produced
• ERP provides clean data
• The computing power is available
• The computing power is affordable
• The competitive pressures are strong
• Commercial products are available
Myths surrounding OLAP Servers and Data Marts
• Data marts and OLAP servers are departmental solutions supporting a handful of users
• Million-dollar massively parallel hardware is needed to deliver fast response times for complex queries
• OLAP servers require massive and unwieldy indices
• Complex OLAP queries clog the network with data
• Data warehouses must be at least 100 GB to be effective
» Source -- Arbor Software Home Page
Advantages Of Data Warehouse
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security
Data Warehouse: Pain Beware
• Data Integration
• Data Quality
• Data Availability
• End user education
• Proper sizing – HW and Database environment
• Lack of off-the-shelf product (mature Packaged Analytics)
• Post-implementation challenges
– Ensure usage
– Identify new areas of intelligence
– Measure business benefits: productivity enhancements, savings, increase in revenue, etc.
Review RDBMS & SQL statement
Relational DBMS: Properties
• Each relation (or table) in a database has a unique name
• An entry at the intersection of each row and column is atomic (or single-valued);
– there can be no multi-valued attributes in a relation
• Each row is unique;
– no two rows in a relation are identical
• Each attribute (or column) within a table has a unique name
• The sequence of columns (left to right) is insignificant;
– the columns of a relation can be interchanged without changing the meaning or use of the relation
• The sequence of rows (top to bottom) is insignificant;
– rows of a relation may be interchanged or stored in any sequence
The Relational Model...
• The relational model of data has three major components:
– Relational database objects allow you to define data structures
– Relational operators allow manipulation of stored data
– Relational integrity constraints allow you to define business rules and ensure data integrity
The Relational Objects
• Location
– Most RDBMS can have multiple locations, all managed by the same database engine
[Diagram: a Corporate Database containing Accounting (Accounts Receivable, Accounts Payable), Purchasing, and Marketing (Sales, Advertising)]
The Relational Objects
• Location
[Diagram: multiple users, each running a client application, connected to one database server (DB)]
The Relational Objects...
• Database
– A set of SQL objects
[Diagram: a client application sends UPDATE T SET ..., INSERT INTO T ..., DELETE FROM T ... and CALL STPROG to the database server, which holds a stored procedure (BEGIN ...), Table A, Table B, and Table T with its update, insert and delete triggers (each BEGIN ...)]
The Relational Objects...
• Database
– A collection of tables and associated indexes
[Diagram: tables Department, Product, Customer and Employee, together with an index, stored in files]
The Relational Objects...
• Relation
– A named, two-dimensional table of data
• Database
– A collection of databases, tables and related objects organised in a structured fashion.
– Several database vendors use schema interchangeably with database
Relational Objects...
Data is presented to the user as tables:
Tables are comprised of rows and a fixed number of named columns.
[Diagram: a table with four named columns (Column 1 to Column 4) and three rows]
Relational Objects...
Data is presented to the user as tables:
Columns are attributes describing an entity. Each column must have a unique name and a data type.
[Diagram: table Employee with columns Name, Designation and Department, and three rows]
Structure of a relation (e.g. Employee):
Employee(Name, Designation, Department)
Relational Objects...
Data is presented to the user as tables:
Rows are records that present information about a particular entity occurrence
Employee
Name      Designation  Department
De Silva  Manager      Personnel
Perera    Secretary    Personnel
Dias      Manager      Sales
Relational Objects: Keys
• Key constraints
– If a relation has more than one key, they are called candidate keys
– One of them is chosen as the primary key
Primary Key: An attribute (or combination of attributes) that uniquely identifies each row in a relation.
Employee(Emp_No, Emp_Name, Department)
Composite Key: A primary key that consists of more than one attribute
Salary(Emp_No, Eff_Date, Amount)
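As a sketch, the two relations above can be declared with a single-column primary key and a composite primary key. The column types and the sample rows are assumptions, the choice of Emp_No and (Emp_No, Eff_Date) as the keys is an assumed reading of the two schemas, and SQLite via Python's sqlite3 module is used only as a convenient engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Single-attribute primary key.
conn.execute(
    "CREATE TABLE Employee (Emp_No INTEGER PRIMARY KEY, Emp_Name TEXT, Department INTEGER)"
)
# Composite primary key: one salary row per employee per effective date.
conn.execute(
    "CREATE TABLE Salary (Emp_No INTEGER, Eff_Date TEXT, Amount INTEGER,"
    " PRIMARY KEY (Emp_No, Eff_Date))"
)


def try_insert(table, row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    placeholders = ", ".join("?" for _ in row)
    try:
        conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", row)
        return True
    except sqlite3.IntegrityError:
        return False


first_employee = try_insert("Employee", (179, "Silva", 7))
duplicate_employee = try_insert("Employee", (179, "Perera", 4))   # same Emp_No
first_salary = try_insert("Salary", (179, "1998-01-01", 8000))
same_emp_new_date = try_insert("Salary", (179, "1997-06-01", 7000))
duplicate_salary = try_insert("Salary", (179, "1998-01-01", 9999))  # same pair
```

The composite key allows several salary rows for employee 179 as long as the effective dates differ, and rejects a repeated (Emp_No, Eff_Date) pair.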
Relational Objects: Keys
Data is presented to the user as tables: Each table has a primary key. The primary key is a column or combination of columns that uniquely identifies each row of the table.
Employee (primary key: E-No)
E-No  E-Name  D-No
179   Silva   7
857   Perera  4
342   Dias    7
Salary (primary key: E-No, Eff-Date)
E-No  Eff-Date  Amt
179   1/1/98    8000
857   3/7/94    9000
179   1/6/97    7000
342   28/1/97   7500
Relational Objects: Relationship
Foreign Key: An attribute in a relation of a database that serves as the primary key of another relation in the same database
Employee(Emp_No, Emp_Name, Department)
Department(Dept_No, Dept_Name, M_No)
=== works for ==>
Relational Objects: Foreign Key
A foreign key is a set of columns in one table that serves as the primary key in another table
Data is presented to the user as tables:
Department (primary key: D-No)
D-No  D-Name   M-No
4     Finance  857
7     Sales    179
Employee (primary key: E-No; foreign key: D-No)
E-No  E-Name  D-No
179   Silva   7
857   Perera  4
342   Dias    7
Rows in one or more tables are associated with each other solely through data values in columns (no pointers).
SQL
• A relational database language
– It is not a programming language but a comprehensive database sub-language for controlling and interacting with a database management system.
• NOT a DBMS
• A powerful data manipulation language
– It has capabilities for: insertion, update, deletion, query, protection
SQL (Cont’d)
• Also designed for end users
• Non-procedural
– We have to state ‘what is needed’, not ‘how’ to obtain it as in ‘relational algebra’
– It is more similar to ‘relational calculus’
• Used in two ways:
– Interactive
– Programmatic: Dynamic / Embedded
Role of SQL
• A database programming language
• A database administration language
• A client/server language
• A distributed database language
[Diagram: SQL is the interface to the relational DBMS, which holds the system catalog and the user tables]
Role of SQL
• It is vendor independent.
– If a user is dissatisfied with a particular DBMS, he can switch products easily without much overhead, as both follow the same language standard.
• Client applications relatively portable.
• Programmer skills are portable.
• Supports many different client processes -- end-users, applications, developers, etc.
• Database servers use SQL to request services from each other.
SQL Basics: DDL
• CREATE TABLE    Adds a new table
• DROP TABLE      Removes existing tables
• ALTER TABLE     Modifies the structure of tables
• CREATE VIEW     Adds a new view
• DROP VIEW       Removes a view
• CREATE INDEX    Builds an index for a column
• DROP INDEX      Removes an index
• CREATE SYNONYM  Defines an alias for a database object
• DROP SYNONYM    Removes an alias
• COMMENTS        Describes a table or column
• LABEL           Defines a title for a table or column
Data Definition Language (DDL)
DDL defines the database: Physical Design
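A short sketch of several of these DDL statements in action. It assumes SQLite through Python's sqlite3 module, which supports CREATE/ALTER/DROP for tables, views and indexes but not SYNONYM, COMMENT or LABEL; the object names are hypothetical.

```python
import sqlite3

DDL_STATEMENTS = [
    "CREATE TABLE Employee (Emp_No INTEGER PRIMARY KEY, Emp_Name TEXT)",
    "ALTER TABLE Employee ADD COLUMN Department TEXT",
    "CREATE INDEX idx_employee_name ON Employee (Emp_Name)",
    "CREATE VIEW SalesStaff AS"
    " SELECT Emp_No, Emp_Name FROM Employee WHERE Department = 'Sales'",
]


def schema_objects(conn):
    """List (type, name) for every object in SQLite's catalog table."""
    return conn.execute(
        "SELECT type, name FROM sqlite_master ORDER BY type, name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
for statement in DDL_STATEMENTS:
    conn.execute(statement)
created = schema_objects(conn)

# The matching DROP statements remove the objects again.
conn.execute("DROP VIEW SalesStaff")
conn.execute("DROP INDEX idx_employee_name")
conn.execute("DROP TABLE Employee")
remaining = schema_objects(conn)
```

Here sqlite_master plays the role of the system catalog: DDL changes what is recorded there, not the rows stored in user tables.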
SQL Basics: DML
• SELECT Retrieves data
• INSERT Adds new rows of data
• DELETE Removes rows of data
• UPDATE Modifies existing data
Data Manipulation Language (DML)
SQL Basics: DCL
• GRANT     Gives user access privileges
• REVOKE    Removes privileges
• COMMIT    Ends current transaction
• ROLLBACK  Aborts current transaction
Data Control Language (DCL)
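COMMIT and ROLLBACK can be sketched with a small transfer between two accounts. The Account table and its balances are hypothetical. SQLite has no GRANT or REVOKE, so only the transaction commands are shown; Python's sqlite3 module issues them through conn.commit() and conn.rollback().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (Acc_No TEXT PRIMARY KEY, Balance INTEGER)")
conn.execute("INSERT INTO Account VALUES ('A1', 500)")
conn.execute("INSERT INTO Account VALUES ('A2', 100)")
conn.commit()


def balances():
    return dict(conn.execute("SELECT Acc_No, Balance FROM Account"))


# Half of a transfer is applied, then the transaction is aborted.
conn.execute("UPDATE Account SET Balance = Balance - 200 WHERE Acc_No = 'A1'")
conn.rollback()  # ROLLBACK: the partial change is undone
after_rollback = balances()

# Both halves are applied, then the transaction is ended normally.
conn.execute("UPDATE Account SET Balance = Balance - 200 WHERE Acc_No = 'A1'")
conn.execute("UPDATE Account SET Balance = Balance + 200 WHERE Acc_No = 'A2'")
conn.commit()  # COMMIT: both changes become permanent together
after_commit = balances()
```

The rollback leaves the balances exactly as they were, so a half-finished transfer never becomes visible.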
SQL Basics: Data Integrity
• The value of stored data can be lost in many ways:
– Invalid data added to the database
– Existing data modified to an incorrect value
– Changes lost due to system error or power failure
– Changes partially applied
• Types of integrity constraints:
– Required Data (NOT NULL)
– Validity Checking (CHECK)
– Entity Integrity (PRIMARY KEY & NOT NULL)
– Referential Integrity (FOREIGN KEY)
– Business Rules (ASSERTION, TRIGGER)
– Consistency (CASCADE, RESTRICT, SET NULL)
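The first three constraint types can be sketched as column declarations. The Employee table, its Salary rule and the sample rows are illustrative assumptions, run on SQLite via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employee (
        Emp_No   INTEGER PRIMARY KEY,        -- entity integrity
        Emp_Name TEXT NOT NULL,              -- required data
        Salary   INTEGER CHECK (Salary > 0)  -- validity checking
    )
    """
)

INSERT = "INSERT INTO Employee (Emp_No, Emp_Name, Salary) VALUES (?, ?, ?)"


def accepted(params):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(INSERT, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    accepted((179, "Silva", 8000)),   # valid row
    accepted((857, None, 9000)),      # violates NOT NULL
    accepted((342, "Dias", -1)),      # violates CHECK
    accepted((179, "Perera", 7500)),  # violates PRIMARY KEY (duplicate Emp_No)
]
```

Each invalid row is rejected by the DBMS itself, so the rule holds no matter which application performs the insert.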
SQL Basics: NULL values
• Null values provide a systematic way of handling missing or inapplicable data in SQL.
• It is inevitable that in the real world some data are missing, not yet known, or do not apply.
• A null value is not a real data value.
• Special Handling
– Null values require special handling by SQL and the DBMS. Null values can be handled inconsistently by various SQL products
– Example: How do we handle null values in summaries like SUM, AVERAGE, etc.?
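One answer to the question above, as a sketch: in standard SQL the aggregate functions skip null values, and COUNT(*) counts rows while COUNT(column) counts non-null values. The Bonus table and its rows are hypothetical; the behaviour shown is that of SQLite via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bonus (Emp_No INTEGER PRIMARY KEY, Amount INTEGER)")
conn.executemany(
    "INSERT INTO Bonus VALUES (?, ?)",
    [(1, 100), (2, None), (3, 200)],  # employee 2 has no known bonus
)

# COUNT(*) counts rows; COUNT(Amount), SUM and AVG ignore the null.
summary = conn.execute(
    "SELECT COUNT(*), COUNT(Amount), SUM(Amount), AVG(Amount) FROM Bonus"
).fetchone()

# When every input is null, SUM returns null rather than 0.
sum_of_nulls = conn.execute(
    "SELECT SUM(Amount) FROM Bonus WHERE Amount IS NULL"
).fetchone()[0]
```

Note that the average is taken over the two known amounts, not over all three rows, so a null is not the same as a zero.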
Referential Integrity
Referential integrity constraints define the rules for associating rows with each other, i.e. columns which reference columns in other tables:
Every non-null value in a foreign key must have a corresponding value in the primary key which it references.
Department (Parent Table), Dept-No values: D1, D3, D2, D7
Employee (Dependent Table), columns Emp-No and Dept-No; Dept-No values: D7, ?, D1, D3, ?, D7 (? = null)
INSERT ROW / UPDATE COLUMN with Dept-No value D2
A row can be inserted or a column updated in the dependent table only if (1) there is a corresponding primary key value in the parent table, or (2) the foreign key value is set null.
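The insert and update rule can be tried out directly. This sketch reuses the department numbers from the slide; the column types and employee numbers are assumptions. It runs on SQLite via Python's sqlite3 module, where foreign keys are enforced only after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE Department (Dept_No TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE Employee ("
    " Emp_No INTEGER PRIMARY KEY,"
    " Dept_No TEXT REFERENCES Department (Dept_No))"
)
conn.executemany(
    "INSERT INTO Department VALUES (?)", [("D1",), ("D3",), ("D2",), ("D7",)]
)


def allowed(sql, params):
    """Return True if the statement succeeds, False on an integrity violation."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


insert_known = allowed("INSERT INTO Employee VALUES (?, ?)", (1, "D7"))
insert_null = allowed("INSERT INTO Employee VALUES (?, ?)", (2, None))
insert_unknown = allowed("INSERT INTO Employee VALUES (?, ?)", (3, "D9"))
update_known = allowed("UPDATE Employee SET Dept_No = ? WHERE Emp_No = ?", ("D2", 2))
update_unknown = allowed("UPDATE Employee SET Dept_No = ? WHERE Emp_No = ?", ("D9", 1))
```

Values that exist in the parent table, or null, are accepted; the non-existent department D9 is rejected on both insert and update.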
Referential Integrity
Deleting parent rows
Department (Parent Table), Dept-No values: D1, D3, D2, D7
Employee (Dependent Table), columns Emp-No and Dept-No; Dept-No values: D7, ?, D1, D3, ?, D7 (? = null)
DELETE ROW with Dept-No value D2: CASCADE, RESTRICT or SET NULL
Database designers must explicitly declare the effect of a delete from the parent table on the dependent table:
CASCADE deletes associated dependent rows.
RESTRICT will not allow a delete from the parent table if there are associated dependent rows.
SET NULL sets the value of associated dependent columns to null values.
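The three delete rules can be compared side by side. In this sketch the same parent row is deleted under each rule; the tables and sample rows are hypothetical, and SQLite via Python's sqlite3 module is assumed, with foreign key enforcement switched on.

```python
import sqlite3


def delete_parent(rule):
    """Delete department D7 under the given ON DELETE rule and report the outcome."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE Department (Dept_No TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE Employee ("
        " Emp_No INTEGER PRIMARY KEY,"
        f" Dept_No TEXT REFERENCES Department (Dept_No) ON DELETE {rule})"
    )
    conn.executemany("INSERT INTO Department VALUES (?)", [("D1",), ("D7",)])
    conn.executemany(
        "INSERT INTO Employee VALUES (?, ?)", [(1, "D7"), (2, "D1"), (3, "D7")]
    )
    try:
        conn.execute("DELETE FROM Department WHERE Dept_No = 'D7'")
        deleted = True
    except sqlite3.IntegrityError:
        deleted = False
    employees = conn.execute(
        "SELECT Emp_No, Dept_No FROM Employee ORDER BY Emp_No"
    ).fetchall()
    conn.close()
    return deleted, employees


cascade_result = delete_parent("CASCADE")
restrict_result = delete_parent("RESTRICT")
set_null_result = delete_parent("SET NULL")
```

CASCADE removes employees 1 and 3 along with their department, RESTRICT refuses the delete, and SET NULL keeps both employees but clears their Dept_No.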
SQL for Data Manipulation
• Manipulation
– SQL allows a user or an application program to update the database by adding new data, removing old data, and modifying previously stored data.
• Retrieval
– SQL allows a user or an application program to retrieve stored data from the database and use it.
• Most Commonly Used Commands
– SELECT, INSERT
– UPDATE, DELETE
SQL for Data Manipulation
• High-level Language for data manipulation
• It does not require predefined navigation path
• It does not require knowledge of any key items
• It is a uniform language for end-users and programmers
• It operates on one or more tables based on set theory, not on a record at a time.
Command: SELECT
• Function:
– Retrieves data from one or more tables. Every SELECT statement produces a table of query results containing one or more columns and zero or more rows.
SELECT [ALL | DISTINCT] select-item, ...
FROM table-specification, ...
[WHERE search-condition]
[GROUP BY group-column, ...]
[HAVING search-condition]
[ORDER BY sort-specification, ...]
Project Selected Columns
The "Persons" table:
P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger
SELECT LastName, FirstName FROM Persons
LastName   FirstName
Hansen     Ola
Svendson   Tove
Pettersen  Kari
SELECT P_Id, LastName, FirstName FROM Persons ORDER BY LastName
P_Id  LastName   FirstName
1     Hansen     Ola
4     Nilsen     Tom
3     Pettersen  Kari
2     Svendson   Tove
(The second result includes a fourth person, Tom Nilsen, who is not listed in the table above.)
Restrict Rows
The "Persons" table:
P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger
SELECT * FROM Persons WHERE City = 'Sandnes'
P_Id  LastName  FirstName  Address       City
1     Hansen    Ola        Timoteivn 10  Sandnes
2     Svendson  Tove       Borgvn 23     Sandnes
Equal Join
The "Persons" table:
P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger
The "Orders" table:
O_Id  OrderNo  P_Id
1     77895    3
2     44678    3
3     22456    1
4     24562    1
5     34764    15
SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons, Orders
WHERE Persons.P_Id = Orders.P_Id
ORDER BY Persons.LastName
LastName   FirstName  OrderNo
Hansen     Ola        22456
Hansen     Ola        24562
Pettersen  Kari       77895
Pettersen  Kari       44678
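The equal join above can be run as-is, and the same result can be written with the explicit INNER JOIN ... ON syntax. This sketch loads the slide's rows (Address and City are omitted for brevity) into SQLite via Python's sqlite3 module; OrderNo is added to the ORDER BY only to make the row order fully determined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Persons (P_Id INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT);
    CREATE TABLE Orders  (O_Id INTEGER PRIMARY KEY, OrderNo INTEGER, P_Id INTEGER);
    INSERT INTO Persons VALUES
        (1, 'Hansen', 'Ola'), (2, 'Svendson', 'Tove'), (3, 'Pettersen', 'Kari');
    INSERT INTO Orders VALUES
        (1, 77895, 3), (2, 44678, 3), (3, 22456, 1), (4, 24562, 1), (5, 34764, 15);
    """
)

# Join condition in the WHERE clause, as on the slide.
WHERE_STYLE = """
    SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
    FROM Persons, Orders
    WHERE Persons.P_Id = Orders.P_Id
    ORDER BY Persons.LastName, Orders.OrderNo
"""

# Equivalent query using explicit join syntax.
JOIN_STYLE = """
    SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
    FROM Persons
    INNER JOIN Orders ON Persons.P_Id = Orders.P_Id
    ORDER BY Persons.LastName, Orders.OrderNo
"""

where_rows = conn.execute(WHERE_STYLE).fetchall()
join_rows = conn.execute(JOIN_STYLE).fetchall()
```

Order 34764 refers to P_Id 15, which matches no person, and Svendson has no orders, so neither appears in the result.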
SQL Data Retrieval
Basic Search Conditions:
• Comparison
– Equal to: =
– Not equal to: != or <> or ^=
– Less than: <
– Less than or equal to: <=
– Greater than: >
– Greater than or equal to: >=
SQL Data Retrieval
Basic Search Conditions:
• Range ([NOT] BETWEEN)
– expres-1 [NOT] BETWEEN expres-2 AND expres-3
– Example: WEIGHT BETWEEN 50 AND 60
• Set Membership ([NOT] IN)
– Example 1: WHERE Emp_No IN ('E1', 'E2', 'E3')
– Example 2: WHERE Emp_No IN (SELECT Emp_No FROM Employee WHERE Dept_No = '7')
SQL Data Retrieval
Basic Search Conditions:
• Pattern Matching ([NOT] LIKE)
– expres-1 [NOT] LIKE {special-register | host-variable | string-constant}
– Example: WHERE Proj_Name LIKE 'INFORM%'
• Null Value (IS [NOT] NULL)
– Example: WHERE Proj_Name IS NOT NULL
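A small sketch of how LIKE and IS [NOT] NULL behave on the same column. The Project table and its rows are hypothetical, and the engine is SQLite via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Project (Proj_No INTEGER PRIMARY KEY, Proj_Name TEXT)")
conn.executemany(
    "INSERT INTO Project VALUES (?, ?)",
    [
        (1, "INFORMATION SYSTEM"),
        (2, "INFORMIX UPGRADE"),
        (3, "PAYROLL"),
        (4, None),  # project not yet named
    ],
)


def proj_nos(condition):
    """Return the project numbers that satisfy a fixed search condition."""
    query = f"SELECT Proj_No FROM Project WHERE {condition} ORDER BY Proj_No"
    return [row[0] for row in conn.execute(query)]


like_rows = proj_nos("Proj_Name LIKE 'INFORM%'")
not_like_rows = proj_nos("Proj_Name NOT LIKE 'INFORM%'")
null_rows = proj_nos("Proj_Name IS NULL")
not_null_rows = proj_nos("Proj_Name IS NOT NULL")
```

The unnamed project matches neither LIKE nor NOT LIKE, because any comparison with null is unknown; only IS NULL finds it.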
SQL Data Retrieval
Compound Search Conditions:
• AND, OR and NOT
– Example: WHERE Proj_Name LIKE 'INFORM%' AND Emp_Name = 'DIAS'
SQL Query Features
• Summary Queries
– Summarize data from the database. In general, summary queries use SQL functions to collapse a column of data values into a single value that summarizes the column. (AVG, MIN, MAX, SUM, COUNT..)
• Sub-Queries
– Use the results of one query to help define another query
Summarising Data
The "Orders" table:
O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen
SELECT COUNT(Customer) AS CustomerNilsen FROM Orders WHERE Customer = 'Nilsen'
CustomerNilsen
2
SELECT AVG(OrderPrice) AS OrderAverage FROM Orders
OrderAverage
950
GROUP BY
The rows produced by the preceding clauses are grouped using the GROUP BY clause.
The "Orders" table:
O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen
SELECT Customer, SUM(OrderPrice) FROM Orders GROUP BY Customer
Customer  SUM(OrderPrice)
Hansen    2000
Nilsen    1700
Jensen    2000
HAVING
Used to select groups that meet specified conditions. Normally used with the GROUP BY clause.
The "Orders" table:
O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen
SELECT Customer, SUM(OrderPrice) FROM Orders GROUP BY Customer HAVING SUM(OrderPrice) < 2000
Customer  SUM(OrderPrice)
Nilsen    1700
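The difference between grouping and filtering groups can be checked by running both queries on the slide's Orders rows. This sketch uses SQLite via Python's sqlite3 module; the ORDER BY Customer is an addition so that the row order is determined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders ("
    " O_Id INTEGER PRIMARY KEY, OrderDate TEXT, OrderPrice INTEGER, Customer TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, "2008/11/12", 1000, "Hansen"),
        (2, "2008/10/23", 1600, "Nilsen"),
        (3, "2008/09/02", 700, "Hansen"),
        (4, "2008/09/03", 300, "Hansen"),
        (5, "2008/08/30", 2000, "Jensen"),
        (6, "2008/10/04", 100, "Nilsen"),
    ],
)

# One result row per customer.
totals = conn.execute(
    "SELECT Customer, SUM(OrderPrice) FROM Orders"
    " GROUP BY Customer ORDER BY Customer"
).fetchall()

# HAVING filters whole groups after the totals have been computed.
small_totals = conn.execute(
    "SELECT Customer, SUM(OrderPrice) FROM Orders"
    " GROUP BY Customer HAVING SUM(OrderPrice) < 2000 ORDER BY Customer"
).fetchall()
```

WHERE could not express this condition, because it is evaluated per row before grouping, whereas HAVING is evaluated per group.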
Nested Queries
A subquery is a SELECT statement nested inside the WHERE clause of another SELECT statement. Its results are needed to solve the main query.
The "Orders" table:
O_Id  OrderDate   OrderPrice  Customer
1     2008/11/12  1000        Hansen
2     2008/10/23  1600        Nilsen
3     2008/09/02  700         Hansen
4     2008/09/03  300         Hansen
5     2008/08/30  2000        Jensen
6     2008/10/04  100         Nilsen
SELECT Customer FROM Orders WHERE OrderPrice > (SELECT AVG(OrderPrice) FROM Orders)
Customer
Hansen
Nilsen
Jensen
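The nested query can be traced in two steps: the inner SELECT yields one value, and the outer SELECT compares each row against it. This sketch runs the slide's query on the slide's rows in SQLite via Python's sqlite3 module; the ORDER BY O_Id is an addition to fix the row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders ("
    " O_Id INTEGER PRIMARY KEY, OrderDate TEXT, OrderPrice INTEGER, Customer TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, "2008/11/12", 1000, "Hansen"),
        (2, "2008/10/23", 1600, "Nilsen"),
        (3, "2008/09/02", 700, "Hansen"),
        (4, "2008/09/03", 300, "Hansen"),
        (5, "2008/08/30", 2000, "Jensen"),
        (6, "2008/10/04", 100, "Nilsen"),
    ],
)

# Step 1: the inner query on its own returns a single value.
average_price = conn.execute("SELECT AVG(OrderPrice) FROM Orders").fetchone()[0]

# Step 2: the outer query keeps the orders priced above that value.
above_average = [
    row[0]
    for row in conn.execute(
        "SELECT Customer FROM Orders"
        " WHERE OrderPrice > (SELECT AVG(OrderPrice) FROM Orders)"
        " ORDER BY O_Id"
    )
]
```

Orders 1, 2 and 5 are above the average of 950, which gives the three customers shown on the slide.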
Case Study
Case Study: Wal*Mart
• Founded by Sam Walton
• One of the largest supermarket chains in the US
• Wal*Mart: 2000+ retail stores
• SAM's Clubs: 100+ wholesale stores
This case study is from Felipe Carino’s (NCR Teradata) presentation made at the Stanford Database Seminar
Old Retail Paradigm
• Wal*Mart
– Inventory Management
– Merchandise Accounts Payable
– Purchasing
– Supplier Promotions: National, Region, Store Level
• Suppliers
– Accept Orders
– Promote Products
– Provide Special Incentives
– Monitor and Track the Incentives
– Bill and Collect Receivables
– Estimate Retailer Demands
New (Just-In-Time) Retail Paradigm
• No more deals
• Shelf-Pass Through (POS Application)
– One Unit Price: suppliers paid once a week on ACTUAL items sold
– Wal*Mart Manager: daily inventory restock; suppliers (sometimes same day) ship to Wal*Mart
• Warehouse-Pass Through
– Stock some large items: delivery may come from supplier
– Distribution Center: supplier’s merchandise unloaded directly onto Wal*Mart trucks
Wal*Mart System
• NCR 5100M, 96 nodes: 24 TB raw disk; 700 - 1000 Pentium CPUs
• Number of rows: > 5 billion
• Historical data: 65 weeks (5 quarters)
• New daily volume: current apps: 75 million; new apps: 100 million+
• Number of users: thousands
• Number of queries: 60,000 per week
References/External Links
(1) Data Warehousing & Data Mining, S. Sudarshan and Krithi Ramamritham, IIT Bombay
(2) Data Warehousing, Hu Yan
(3) What is a Data Warehouse? http://blog.maia-intelligence.com/2008/04/29/what-is-a-data-warehouse/
(4) Database Management Systems (DBMS) http://www.bit.lk/teachingmaterial/IT2302/index.htm
(5) SQL Tutorial http://www.w3schools.com/sql/default.asp
Thank you for your attention!