7/31/2019 Lecture 3 Data Warehouse Structures
Rajeev Tiwari
Lecture 3
Data warehouse structures
-
What is a Data Warehouse?
o Defined in many different ways.
   - A decision support database that is maintained separately from the
     organization's operational database.
   - Supports information processing by providing a solid platform of
     consolidated, historical data for analysis.
o "A data warehouse is a subject-oriented, integrated, time-variant,
  and nonvolatile collection of data in support of management's
  decision-making process." (W. H. Inmon)
o Data warehousing:
   - The process of constructing and using data warehouses.
-
Data Warehouse - Subject Oriented
o Organized around major subjects, such as customer, product,
sales.
o Focused on the modeling and analysis of data for decision makers,
  not on daily operations.
o Provides a simple and concise view around particular subject issues
  by excluding data that are not useful in the decision support
  process.
-
Data Warehouse - Integrated
o Constructed by integrating multiple, heterogeneous data sources:
   - relational databases, flat files, on-line transaction records
o Data cleaning and data integration techniques are applied.
   - Ensure consistency in naming conventions, encoding structures,
     attribute measures, etc. among different data sources.
   - When data is moved to the warehouse, it is converted.
o E.g., sales data may be in a relational database (RDB), while customer information is in flat files.
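The integration step can be illustrated with a minimal sketch. It assumes two invented sources: sales rows as they might come from a relational table, and a semicolon-separated flat file of customers. All field names, the gender encodings, and the data are hypothetical; the point is only that names and encodings are converted to one convention when the data is moved to the warehouse.

```python
import csv
import io

# Hypothetical source 1: rows as they might come from a relational sales table.
SALES_ROWS = [
    {"cust_id": 1, "amount_usd": 120.0},
    {"cust_id": 2, "amount_usd": 80.5},
]

# Hypothetical source 2: a flat file with different naming and encoding conventions.
CUSTOMER_FLAT_FILE = "CustomerNo;Gender\n1;M\n2;female\n"

# Map the encodings found in the sources onto one warehouse convention.
GENDER_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def load_customers(flat_text):
    """Parse the flat file and convert names and encodings to the warehouse convention."""
    reader = csv.DictReader(io.StringIO(flat_text), delimiter=";")
    return {
        int(row["CustomerNo"]): {"gender": GENDER_CODES[row["Gender"].strip().lower()]}
        for row in reader
    }


def integrate(sales_rows, flat_text):
    """Join both sources into consistent warehouse records."""
    customers = load_customers(flat_text)
    return [
        {
            "customer_key": row["cust_id"],
            "gender": customers[row["cust_id"]]["gender"],
            "dollars_sold": row["amount_usd"],
        }
        for row in sales_rows
    ]


warehouse_rows = integrate(SALES_ROWS, CUSTOMER_FLAT_FILE)
print(warehouse_rows)
```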
-
Data Warehouse - Time Variant
o The time horizon for the data warehouse is significantly longer than
  that of operational database systems.
   - Operational database: current-value data
   - Data warehouse: provides information from a historical perspective (e.g., the past 5-10 years)
o Every key structure in the data warehouse
   - contains an element of time, explicitly or implicitly,
   - whereas the key of operational data may or may not contain a time element.
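A small sketch of what a time element in the key means in practice. The two dictionaries and the prices are invented for illustration: the operational store is keyed by product only, so an update overwrites the old value, while the warehouse store is keyed by product and date, so history is kept.

```python
from datetime import date

# Operational store: the key is the product alone, so an update overwrites the old value.
operational_price = {}
operational_price["TV"] = 499.0
operational_price["TV"] = 459.0  # the earlier price is lost

# Warehouse store: the key includes a time element, so every snapshot is retained.
warehouse_price = {}
warehouse_price[("TV", date(2019, 1, 1))] = 499.0
warehouse_price[("TV", date(2019, 7, 1))] = 459.0


def price_history(store, product):
    """Return (date, price) pairs for one product, oldest first."""
    return sorted((day, price) for (name, day), price in store.items() if name == product)


print(operational_price["TV"])
print(price_history(warehouse_price, "TV"))
```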
-
Data Warehouse - Nonvolatile
o A physically separate store of data, transformed from the operational
  environment.
o Operational update of data does not occur in the data warehouse
  environment.
   - Does not require transaction processing, recovery, and
     concurrency control mechanisms.
   - Requires only two operations in data accessing:
     initial loading of data and access of data.
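The two operations can be sketched as a tiny class. `NonvolatileStore` and its method names are hypothetical, chosen only to mirror "loading" and "access"; a real warehouse enforces this at the system level, not with a Python wrapper.

```python
from types import MappingProxyType


class NonvolatileStore:
    """Hypothetical store that exposes only two operations: load and access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Bulk-load rows; each row is copied and wrapped in a read-only view."""
        self._rows.extend(MappingProxyType(dict(row)) for row in rows)

    def access(self, predicate=lambda row: True):
        """Return the rows that satisfy the predicate; nothing is modified."""
        return [row for row in self._rows if predicate(row)]


store = NonvolatileStore()
store.load([{"item": "TV", "units_sold": 3}, {"item": "PC", "units_sold": 5}])
tv_rows = store.access(lambda row: row["item"] == "TV")
print(dict(tv_rows[0]))

try:
    tv_rows[0]["units_sold"] = 99  # in-place updates are rejected
except TypeError as error:
    print("update rejected:", error)
```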
-
Heterogeneous Databases
o Consist of a set of interconnected, autonomous databases.
o Objects in one database may differ from objects in other
  databases.
o Information exchange across such databases is difficult.
-
Data Warehouse vs. Heterogeneous DBMS
o Heterogeneous DBMS: a query-driven approach
   - Build wrappers/mediators on top of heterogeneous databases.
   - A meta-dictionary is used to translate the query into queries appropriate for
     individual heterogeneous sites.
   - The results are integrated into a global answer set.
   - This approach involves complex information filtering.
   - Inefficient and potentially expensive.
o Data warehouse: update-driven, high performance
   - Information from heterogeneous sources is integrated in advance and stored
     in warehouses for direct query and analysis.
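The contrast between the two approaches can be sketched in a few lines. The sources, the `META_DICTIONARY` layout, and all function names are invented for illustration. The query-driven function translates and runs the query at every site each time it is asked, while the update-driven path integrates once in advance and answers from the result.

```python
# Two hypothetical autonomous sources with different schemas.
SOURCE_A = [{"product": "TV", "usd": 100.0}, {"product": "PC", "usd": 250.0}]
SOURCE_B = [{"item_name": "TV", "amount": 40.0}]

# Hypothetical meta-dictionary: how each site names the global attributes.
META_DICTIONARY = {
    "A": {"rows": SOURCE_A, "item": "product", "dollars": "usd"},
    "B": {"rows": SOURCE_B, "item": "item_name", "dollars": "amount"},
}


def query_driven_total(item):
    """Translate the query for every site at query time, then merge the answers."""
    total = 0.0
    for site in META_DICTIONARY.values():
        total += sum(r[site["dollars"]] for r in site["rows"] if r[site["item"]] == item)
    return total


def build_warehouse():
    """Integrate all sources once, in advance, into a single uniform structure."""
    warehouse = {}
    for site in META_DICTIONARY.values():
        for r in site["rows"]:
            key = r[site["item"]]
            warehouse[key] = warehouse.get(key, 0.0) + r[site["dollars"]]
    return warehouse


WAREHOUSE = build_warehouse()  # update-driven: done before any query arrives


def update_driven_total(item):
    """Answer directly from the pre-integrated warehouse."""
    return WAREHOUSE.get(item, 0.0)


print(query_driven_total("TV"), update_driven_total("TV"))
```

Both calls return the same total; the difference is where the translation and integration cost is paid.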
-
Operational DBMS
o They consist of tables with a set of attributes and store a large set of tuples.
o They use the Entity-Relationship (ER) data model.
o They are used to store transactional data.
o They contain the most current information.
o Thus they are known as Online Transaction Processing (OLTP) systems.
-
Data Warehouse vs. Operational DBMS
o User and system orientation
   - customer-oriented (operational DBMS) vs. market-oriented (data warehouse)
o Data contents
   - current, detailed vs. historical, consolidated
o Database design
   - ER model, application-oriented vs. star schema, subject-oriented
o View
   - current, local vs. evolutionary, integrated
o Access patterns
   - updates vs. read-only but complex queries
-
OLTP vs. OLAP
                    OLTP (online transaction processing)    OLAP (online analytical processing)
users               clerk, IT professional                  knowledge worker
function            day-to-day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date, detailed,          historical, summarized, multidimensional,
                    flat relational, isolated               integrated, consolidated
usage               repetitive                              ad hoc
access              read/write,                             lots of scans
                    index/hash on primary key
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100MB-GB                                100GB-TB
metric              transaction throughput                  query throughput, response
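The difference in access pattern and unit of work can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are invented for illustration. A single in-memory database is used only to keep the sketch short; it is not meant to suggest that both workloads should share one system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, country TEXT, quarter TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "U.S.A", "1Qtr", 100.0),
        (2, "Canada", "1Qtr", 50.0),
        (3, "U.S.A", "2Qtr", 75.0),
    ],
)

# OLTP-style unit of work: a short read/write transaction located by primary key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 25.0 WHERE order_id = ?", (2,))
oltp_row = conn.execute("SELECT amount FROM orders WHERE order_id = ?", (2,)).fetchone()

# OLAP-style unit of work: a read-only query that scans and summarizes many rows.
olap_rows = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()

print(oltp_row)
print(olap_rows)
```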
-
Why a Separate Data Warehouse?
o High performance for both systems
   - Operational DBMS: tuned for online transaction processing (OLTP).
   - Warehouse: tuned for online analytical processing (OLAP),
     involving complex OLAP queries.
   - Processing OLAP queries in the operational DBMS would degrade the performance of
     operational tasks.
o Decision support requires historical data, which operational
  databases do not typically maintain.
o Decision support requires consolidation of data from
  heterogeneous sources.
o Solution
   - Maintain separate database systems which support special
     primitives and structures suitable to store, access and process
-
Multidimensional Data Model
o A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
o A data cube models n-D data, defined by dimensions and facts.
   - Dimensions: the entities with respect to which an
     organization wants to keep records, such as item (item_name).
   - Facts: the subject of decision-oriented analysis, such as dollars_sold or units_sold.
      - Facts are numerical measures.
      - They are the quantities by which we want to analyze relationships between dimensions.
      - The fact table contains a key to each of the related dimension tables.
o A multidimensional data model is typically organized around a
  central theme, like sales, and is represented by a fact table.
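A minimal sketch of dimensions and facts, using invented rows. Each row carries one value per dimension plus the two measures named above (units_sold, dollars_sold); the hypothetical `aggregate` helper sums the measures grouped by any subset of the dimensions.

```python
from collections import defaultdict

# Each fact row holds one value per dimension plus the numerical measures.
# The rows are invented for illustration.
DIMENSIONS = ("item_name", "quarter", "country")
FACTS = [
    # item_name, quarter, country, units_sold, dollars_sold
    ("TV", "1Qtr", "U.S.A", 10, 5000.0),
    ("TV", "2Qtr", "U.S.A", 12, 6000.0),
    ("PC", "1Qtr", "Canada", 4, 4000.0),
]


def aggregate(facts, by):
    """Sum units_sold and dollars_sold over all facts, grouped by the chosen dimensions."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(lambda: [0, 0.0])
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key][0] += row[3]
        totals[key][1] += row[4]
    return {key: tuple(value) for key, value in totals.items()}


print(aggregate(FACTS, by=("item_name",)))
print(aggregate(FACTS, by=("item_name", "country")))
```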
-
Sales volume as a function of Product, Date, Country
[Figure: data cube with axes Product (TV, VCR, PC), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr) and Country (U.S.A, Canada, Mexico), with a "sum" slice along each axis; one highlighted cell is the total annual sales of TV in U.S.A.]
Dimensions: Product, Location, Time
Hierarchical summarization paths
   Product     Location    Time
   Industry    Region      Year
   Category    Country     Quarter
   Product     City        Month
               Office      Week
                           Day
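Summarizing along a hierarchy can be sketched as follows. The cube cells and their volumes are invented, and `roll_up_quarters` is a hypothetical helper that climbs the Time hierarchy from Quarter to Year by summing over the quarters, which yields a cell such as "total annual sales of TV in U.S.A.".

```python
from collections import defaultdict

# Hypothetical base cells: (product, quarter, country) -> sales volume.
CUBE = {
    ("TV", "1Qtr", "U.S.A"): 100,
    ("TV", "2Qtr", "U.S.A"): 120,
    ("TV", "3Qtr", "U.S.A"): 90,
    ("TV", "4Qtr", "U.S.A"): 150,
    ("TV", "1Qtr", "Canada"): 30,
    ("PC", "1Qtr", "U.S.A"): 70,
}


def roll_up_quarters(cube):
    """Climb the Time hierarchy from Quarter to Year by summing over all quarters."""
    yearly = defaultdict(int)
    for (product, _quarter, country), volume in cube.items():
        yearly[(product, country)] += volume
    return dict(yearly)


annual = roll_up_quarters(CUBE)
print(annual[("TV", "U.S.A")])
```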
-
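Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
   0-D (apex) cuboid: all
   1-D cuboids: time; item; location; supplier
   2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
   3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
   4-D (base) cuboid: time,item,location,supplier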
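The lattice can be enumerated mechanically: each cuboid is one subset of the dimensions, so n dimensions give 2^n cuboids (ignoring concept hierarchies within a dimension). A short sketch using the standard library, with a hypothetical helper `cuboids`:

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids(dimensions):
    """Return a dict mapping k to the list of k-D cuboids (group-by subsets)."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


lattice = cuboids(DIMENSIONS)
for k, group in lattice.items():
    print(f"{k}-D:", [",".join(c) or "all" for c in group])

total = sum(len(group) for group in lattice.values())
print(total)
```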
-
Schemas for Multidimensional Databases
A multidimensional model exists in the form of:
1. Star Schema: A fact table in the middle connected to a set of dimension tables.
[Figure: star schema for sales]
   Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
   time dimension: time_key, day, day_of_the_week, month, quarter, year
   item dimension: item_key, item_name, brand, type, supplier_type
   branch dimension: branch_key, branch_name, branch_type
   location dimension: location_key, street, city, state_or_province, country
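A star schema like the one above can be sketched with Python's built-in `sqlite3` module. The sketch keeps only a subset of the columns, omits the branch dimension for brevity, names the fact table `sales`, and uses invented rows; none of this is prescribed by the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE sales (
        time_key     INTEGER REFERENCES time(time_key),
        item_key     INTEGER REFERENCES item(item_key),
        location_key INTEGER REFERENCES location(location_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
    INSERT INTO time VALUES (1, '1Qtr', 2019), (2, '2Qtr', 2019);
    INSERT INTO item VALUES (1, 'TV', 'BrandA'), (2, 'PC', 'BrandB');
    INSERT INTO location VALUES (1, 'Chicago', 'U.S.A'), (2, 'Toronto', 'Canada');
    INSERT INTO sales VALUES (1, 1, 1, 10, 5000.0), (2, 1, 1, 12, 6000.0), (1, 2, 2, 4, 4000.0);
    """
)

# Star join: the fact table joins directly to each dimension it needs.
star_rows = conn.execute(
    """
    SELECT i.item_name, l.country, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY i.item_name, l.country
    ORDER BY i.item_name
    """
).fetchall()
print(star_rows)
```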
-
The Star Schema
[Figure: star schema with a central Sales_Fact table and five dimension tables]
   Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, Sales Amount, Unit Sales, ...
   Employee_Dim: EmployeeKey, EmployeeID, ...
   Time_Dim: TimeKey, TheDate, ...
   Product_Dim: ProductKey, ProductID, ...
   Customer_Dim: CustomerKey, CustomerID, ...
   Shipper_Dim: ShipperKey, ShipperID, ...
-
2. Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
[Figure: snowflake schema for sales]
   Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
   time dimension: time_key, day, day_of_the_week, month, quarter, year
   item dimension: item_key, item_name, brand, type, supplier_key
   branch dimension: branch_key, branch_name, branch_type
   location dimension: location_key, street, city_key
   city dimension: city_key, city, state_or_province, country
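The effect of normalizing the location dimension can be sketched with `sqlite3`. Only the location and city tables and two measures are kept, the fact table is named `sales`, and all rows are invented. Compared with a star schema, reaching `country` now takes one extra join through `city`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT,
                           state_or_province TEXT, country TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES city(city_key));
    CREATE TABLE sales    (location_key INTEGER REFERENCES location(location_key),
                           units_sold INTEGER, dollars_sold REAL);
    INSERT INTO city VALUES (1, 'Chicago', 'Illinois', 'U.S.A'), (2, 'Toronto', 'Ontario', 'Canada');
    INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Lake Ave', 1), (12, '3 King St', 2);
    INSERT INTO sales VALUES (10, 5, 2500.0), (11, 3, 1500.0), (12, 2, 1000.0);
    """
)

# The normalized hierarchy is followed fact -> location -> city.
snowflake_rows = conn.execute(
    """
    SELECT c.country, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON l.location_key = s.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
print(snowflake_rows)
```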
-
Snowflakes are conglomerations of frozen ice crystals which fall through the Earth's atmosphere. They begin as snow crystals which develop
when microscopic supercooled cloud droplets freeze.
-
3. Fact Constellation: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.
[Figure: fact constellation schema for sales and shipping]
   Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
   Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
   time dimension: time_key, day, day_of_the_week, month, quarter, year
   item dimension: item_key, item_name, brand, type, supplier_type
   branch dimension: branch_key, branch_name, branch_type
   location dimension: location_key, street, city, province_or_state, country
   shipper dimension: shipper_key, shipper_name, location_key, shipper_type
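Sharing a dimension between two fact tables can be sketched with `sqlite3`. Only the item dimension and one measure per fact table are kept, the table names `sales` and `shipping` are assumptions, and the rows are invented. Each fact table is aggregated on its own and the results are combined through the shared dimension, which avoids double counting that a direct join of the two fact tables could cause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE sales    (item_key INTEGER REFERENCES item(item_key), dollars_sold REAL);
    CREATE TABLE shipping (item_key INTEGER REFERENCES item(item_key), dollars_cost REAL);
    INSERT INTO item VALUES (1, 'TV'), (2, 'PC');
    INSERT INTO sales VALUES (1, 5000.0), (1, 6000.0), (2, 4000.0);
    INSERT INTO shipping VALUES (1, 300.0), (2, 150.0), (2, 50.0);
    """
)

# Aggregate each fact table separately, then combine through the shared dimension.
constellation_rows = conn.execute(
    """
    SELECT i.item_name,
           (SELECT SUM(dollars_sold) FROM sales    WHERE item_key = i.item_key),
           (SELECT SUM(dollars_cost) FROM shipping WHERE item_key = i.item_key)
    FROM item AS i
    ORDER BY i.item_name
    """
).fetchall()
print(constellation_rows)
```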
-
THANKS