datawarehouse (3).pptx
TRANSCRIPT
-
8/20/2019 dataWarehouse (3).pptx
1/68
An Introduction to Data Warehousing
-
Data, Data everywhere yet ...
• I can't find the data I need
  – data is scattered over the network
  – many versions, subtle differences
• I can't get the data I need
  – need an expert to get the data
• I can't understand the data I found
  – available data poorly documented
• I can't use the data I found
  – results are unexpected
  – data needs to be transformed from one form to other
-
So What Is a Data Warehouse?
Definition: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]
• By comparison: an OLTP (online transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.
• OLTP systems are usually designed independently of each other and it is difficult for them to share information.
-
Why Do We Need Data Warehouses?
• Consolidation of information resources
• Improved query performance
• Separate research and decision support functions from the operational systems
• Foundation for data mining, data visualization, advanced reporting and OLAP tools
-
Why Data Warehousing?
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?
-
What Is a Data Warehouse Used for?
• Knowledge discovery
  – Making consolidated reports
  – Finding relationships and correlations
  – Data mining
  – Examples
    • Banks identifying credit risks
    • Insurance companies searching for fraud
    • Medical research
-
Comparison Chart of Database Types
Data warehouse | Operational system
Subject oriented | Transaction oriented
Large (hundreds of GB up to several TB) | Small (MB up to several GB)
Historic data | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates | Continuous updates
Usually very complex queries | Simple to complex queries
-
Design Differences
Operational System: ER Diagram
Data Warehouse: Star Schema
-
Supporting a Complete Solution
Operational System - Data Entry
Data Warehouse - Data Retrieval
-
Data Warehouses, Data Marts, and Operational Data Stores
• Data Warehouse: The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.
• Data Mart: A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.
• Operational Data Store (ODS): A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated.
-
Decision Support
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business and make judgements
-
What are the users saying...
• Data should be integrated across the enterprise
• Summary data had a real value to the organization
• Historical data held the key to understanding data over time
• What-if capabilities are required
-
Data Warehousing: It is a process
• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
• A decision support database maintained separately from the organization's operational database
-
Data Warehouse Architecture
[Figure: Relational Databases, Legacy Data and Purchased Data pass through Extraction, Cleansing and an Optimized Loader into the Data Warehouse Engine, which supports Analyze and Query; a Metadata Repository accompanies the engine.]
-
From the Data Warehouse to Data Marts
[Figure: Data Warehouse (Organizationally Structured), then Departmentally Structured and Individually Structured data marts. Axis labels: Less / More; History, Normalized, Detailed; Data / Information.]
-
Users have different views of Data
[Figure labels: Organizationally structured; OLAP]
• Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
• Farmers: Harvest information from known access paths
• Tourists: Browse information harvested by farmers
-
Wal*Mart Case Study
• Founded by Sam Walton
• One of the largest Super Market Chains in the US
• Wal*Mart: 2000 Retail Stores
• SAM's Clubs: 100 Wholesalers Stores
• This case study is from Felipe Carino's (NCR Teradata) presentation made at Stanford Database Seminar
-
Old Retail Paradigm
• Wal*Mart
  – Inventory Management
  – Merchandise Accounts Payable
  – Purchasing
  – Supplier Promotions: National, Region, Store Level
• Suppliers
  – Accept Orders
  – Promote Products
  – Provide special Incentives
  – Monitor and Track The Incentives
  – Bill and Collect Receivables
  – Estimate Retailer Demands
-
New (Just-In-Time) Retail Paradigm
• No more deals
• Shelf-Pass Through (POS Application)
  – One Unit Price
    • Suppliers paid once a week on ACTUAL items sold
  – Wal*Mart Manager
    • Daily Inventory Restock
    • Suppliers (sometimes Same Day) ship to Wal*Mart
• Warehouse-Pass Through
  – Stock some Large Items
    • Delivery may come from supplier
  – Distribution Center
    • Supplier's merchandise unloaded directly onto Wal*Mart Trucks
-
Information as a Strategic Weapon
• Daily Summary of all Sales Information
• Regional Analysis of all Stores in a logical area
• Specific Product Sales
• Specific Supplies Sales
• Trend Analysis, etc.
• Wal*Mart uses information when negotiating with
  – Suppliers
  – Advertisers etc.
-
Schema Design
• Database organization
  – must look like business
  – must be recognizable by business user
  – approachable by business user
  – Must be simple
• Schema Types
  – Star Schema
  – Fact Constellation Schema
  – Snowflake schema
-
Star Schema
• A single fact table and for each dimension one dimension table
• Does not capture hierarchies directly
[Figure: central fact table (date, custno, prodno, cityname, sales) linked to the dimension tables time, prod, cust and city.]
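The star schema on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration: the fact table columns (date, custno, prodno, cityname, sales) come from the slide, while the `dim_` table names, the extra dimension attributes (month, year, category, state) and all sample rows are assumptions made up for the example.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
-- One dimension table per dimension (descriptive attributes are assumed).
CREATE TABLE dim_time (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_cust (custno INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_prod (prodno INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_city (cityname TEXT PRIMARY KEY, state TEXT);
-- A single fact table: foreign keys to every dimension plus the measure.
CREATE TABLE fact (
    date     TEXT    REFERENCES dim_time(date),
    custno   INTEGER REFERENCES dim_cust(custno),
    prodno   INTEGER REFERENCES dim_prod(prodno),
    cityname TEXT    REFERENCES dim_city(cityname),
    sales    REAL
);
INSERT INTO dim_time VALUES ('2019-01-05', '2019-01', 2019), ('2019-02-10', '2019-02', 2019);
INSERT INTO dim_cust VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO dim_prod VALUES (1, 'Cola', 'Beverage'), (2, 'Soap', 'Toiletry');
INSERT INTO dim_city VALUES ('Pune', 'Maharashtra'), ('Mumbai', 'Maharashtra'),
                            ('Hyderabad', 'Andhra Pradesh');
INSERT INTO fact VALUES ('2019-01-05', 1, 1, 'Pune', 100.0),
                        ('2019-01-05', 2, 2, 'Mumbai', 50.0),
                        ('2019-02-10', 1, 1, 'Hyderabad', 70.0),
                        ('2019-02-10', 2, 1, 'Mumbai', 30.0);
""")

# Facts are reached through the dimensions: join, then group by
# descriptive attributes that live in the dimension tables.
sales_by_category_state = star.execute("""
    SELECT p.category, c.state, SUM(f.sales)
    FROM fact f
    JOIN dim_prod p ON f.prodno = p.prodno
    JOIN dim_city c ON f.cityname = c.cityname
    GROUP BY p.category, c.state
    ORDER BY p.category, c.state
""").fetchall()
print(sales_by_category_state)
```

Note that the state attribute sits directly in the city dimension; the city-to-state hierarchy is not modeled as a separate table, which is what "does not capture hierarchies directly" refers to.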
-
Dimension Tables
• Dimension tables
  – Define business in terms already familiar to users
  – Wide rows with lots of descriptive text
  – Small tables (about a million rows)
  – Joined to fact table by a foreign key
  – heavily indexed
  – typical dimensions
    • time periods, geographic region (markets, cities), products, customers, salesperson, etc.
-
Fact Table
• Central table
  – Typical example: individual sales records
  – mostly raw numeric items
  – narrow rows, a few columns at most
  – large number of rows (millions to a billion)
  – Access via dimensions
-
Snowflake schema
• Represent dimensional hierarchy directly by normalizing tables.
• Easy to maintain and saves storage
[Figure: fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city; city is further linked to a region table.]
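As a rough sketch of the snowflake idea, the city dimension below is normalized so that the region level of the hierarchy lives in its own table. The city and region tables follow the slide's figure; the column names `region_id` and `region_name` and the sample rows are assumptions for illustration.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- The hierarchy city -> region is represented directly by a separate table.
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE city (
    cityname  TEXT PRIMARY KEY,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE fact (
    date TEXT, custno INTEGER, prodno INTEGER,
    cityname TEXT REFERENCES city(cityname),
    sales REAL
);
INSERT INTO region VALUES (1, 'West'), (2, 'South');
INSERT INTO city VALUES ('Mumbai', 1), ('Pune', 1), ('Hyderabad', 2);
INSERT INTO fact VALUES ('2019-01-05', 1, 1, 'Mumbai', 100.0),
                        ('2019-01-05', 2, 2, 'Pune', 40.0),
                        ('2019-02-10', 1, 1, 'Hyderabad', 70.0);
""")

# Summarizing by region now needs one extra join along the hierarchy.
sales_by_region = snow.execute("""
    SELECT r.region_name, SUM(f.sales)
    FROM fact f
    JOIN city c   ON f.cityname = c.cityname
    JOIN region r ON c.region_id = r.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(sales_by_region)
```

The region name is stored once per region rather than once per city, which is the storage and maintenance saving the slide mentions; the cost is the additional join.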
-
Fact Constellation
• Fact Constellation
  – Multiple fact tables that share many dimension tables
  – Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact tables Booking and Checkout share the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer.]
-
Data
-
Levels of
-
Data Integration Across Sources
[Figure: source systems Trust, Credit card, Savings, Loans]
• Same data, different name
• Different data, same name
• Data found here, nowhere else
• Different keys, same data
-
Data Transformation
• Data transformation is the foundation for achieving single version of the truth
• Major concern for IT
• Data warehouse can fail if appropriate data transformation strategy is not developed
[Figure: Source Data (Sequential, Legacy, Relational, External, Operational) feeds Data Transformation: Accessing, Capturing, Extracting, Householding, Filtering, Reconciling, Conditioning, Loading, Validating, Scoring.]
-
Data Transformation Example
• encoding
  – appl A - m, f
  – appl B - 1, 0
  – appl C - x, y
  – appl D - male, female
• unit
  – appl A - pipeline - cm
  – appl B - pipeline - in
  – appl C - pipeline - feet
  – appl D - pipeline - yds
• field
  – appl A - balance
  – appl B - bal
  – appl C - currbal
  – appl D - balcurr
All four applications feed the Data Warehouse.
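The three kinds of mismatch in this example (encoding, unit, field name) could be reconciled along the following lines. This is a minimal sketch: the application labels, codes, units and field names come from the slide, but the target representation ("M"/"F", centimetres, a field called `balance`) and the assumption that the first code of each pair means male are illustrative choices, not something the slide specifies.

```python
# Per-application gender codes mapped to one warehouse encoding.
# Assumption: the first code listed on the slide denotes male.
GENDER_CODES = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}
# Unit each application uses for the pipeline measurement.
PIPELINE_UNIT = {"A": "cm", "B": "in", "C": "feet", "D": "yds"}
CM_PER_UNIT = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}
# Name each application gives to the balance field.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse(app, record):
    """Convert one source record into the common warehouse form."""
    unit = PIPELINE_UNIT[app]
    return {
        "gender": GENDER_CODES[app][str(record["gender"])],
        "pipeline_cm": round(record["pipeline"] * CM_PER_UNIT[unit], 2),
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = to_warehouse("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = to_warehouse("D", {"gender": "female", "pipeline": 2, "balcurr": 99.5})
print(row_b)
print(row_d)
```

Keeping the per-application rules in lookup tables rather than in code branches makes it easier to add a fifth source system later.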
-
Data Integrity Problems
• Same person, different spellings
  – Agarwal, Agrawal, Aggarwal etc...
• Multiple ways to denote company name
  – Persistent Systems, PSPL, Persistent Pvt. LTD.
• Use of different names
  – mumbai, bombay
• Different account numbers generated by different applications for the same customer
• Required fields left blank
• Invalid product codes collected at point of sale
  – manual entry leads to mistakes
  – "in case of a problem use 9999999"
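A few of these problems can be caught with simple checks before loading. The sketch below is illustrative only: the record layout, the list of required fields and the alias table are assumptions; only the placeholder code 9999999 and the mumbai/bombay pair come from the slide.

```python
PLACEHOLDER_CODES = {"9999999"}          # "in case of a problem use 9999999"
REQUIRED_FIELDS = ("custno", "name", "product_code")  # assumed layout
CITY_ALIASES = {"bombay": "mumbai"}      # different names, same city


def find_problems(record, valid_codes):
    """Return a list of integrity problems found in one source record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    code = record.get("product_code")
    if code in PLACEHOLDER_CODES:
        problems.append("placeholder product code")
    elif code and code not in valid_codes:
        problems.append("unknown product code")
    return problems


def canonical_city(name):
    """Map alternative city names onto one canonical spelling."""
    key = name.strip().lower()
    return CITY_ALIASES.get(key, key)


bad = find_problems({"custno": "C1", "name": "", "product_code": "9999999"}, {"P100"})
good = find_problems({"custno": "C2", "name": "Agarwal", "product_code": "P100"}, {"P100"})
print(bad, good, canonical_city(" Bombay "))
```

Variant spellings of personal names (Agarwal, Agrawal, Aggarwal) generally need fuzzy matching, which this sketch does not attempt.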
-
Data Transformation Terms
• Extracting
• Conditioning
• Scrubbing
• Merging
• Householding
• Enrichment
• Scoring
• Loading
• Validating
• Delta Updating
-
Data Transformation Terms
• Householding
  – Identifying all members of a household (living at the same address)
  – Ensures only one mail is sent to a household
  – Can result in substantial savings: 1 million catalogues at Rs. 50 each costs Rs. 50 million. A 2% savings would save Rs. 1 million
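Householding and the savings arithmetic on this slide can be illustrated as follows. This is a simplified sketch: the address normalization (lowercasing, dropping commas, collapsing whitespace) is an assumption, and real address matching is considerably harder. The customer names and addresses are invented.

```python
from collections import defaultdict


def normalize_address(address):
    """Crude address key: lowercase, no commas, single spaces."""
    return " ".join(address.lower().replace(",", " ").split())


def group_households(customers):
    """Group (name, address) pairs by normalized address."""
    groups = defaultdict(list)
    for name, address in customers:
        groups[normalize_address(address)].append(name)
    return dict(groups)


customers = [
    ("A. Agarwal", "12 MG Road, Pune"),
    ("S. Agarwal", "12  mg road pune"),
    ("R. Rao", "4 Hill St, Mumbai"),
]
households = group_households(customers)
mailings_needed = len(households)          # one mail per household

# The slide's arithmetic: 1 million catalogues at Rs. 50 each.
catalogues = 1_000_000
cost_per_catalogue_rs = 50
total_cost_rs = catalogues * cost_per_catalogue_rs   # Rs. 50 million
savings_rs = total_cost_rs * 2 // 100                 # 2% of the total
print(mailings_needed, total_cost_rs, savings_rs)
```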
-
Refresh
• Propagate updates on source data to the warehouse
• Issues:
  – when to refresh
  – how to refresh: incremental refresh techniques
-
When to Refresh?
• periodically (e.g., every night, every week) or after significant events
• on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
• refresh policy set by administrator based on user needs and traffic
• possibly different policies for different sources
-
Refresh techniques
• Incremental techniques
  – detect changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
    • snapshots (Oracle)
    • transaction shipping (Sybase)
  – compute changes to derived and summary tables
  – maintain transactional correctness for incremental load
-
How To Detect Changes
• Create a snapshot log table to record ids of updated rows of source data and timestamp
• Detect changes by:
  – Defining after row triggers to update snapshot log when source table changes
  – Using regular transaction log to detect changes to source data
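The trigger-based approach can be tried out in SQLite from Python. This is a sketch under assumptions: the `account` source table, the `snapshot_log` column names and the trigger name are all hypothetical, and production systems of the kind named in these slides have their own replication and snapshot facilities.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
-- Snapshot log: id of each updated source row plus a timestamp.
CREATE TABLE snapshot_log (
    row_id     INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- After-row trigger: record every update to the source table.
CREATE TRIGGER account_after_update AFTER UPDATE ON account
FOR EACH ROW
BEGIN
    INSERT INTO snapshot_log (row_id) VALUES (NEW.id);
END;
INSERT INTO account VALUES (1, 100.0), (2, 200.0), (3, 300.0);
""")

# A source-side change made after the last refresh.
src.execute("UPDATE account SET balance = 250.0 WHERE id = 2")

# Incremental refresh: fetch only rows whose ids appear in the log.
changed_rows = src.execute("""
    SELECT id, balance FROM account
    WHERE id IN (SELECT row_id FROM snapshot_log)
    ORDER BY id
""").fetchall()
print(changed_rows)

# Once the warehouse has been refreshed, the log can be cleared.
src.execute("DELETE FROM snapshot_log")
```

Only the changed row is shipped to the warehouse, instead of rescanning the whole source table.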
-
Querying Data Warehouses
• SQL Extensions
• Multidimensional modeling of data
  – OLAP
  – More on OLAP later …
-
SQL Extensions
• Extended family of aggregate functions
  – rank (top 10 customers)
  – percentile (top 30% of customers)
  – median, mode
  – Object Relational Systems allow addition of new aggregate functions
• Reporting features
  – running total, cumulative totals
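To make these aggregates concrete, here is what each one computes, written in plain Python rather than in any vendor's SQL dialect. The customer names and figures are invented for the example, and a top 3 is used instead of a top 10 because the sample is small.

```python
from itertools import accumulate
from statistics import median, mode

# Hypothetical sales per customer.
sales = {"Asha": 500, "Ravi": 300, "Meena": 700, "John": 300, "Li": 200}

# rank: order customers by sales, highest first, and keep the top N.
ranked = sorted(sales.items(), key=lambda item: item[1], reverse=True)
top_3 = [name for name, _ in ranked[:3]]

# percentile: the top 30% of customers (rounded down, at least one).
cutoff = max(1, len(ranked) * 30 // 100)
top_30_percent = [name for name, _ in ranked[:cutoff]]

# median and mode of the sales amounts.
median_sales = median(sales.values())
mode_sales = mode(sales.values())

# running (cumulative) total over hypothetical monthly sales.
monthly_sales = [100, 150, 120, 130]
running_total = list(accumulate(monthly_sales))

print(top_3, top_30_percent, median_sales, mode_sales, running_total)
```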
-
Reporting Tools
• Andyne Computing
-
[Figure: operational data from the Bombay branch, Delhi branch and Calcutta branch (Oracle, IMS, SAS), together with Census data and GIS data, is merged, cleaned and summarized into detailed transactional data in the data warehouse (Relational DBMS+, e.g. Redbrick). Decision support tools on top: direct query, reporting tools (Crystal reports), OLAP (Essbase), mining tools (Intelligent Miner).]
-
Deploying Data Warehouses
• What business information keeps you in business today? What business information can put you out of business tomorrow?
• What business information should be a mouse click away?
• What business conditions are driving the need for business information?
-
Cultural Considerations
• Not just a technology project
• New way of using information to support daily activities and decision making
• Care must be taken to prepare organization for change
• Must have organizational backing and support
-
User Training
• Users must have a higher level of IT proficiency than for operational systems
• Training to help users analyze data in the warehouse effectively
-
Summary: Building a Data Warehouse
Data Warehouse Lifecycle
  – Analysis
  – Design
  – Import data
  – Install front-end tools
  – Test and deploy
-
A case: the STORET Central Warehouse
• Improved performance and faster data retrieval
• Ability to produce larger reports
• Ability to provide more data query options
• Streamlined application navigation
-
Old Web Application Flow
-
Central Warehouse Application Flow
[Figure: Search Criteria Selection, Report Size Feedback, Report Customization, Report Generation]
-
STORET Central Warehouse: Web Application Demo
http://epa.gov/storet/dw_home.html
-
STORET Central Warehouse: Potential Future Enhancements
• More query functionality
• Additional report types
• Web Services
• Additional source systems?
[Figure: STORET and several State Systems feed the Data Warehouse.]
-
Data Warehouse Components
SOURCE: Ralph Kimball
-
Data Warehouse Components: Detailed
SOURCE: Ralph Kimball
-
Online analytical processing (OLAP)
-
Nature of OLAP Analysis
• Aggregation (total sales, percent-to-total)
• Comparison: Budget vs. Expenses
• Ranking: Top 10, quartile analysis
• Access to detailed and aggregate data
• Complex criteria specification
• Visualization
• Need interactive response to aggregate
-
Multidimensional Data
• Measure: sales (actual, plan, variance)
[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7).]
Dimensions: Product, Region, Time
Hierarchical summarization paths:
Product  | Region  | Time
Industry | Country | Year
Category | Region  | Quarter
Product  | City    | Month, Week
         | Office  | Day
-
Conceptual Model for OLAP
• Numeric measures to be analyzed
  – e.g. Sales (Rs), sales (volume), budget, revenue, inventory
• Dimensions
  – other attributes of data, define the space
  – e.g., store, product, date-of-sale
  – hierarchies on dimensions
    • e.g. branch -> city -> state
-
Operations
• Rollup: summarize data
  – e.g., given sales data, summarize sales for last year by product category and region
• Drill down: get more details
  – e.g., given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region
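Rollup and drill down can both be seen as grouping the same detailed rows at a coarser or finer level. The sketch below assumes a small in-memory list of sales rows; the row layout, the helper name `summarize` and all sample values are invented for illustration.

```python
from collections import defaultdict

# Detailed sales rows: (category, region, city, amount). Sample data only.
sales_rows = [
    ("Beverage", "Andhra", "Hyderabad", 70),
    ("Beverage", "Andhra", "Vijayawada", 30),
    ("Beverage", "Maharashtra", "Mumbai", 100),
    ("Toiletry", "Andhra", "Hyderabad", 20),
    ("Toiletry", "Maharashtra", "Pune", 50),
]
COLUMN = {"category": 0, "region": 1, "city": 2}


def summarize(rows, *dims):
    """Total the amount for each combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[COLUMN[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


# Rollup: summarize sales by product category and region.
by_category_region = summarize(sales_rows, "category", "region")

# Drill down: break up sales by city within the Andhra region.
andhra_rows = [row for row in sales_rows if row[1] == "Andhra"]
andhra_by_city = summarize(andhra_rows, "region", "city")

print(by_category_region)
print(andhra_by_city)
```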
-
More OLAP Operations
• Hypothesis driven search: E.g. factors affecting defaulters
  – view defaulting rate on age aggregated over other dimensions
  – for particular age segment detail along profession
• Need interactive response to aggregate queries
  – => precompute various aggregates
-
MOLAP vs ROLAP
• MOLAP: Multidimensional array OLAP
• ROLAP: Relational OLAP
Type   Size  Colour  Amount
Shirt  S     Blue    10
Shirt  L     Blue    25
Shirt  ALL   Blue    35
Shirt  S     Red     3
Shirt  L     Red     7
Shirt  ALL   Red     10
Shirt  ALL   ALL     45
…      …     …       …
ALL    ALL   ALL     1290
-
SQL Extensions
• Cube operator
  – group by on all subsets of a set of attributes (month, city)
  – redundant scan and sorting of data can be avoided
• Various other non-standard SQL extensions by vendors
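What the cube operator computes can be sketched in a few lines of Python: a group-by over every subset of the grouping attributes, with "ALL" standing in for an attribute that has been aggregated away. The four detail rows reuse the shirt figures from the MOLAP vs ROLAP table; the function name and the single-pass implementation are illustrative assumptions, not how any particular database implements CUBE.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, n_dims):
    """Aggregate the last column over all subsets of the first n_dims columns."""
    totals = defaultdict(int)
    for *dims, amount in rows:
        # Every subset of dimension positions to keep; the rest become "ALL".
        for k in range(n_dims + 1):
            for keep in combinations(range(n_dims), k):
                key = tuple(dims[i] if i in keep else "ALL" for i in range(n_dims))
                totals[key] += amount
    return dict(totals)


# Detail rows: (type, size, colour, amount).
shirt_rows = [
    ("Shirt", "S", "Blue", 10),
    ("Shirt", "L", "Blue", 25),
    ("Shirt", "S", "Red", 3),
    ("Shirt", "L", "Red", 7),
]
shirt_cube = cube(shirt_rows, 3)

print(shirt_cube[("Shirt", "ALL", "Blue")])   # all blue shirts
print(shirt_cube[("Shirt", "ALL", "ALL")])    # all shirts
```

All groupings are filled in during one pass over the detail rows, which is the point of avoiding redundant scans. With only shirts in the sample, the grand total equals the shirt total; the 1290 in the slide's table includes product types not shown there.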
-
OLAP: 3 Tier DSS
• Data Warehouse (Database Layer): Store atomic data in industry standard Data Warehouse.
• OLAP Engine (Application Logic Layer)
-
Strengths of OLAP
• It is a powerful visualization tool
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools
-
Brief History
• Express and System W DSS
• Online Analytical Processing: coined by EF Codd in 1994 white paper by Arbor Software
-
OLAP and Executive Information Systems
• Andyne Computing: Pablo
• Arbor Software: Essbase
• Cognos: PowerPlay
• Comshare: Commander OLAP
• Holistic Systems: Holos
• Information Advantage: AXSYS, WebOLAP
• Informix: Metacube
• Microstrategies: DSS/Agent
• Oracle: Express
• Pilot: LightShip
• Planning Sciences
-
Microsoft OLAP strategy
• Plato: OLAP server: powerful, integrating various operational sources
• OLEDB for OLAP: emerging industry standard based on MDX, an extension of SQL for OLAP
• Pivot-table services: integrate with Office 2000
  – Every desktop will have OLAP capability.
• Client side caching and calculations
• Partitioned and virtual cube
• Hybrid relational and multidimensional storage