datawarehouse (3).pptx

Upload: mypmpbooks

Post on 07-Aug-2018


TRANSCRIPT

  • 8/20/2019 dataWarehouse (3).pptx

    1/68

    An Introduction to Data Warehousing

    Data, Data everywhere yet ...
    • I can't find the data I need
     – data is scattered over the network
     – many versions, subtle differences
    • I can't get the data I need
     – need an expert to get the data
    • I can't understand the data I found
     – available data poorly documented
    • I can't use the data I found
     – results are unexpected
     – data needs to be transformed from one form to another

    So What Is a Data Warehouse?
    Definition: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]
    • By comparison: an OLTP (online transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.
    • OLTP systems are usually designed independently of each other and it is difficult for them to share information.

    Why Do We Need Data Warehouses?
    • Consolidation of information resources
    • Improved query performance
    • Separate research and decision support functions from the operational systems
    • Foundation for data mining, data visualization, advanced reporting and OLAP tools

    Why Data Warehousing?
    • Which are our lowest/highest margin customers?
    • Who are my customers and what products are they buying?
    • Which customers are most likely to go to the competition?
    • What impact will new products/services have on revenue and margins?
    • What product promotions have the biggest impact on revenue?
    • What is the most effective distribution channel?

    What Is a Data Warehouse Used for?
    • Knowledge discovery
     – Making consolidated reports
     – Finding relationships and correlations
     – Data mining
     – Examples
       • Banks identifying credit risks
       • Insurance companies searching for fraud
       • Medical research

    Comparison Chart of Database Types

    Data warehouse                                                      | Operational system
    Subject oriented                                                    | Transaction oriented
    Large (hundreds of GB up to several TB)                             | Small (MB up to several GB)
    Historic data                                                       | Current data
    De-normalized table structure (few tables, many columns per table)  | Normalized table structure (many tables, few columns per table)
    Batch updates                                                       | Continuous updates
    Usually very complex queries                                        | Simple to complex queries

    Design Differences
    [Figure: Operational System (ER Diagram) compared with Data Warehouse (Star Schema)]

    Supporting a Complete Solution
    • Operational System: Data Entry
    • Data Warehouse: Data Retrieval

    Data Warehouses, Data Marts, and Operational Data Stores
    • Data Warehouse: The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.
    • Data Mart: A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.
    • Operational Data Store (ODS): A point of integration for operational systems that developed independently of each other. Since an ODS supports day-to-day operations, it needs to be continually updated.

    Decision Support
    • Used to manage and control business
    • Data is historical or point-in-time
    • Optimized for inquiry rather than update
    • Use of the system is loosely defined and can be ad hoc
    • Used by managers and end users to understand the business and make judgements

    What are the users saying...
    • Data should be integrated across the enterprise
    • Summary data had a real value to the organization
    • Historical data held the key to understanding data over time
    • What-if capabilities are required

    Data Warehousing: It is a process
    • Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
    • A decision support database maintained separately from the organization's operational database

    Data Warehouse Architecture
    [Figure labels: Relational Databases, Legacy Data, Purchased Data; Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Analyze, Query; Metadata Repository]

    From the Data Warehouse to Data Marts
    [Figure labels: Data Warehouse (Organizationally Structured), Departmentally Structured, Individually Structured; Less, More; History, Normalized, Detailed; Data, Information]

    Users have different views of Data
    Organizationally structured
    OLAP
    Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
    Farmers: Harvest information from known access paths
    Tourists: Browse information harvested by farmers

    Wal*Mart Case Study
    • Founded by Sam Walton
    • One of the largest Super Market Chains in the US
    • Wal*Mart: 2000 Retail Stores
    • SAM's Clubs: 100 Wholesalers Stores
    • This case study is from Felipe Carino's (NCR Teradata) presentation made at Stanford Database Seminar

    Old Retail Paradigm
    • Wal*Mart
     – Inventory Management
     – Merchandise Accounts Payable
     – Purchasing
     – Supplier Promotions: National, Region, Store Level
    • Suppliers
     – Accept Orders
     – Promote Products
     – Provide special Incentives
     – Monitor and Track The Incentives
     – Bill and Collect Receivables
     – Estimate Retailer Demands

    New (Just-In-Time) Retail Paradigm
    • No more deals
    • Shelf-Pass Through (POS Application)
     – One Unit Price
       • Suppliers paid once a week on ACTUAL items sold
     – Wal*Mart Manager
       • Daily Inventory Restock
       • Suppliers (sometimes Same Day) ship to Wal*Mart
    • Warehouse-Pass Through
     – Stock some Large Items
       • Delivery may come from supplier
     – Distribution Center
       • Supplier's merchandise unloaded directly onto Wal*Mart Trucks

    Information as a Strategic Weapon
    • Daily Summary of all Sales Information
    • Regional Analysis of all Stores in a logical area
    • Specific Product Sales
    • Specific Supplies Sales
    • Trend Analysis, etc.
    • Wal*Mart uses information when negotiating with
     – Suppliers
     – Advertisers etc.

    Schema Design
    • Database organization
     – must look like business
     – must be recognizable by business user
     – approachable by business user
     – Must be simple
    • Schema Types
     – Star Schema
     – Fact Constellation Schema
     – Snowflake schema

    Star Schema
    • A single fact table and for each dimension one dimension table
    • Does not capture hierarchies directly
    [Figure: a fact table (date, custno, prodno, cityname, sales) surrounded by the dimension tables time, prod, cust and city]
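    A minimal sketch of the star schema on this slide, using Python's built-in sqlite3 module. The fact table columns (date, custno, prodno, cityname, sales) come from the slide; the dimension table columns, the sample rows and the helper name sales_by_category are assumptions made for illustration only.

```python
import sqlite3

# In-memory database; every row inserted below is made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table per dimension (column choices are assumed).
    CREATE TABLE time (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE cust (custno INTEGER PRIMARY KEY, custname TEXT);
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT, category TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY, state TEXT);

    -- A single fact table that references every dimension.
    CREATE TABLE fact (
        date     TEXT    REFERENCES time(date),
        custno   INTEGER REFERENCES cust(custno),
        prodno   INTEGER REFERENCES prod(prodno),
        cityname TEXT    REFERENCES city(cityname),
        sales    REAL
    );
""")
conn.executemany("INSERT INTO time VALUES (?, ?, ?)",
                 [("2019-01-05", "Jan", 2019), ("2019-01-06", "Jan", 2019)])
conn.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Asha"), (2, "Bala")])
conn.executemany("INSERT INTO prod VALUES (?, ?, ?)",
                 [(1, "Cola", "Beverages"), (2, "Juice", "Beverages"),
                  (3, "Soap", "Toiletries")])
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("Mumbai", "Maharashtra"), ("Delhi", "Delhi")])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
                 [("2019-01-05", 1, 1, "Mumbai", 100.0),
                  ("2019-01-05", 2, 2, "Delhi", 50.0),
                  ("2019-01-06", 1, 3, "Mumbai", 30.0)])


def sales_by_category(connection):
    """Join the fact table to one dimension and aggregate the measure."""
    return connection.execute("""
        SELECT p.category, SUM(f.sales)
        FROM fact AS f
        JOIN prod AS p ON p.prodno = f.prodno
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()


print(sales_by_category(conn))
```

    Every analytical query follows the same shape: start from the fact table, join only the dimensions needed, and group by dimension attributes.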

    Dimension Tables
    • Dimension tables
     – Define business in terms already familiar to users
     – Wide rows with lots of descriptive text
     – Small tables (about a million rows)
     – Joined to fact table by a foreign key
     – heavily indexed
     – typical dimensions
       • time periods, geographic region (markets, cities), products, customers, salesperson, etc.

    Fact Table
    • Central table
     – Typical example: individual sales records
     – mostly raw numeric items
     – narrow rows, a few columns at most
     – large number of rows (millions to a billion)
     – Access via dimensions

    Snowflake schema
    • Represent dimensional hierarchy directly by normalizing tables.
    • Easy to maintain and saves storage
    [Figure: a fact table (date, custno, prodno, cityname, ...) with the dimension tables time, prod, cust and city; the city table links to a further region table]
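    A minimal sketch of the snowflake idea, again with sqlite3: the city dimension is normalized so that the city-to-region hierarchy lives in its own table. The table and column names other than cityname, and all rows, are assumptions for illustration.

```python
import sqlite3

# In-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The hierarchy level "region" is split out of the city dimension.
    CREATE TABLE region (regionname TEXT PRIMARY KEY);
    CREATE TABLE city (
        cityname   TEXT PRIMARY KEY,
        regionname TEXT REFERENCES region(regionname)
    );
    CREATE TABLE fact (cityname TEXT REFERENCES city(cityname), sales REAL);
""")
conn.executemany("INSERT INTO region VALUES (?)", [("West",), ("North",)])
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("Mumbai", "West"), ("Pune", "West"), ("Delhi", "North")])
conn.executemany("INSERT INTO fact VALUES (?, ?)",
                 [("Mumbai", 100.0), ("Pune", 40.0), ("Delhi", 60.0)])


def sales_by_region(connection):
    """Walk the hierarchy fact -> city -> region with two joins."""
    return connection.execute("""
        SELECT r.regionname, SUM(f.sales)
        FROM fact AS f
        JOIN city AS c ON c.cityname = f.cityname
        JOIN region AS r ON r.regionname = c.regionname
        GROUP BY r.regionname
        ORDER BY r.regionname
    """).fetchall()


print(sales_by_region(conn))
```

    The region name is stored once per city rather than once per descriptive row, which is where the storage saving comes from; the price is the extra join.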

    Fact Constellation
    • Fact Constellation
     – Multiple fact tables that share many dimension tables
     – Booking and Checkout may share many dimension tables in the hotel industry
    [Figure: Booking and Checkout fact tables with the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer]

    Data

    Levels of

    Data Integration Across Sources
    [Figure: four sources (Trust, Credit card, Savings, Loans) annotated with: Same data, different name; Different data, same name; Data found here, nowhere else; Different keys, same data]

    Data Transformation
    • Data transformation is the foundation for achieving single version of the truth
    • Major concern for IT
    • Data warehouse can fail if appropriate data transformation strategy is not developed
    [Figure: Source Data (Sequential, Legacy, Relational, External, Operational) passes through Data Transformation: Accessing, Capturing, Extracting, Householding, Filtering, Reconciling, Conditioning, Loading, Validating, Scoring]

    Data Transformation Example
    • field
     – appl A - balance
     – appl B - bal
     – appl C - currbal
     – appl D - balcurr
    • unit
     – appl A - pipeline - cm
     – appl B - pipeline - in
     – appl C - pipeline - feet
     – appl D - pipeline - yds
    • encoding
     – appl A - m,f
     – appl B - 1,0
     – appl C - x,y
     – appl D - male, female
    All feed the Data Warehouse
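    The three kinds of mismatch on this slide (field name, unit, encoding) can be sketched as one per-application rule table. The field names, units and code pairs are from the slide; which code stands for which gender (for example 1 versus 0, x versus y), the sample records and the function name to_warehouse are assumptions for illustration.

```python
# Centimetres per source unit, used to bring every pipeline length to cm.
CM_PER_UNIT = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}

# Per-application conventions. The gender code meanings are assumed.
SOURCE_RULES = {
    "A": {"balance_field": "balance", "unit": "cm",
          "gender": {"m": "male", "f": "female"}},
    "B": {"balance_field": "bal", "unit": "in",
          "gender": {"1": "male", "0": "female"}},
    "C": {"balance_field": "currbal", "unit": "feet",
          "gender": {"x": "male", "y": "female"}},
    "D": {"balance_field": "balcurr", "unit": "yds",
          "gender": {"male": "male", "female": "female"}},
}


def to_warehouse(app, record):
    """Map one source record onto the single warehouse representation."""
    rules = SOURCE_RULES[app]
    return {
        # Field: four different source names become one column.
        "balance": record[rules["balance_field"]],
        # Unit: every length is converted to centimetres.
        "pipeline_cm": round(record["pipeline"] * CM_PER_UNIT[rules["unit"]], 2),
        # Encoding: every code pair becomes male/female.
        "gender": rules["gender"][str(record["gender"])],
    }


print(to_warehouse("B", {"bal": 500, "pipeline": 10, "gender": 1}))
print(to_warehouse("C", {"currbal": 20, "pipeline": 2, "gender": "y"}))
```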

    Data Integrity Problems
    • Same person, different spellings
     – Agarwal, Agrawal, Aggarwal etc...
    • Multiple ways to denote company name
     – Persistent Systems, PSPL, Persistent Pvt. LTD.
    • Use of different names
     – mumbai, bombay
    • Different account numbers generated by different applications for the same customer
    • Required fields left blank
    • Invalid product codes collected at point of sale
     – manual entry leads to mistakes
     – "in case of a problem use 9999999"

    Data Transformation Terms
    • Extracting
    • Conditioning
    • Scrubbing
    • Merging
    • Householding
    • Enrichment
    • Scoring
    • Loading
    • Validating
    • Delta Updating

    Data Transformation Terms
    • Householding
     – Identifying all members of a household (living at the same address)
     – Ensures only one mail is sent to a household
     – Can result in substantial savings: 1 million catalogues at Rs. 50 each costs Rs. 50 million. A 2% savings would save Rs. 1 million
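    A rough sketch of householding, assuming that two customers belong to the same household when their addresses match after simple normalization (real matching is usually fuzzier). The customer names, addresses and helper names are hypothetical; the cost figures are the ones on the slide.

```python
from collections import defaultdict


def normalize_address(address):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in address.lower())
    return " ".join(cleaned.split())


def households(customers):
    """Group customer names by normalized address: one mailing per group."""
    groups = defaultdict(list)
    for name, address in customers:
        groups[normalize_address(address)].append(name)
    return dict(groups)


# Hypothetical customers; the first two share an address written two ways.
customers = [
    ("A. Rao", "12, MG Road, Pune"),
    ("S. Rao", "12 MG road  Pune"),
    ("K. Shah", "7 Hill Street, Mumbai"),
]
mailing = households(customers)
print(len(customers), "customers ->", len(mailing), "mailings")

# The arithmetic from the slide.
catalogues = 1_000_000
cost_each_rs = 50
total_cost_rs = catalogues * cost_each_rs   # Rs. 50 million
savings_rs = total_cost_rs * 2 // 100       # 2% of the total: Rs. 1 million
print(total_cost_rs, savings_rs)
```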

    Refresh
    • Propagate updates on source data to the warehouse
    • Issues:
     – when to refresh
     – how to refresh: incremental refresh techniques

    When to Refresh?
    • periodically (e.g., every night, every week) or after significant events
    • on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
    • refresh policy set by administrator based on user needs and traffic
    • possibly different policies for different sources

    Refresh techniques
    • Incremental techniques
     – detect changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
       • snapshots (Oracle)
       • transaction shipping (Sybase)
     – compute changes to derived and summary tables
     – maintain transactional correctness for incremental load
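    One way to picture "compute changes to derived and summary tables" is to fold only the newly arrived rows into an existing summary instead of recomputing it from all base data. The sketch below assumes an additive measure (a sum) and insert-only changes; the function name and figures are hypothetical.

```python
def apply_delta(summary, delta_rows):
    """Fold newly arrived (key, amount) rows into an existing per-key total."""
    for key, amount in delta_rows:
        summary[key] = summary.get(key, 0) + amount
    return summary


def full_recompute(all_rows):
    """Reference result: rebuild the summary by scanning every base row."""
    totals = {}
    for key, amount in all_rows:
        totals[key] = totals.get(key, 0) + amount
    return totals


# Base rows already loaded, and the summary that was built from them.
base_rows = [("Mumbai", 100), ("Mumbai", 200), ("Delhi", 120)]
summary = full_recompute(base_rows)

# Only the new rows are processed during the incremental refresh.
delta_rows = [("Mumbai", 50), ("Pune", 20)]
apply_delta(summary, delta_rows)
print(summary)
```

    Updates and deletes, or non-additive aggregates such as median, need more bookkeeping than this sketch shows.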

    How To Detect Changes
    • Create a snapshot log table to record ids of updated rows of source data and timestamp
    • Detect changes by:
     – Defining after row triggers to update snapshot log when source table changes
     – Using regular transaction log to detect changes to source data
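    The trigger-based option can be sketched with SQLite through Python's sqlite3 module. This is an illustration of the idea rather than any vendor's snapshot mechanism; the table names account and snapshot_log and the trigger name are assumptions, and only updates are logged here (insert and delete triggers would be analogous).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source table (hypothetical).
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);

    -- Snapshot log: id of each changed row plus a timestamp.
    CREATE TABLE snapshot_log (
        row_id     INTEGER,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- After-row trigger that records every update to the source table.
    CREATE TRIGGER account_after_update AFTER UPDATE ON account
    FOR EACH ROW
    BEGIN
        INSERT INTO snapshot_log (row_id) VALUES (NEW.id);
    END;
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 200.0)])

# A change on the source system after the last warehouse refresh.
conn.execute("UPDATE account SET balance = 150.0 WHERE id = 2")

# The refresh job reads the log to find which rows to propagate.
changed_ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT row_id FROM snapshot_log ORDER BY row_id")]
print(changed_ids)
```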

    Querying Data Warehouses
    • SQL Extensions
    • Multidimensional modeling of data
     – OLAP
     – More on OLAP later …

    SQL Extensions
    • Extended family of aggregate functions
     – rank (top 10 customers)
     – percentile (top 30% of customers)
     – median, mode
     – Object Relational Systems allow addition of new aggregate functions
    • Reporting features
     – running total, cumulative totals
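    To make clear what these aggregates compute, here is a plain Python sketch (not SQL, and not any particular vendor's syntax). The customer names, sales figures and function names are hypothetical.

```python
from itertools import accumulate
from statistics import median, mode

# Hypothetical total sales per customer.
sales = {"Asha": 500, "Bala": 300, "Chitra": 800, "Dev": 300, "Esha": 100}


def top_n(totals, n):
    """Rank customers by sales and keep the first n (e.g. top 10 customers)."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def top_percent(totals, percent):
    """Keep the best `percent` percent of customers, at least one."""
    count = max(1, len(totals) * percent // 100)
    return top_n(totals, count)


# Reporting feature: running (cumulative) total over hypothetical months.
monthly = [100, 250, 50, 300]
running_total = list(accumulate(monthly))

print(top_n(sales, 2))
print(top_percent(sales, 40))
print(median(sales.values()), mode(sales.values()))
print(running_total)
```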

    Reporting Tools
    • Andyne Computing

    [Figure labels: Bombay branch, Delhi branch, Calcutta branch; Census data; GIS data; Operational data; Detailed transactional data; Merge, Clean, Summarize; Data warehouse; Direct Query; Reporting tools; Mining tools; OLAP; Decision support tools; Oracle; SAS; Relational DBMS, e.g. Redbrick; IMS; Crystal reports; Essbase; Intelligent Miner]

    Deploying Data Warehouses
    • What business information keeps you in business today? What business information can put you out of business tomorrow?
    • What business information should be a mouse click away?
    • What business conditions are driving the need for business information?

    Cultural Considerations
    • Not just a technology project
    • New way of using information to support daily activities and decision making
    • Care must be taken to prepare organization for change
    • Must have organizational backing and support

    User Training
    • Users must have a higher level of IT proficiency than for operational systems
    • Training to help users analyze data in the warehouse effectively

    Summary: Building a Data Warehouse
    Data Warehouse Lifecycle
     – Analysis
     – Design
     – Import data
     – Install front-end tools
     – Test and deploy

    A case: the STORET Central Warehouse
    • Improved performance and faster data retrieval
    • Ability to produce larger reports
    • Ability to provide more data query options
    • Streamlined application navigation

    Old Web Application Flow

    Central Warehouse Application Flow
    [Figure labels: Search Criteria Selection; Report Size Feedback; Report Customization; Report Generation]

    STORET Central Warehouse: Web Application Demo
    http://epa.gov/storet/dw_home.html

    STORET Central Warehouse: Potential Future Enhancements
    • More query functionality
    • Additional report types
    • Web Services
    • Additional source systems?
    [Figure labels: STORET; State System; State System; Data Warehouse]

    Data Warehouse Components
    SOURCE: Ralph Kimball

    Data Warehouse Components: Detailed
    SOURCE: Ralph Kimball

    Online analytical processing (OLAP)

    Nature of OLAP Analysis
    • Aggregation (total sales, percent-to-total)
    • Comparison: Budget vs. Expenses
    • Ranking: Top 10, quartile analysis
    • Access to detailed and aggregate data
    • Complex criteria specification
    • Visualization
    • Need interactive response to aggregate queries

    Multidimensional Data
    • Measure: sales (actual, plan, variance)
    [Figure: a data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region and Month (1 to 7)]
    Dimensions: Product, Region, Time
    Hierarchical summarization paths

    Product   | Region   | Time
    Industry  | Country  | Year
    Category  | Region   | Quarter
    Product   | City     | Month, Week
              | Office   | Day

    Conceptual Model for OLAP
    • Numeric measures to be analyzed
     – e.g. Sales (Rs), sales (volume), budget, revenue, inventory
    • Dimensions
     – other attributes of data, define the space
     – e.g., store, product, date-of-sale
     – hierarchies on dimensions
       • e.g. branch -> city -> state

    Operations
    • Rollup: summarize data
     – e.g., given sales data, summarize sales for last year by product category and region
    • Drill down: get more details
     – e.g., given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region
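    Roll-up and drill-down can both be seen as the same aggregation run at a different level of a hierarchy. The sketch below assumes a small in-memory list of (category, region, city, sales) rows; the rows and the helper name aggregate are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows: (category, region, city, sales).
SALES = [
    ("Beverages", "Andhra", "Visakhapatnam", 120),
    ("Beverages", "Andhra", "Vijayawada", 80),
    ("Beverages", "Maharashtra", "Mumbai", 200),
    ("Toiletries", "Andhra", "Visakhapatnam", 60),
]


def aggregate(rows, key_indexes, where=lambda row: True):
    """Sum the last column, grouped by the columns listed in key_indexes."""
    totals = defaultdict(int)
    for row in rows:
        if where(row):
            totals[tuple(row[i] for i in key_indexes)] += row[-1]
    return dict(totals)


# Roll-up: summarize sales by product category and region.
rollup = aggregate(SALES, (0, 1))

# Drill-down: break sales up by city, restricted to the Andhra region.
drilldown = aggregate(SALES, (1, 2), where=lambda row: row[1] == "Andhra")

print(rollup)
print(drilldown)
```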

    More OLAP Operations
    • Hypothesis driven search: e.g. factors affecting defaulters
     – view defaulting rate on age aggregated over other dimensions
     – for particular age segment detail along profession
    • Need interactive response to aggregate queries
     – => precompute various aggregates

    MOLAP vs ROLAP
    • MOLAP: Multidimensional array OLAP
    • ROLAP: Relational OLAP

    Type   | Size | Colour | Amount
    Shirt  | S    | Blue   | 10
    Shirt  | L    | Blue   | 25
    Shirt  | ALL  | Blue   | 35
    Shirt  | S    | Red    | 3
    Shirt  | L    | Red    | 7
    Shirt  | ALL  | Red    | 10
    Shirt  | ALL  | ALL    | 45
    …      | …    | …      | …
    ALL    | ALL  | ALL    | 1290
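    The two storage styles can be contrasted on the shirt figures from the table. This is only a sketch: the relational form keeps one row per cell with ALL marking an aggregated column, while the array form stores the same amounts by position. The function names and the choice of nested lists for the array are assumptions.

```python
SIZES = ["S", "L", "ALL"]
COLOURS = ["Blue", "Red", "ALL"]

# Detail cells for shirts, taken from the table: (size, colour) -> amount.
base = {("S", "Blue"): 10, ("L", "Blue"): 25, ("S", "Red"): 3, ("L", "Red"): 7}


def rolap_rows(cells):
    """Relational form: one (type, size, colour, amount) row per cell."""
    rows = []
    for colour in COLOURS:
        for size in SIZES:
            # "ALL" matches every value of that column.
            amount = sum(value for (s, c), value in cells.items()
                         if size in (s, "ALL") and colour in (c, "ALL"))
            rows.append(("Shirt", size, colour, amount))
    return rows


def molap_array(rows):
    """Array form: sizes index the rows, colours index the columns."""
    array = [[0] * len(COLOURS) for _ in SIZES]
    for _type, size, colour, amount in rows:
        array[SIZES.index(size)][COLOURS.index(colour)] = amount
    return array


rows = rolap_rows(base)
cube = molap_array(rows)

# The same cell, looked up both ways: all sizes, colour Blue.
print([r for r in rows if r[1] == "ALL" and r[2] == "Blue"])
print(cube[SIZES.index("ALL")][COLOURS.index("Blue")])
```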

    SQL Extensions
    • Cube operator
     – group by on all subsets of a set of attributes (month, city)
     – redundant scan and sorting of data can be avoided
    • Various other non-standard SQL extensions by vendors
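    What the cube operator computes can be sketched in plain Python: a group-by on every subset of the attributes (month, city), produced in a single pass over the data rather than one scan per grouping. The sample rows and the function name cube are hypothetical, and this does not reproduce any vendor's SQL syntax.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, attributes, measure):
    """Sum `measure` for every subset of `attributes` in one pass over rows."""
    subsets = [subset
               for size in range(len(attributes) + 1)
               for subset in combinations(attributes, size)]
    totals = defaultdict(int)
    for row in rows:
        for subset in subsets:
            # Attributes left out of the subset are rolled up to "ALL".
            key = tuple(row[a] if a in subset else "ALL" for a in attributes)
            totals[key] += row[measure]
    return dict(totals)


# Hypothetical sales rows.
rows = [
    {"month": "Jan", "city": "Mumbai", "sales": 10},
    {"month": "Jan", "city": "Delhi", "sales": 20},
    {"month": "Feb", "city": "Mumbai", "sales": 5},
]
result = cube(rows, ("month", "city"), "sales")
for key in sorted(result):
    print(key, result[key])
```

    With two attributes there are four groupings: (month, city), (month), (city) and the grand total.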

    OLAP: 3 Tier DSS
    • Data Warehouse (Database Layer): Store atomic data in industry standard Data Warehouse.
    • OLAP Engine (Application Logic Layer)

    Strengths of OLAP
    • It is a powerful visualization tool
    • It provides fast, interactive response times
    • It is good for analyzing time series
    • It can be useful to find some clusters and outliers
    • Many vendors offer OLAP tools

    Brief History
    • Express and System W DSS
    • Online Analytical Processing: coined by EF Codd in 1994 white paper by Arbor Software

    OLAP and Executive Information Systems
    • Andyne Computing: Pablo
    • Arbor Software: Essbase
    • Cognos: PowerPlay
    • Comshare: Commander OLAP
    • Holistic Systems: Holos
    • Information Advantage: AXSYS, WebOLAP
    • Informix: Metacube
    • Microstrategies: DSS/Agent
    • Oracle: Express
    • Pilot: LightShip
    • Planning Sciences

    Microsoft OLAP strategy
    • Plato: OLAP server: powerful, integrating various operational sources
    • OLEDB for OLAP: emerging industry standard based on MDX (an extension of SQL for OLAP)
    • Pivot-table services: integrate with Office 2000
     – Every desktop will have OLAP capability.
    • Client side caching and calculations
    • Partitioned and virtual cube
    • Hybrid relational and multidimensional storage