data warehouse

48
Decision Support, Data Decision Support, Data Warehousing, and OLAP Warehousing, and OLAP Anindya Datta Anindya Datta Director, iXL Center for Director, iXL Center for E-Commerce E-Commerce Georgia Institute of Georgia Institute of Technology Technology [email protected] [email protected]

Upload: ganblues

Post on 27-Jan-2015

1.400 views

Category:

Technology


1 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Data Warehouse

Decision Support, Data Decision Support, Data Warehousing, and OLAPWarehousing, and OLAP

Anindya DattaAnindya Datta

Director, iXL Center for E-Director, iXL Center for E-CommerceCommerce

Georgia Institute of Georgia Institute of TechnologyTechnology

[email protected]@cc.gatech.edu

Page 2: Data Warehouse

OutlineOutline

Terminology: OLAP vs. OLTPTerminology: OLAP vs. OLTP Data Warehousing ArchitectureData Warehousing Architecture TechnologiesTechnologies ProductsProducts Research IssuesResearch Issues ReferencesReferences

Page 3: Data Warehouse

Decision Support and Decision Support and OLAPOLAP

Information technology to help the knowledge worker Information technology to help the knowledge worker (executive, manager, analyst) make faster and better (executive, manager, analyst) make faster and better decisions.decisions.• What were the sales volumes by region and product category What were the sales volumes by region and product category

for the last year?for the last year?

• How did the share price of computer manufacturers correlate How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?with quarterly profits over the past 10 years?

• Which orders should we fill to maximize revenues?Which orders should we fill to maximize revenues?

• Will a 10% discount increase sales volume sufficiently?Will a 10% discount increase sales volume sufficiently?

• Which of two new medications will result in the best Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay?outcome: higher recovery rate & shorter hospital stay?

On-Line Analytical Processing (OLAP) is an element of On-Line Analytical Processing (OLAP) is an element of decision support systmes (DSS). decision support systmes (DSS).

Page 4: Data Warehouse

EvolutionEvolution

60’s: Batch reports60’s: Batch reports• hard to find and analyze informationhard to find and analyze information

• inflexible and expensive, reprogram every new requestinflexible and expensive, reprogram every new request

70’s: Terminal-based DSS and EIS (executive information 70’s: Terminal-based DSS and EIS (executive information systems)systems)• still inflexible, not integrated with desktop toolsstill inflexible, not integrated with desktop tools

80’s: Desktop data access and analysis tools80’s: Desktop data access and analysis tools• query tools, spreadsheets, GUIsquery tools, spreadsheets, GUIs

• easier to use, but only access operational databaseseasier to use, but only access operational databases

90’s: Data warehousing with integrated OLAP engines and 90’s: Data warehousing with integrated OLAP engines and toolstools

Page 5: Data Warehouse

OLTP vs. OLAPOLTP vs. OLAP

Clerk, IT ProfessionalClerk, IT Professional Day to day operationsDay to day operations

Application-oriented (E-R Application-oriented (E-R based)based)

Current, IsolatedCurrent, Isolated Detailed, Flat relationalDetailed, Flat relational Structured, RepetitiveStructured, Repetitive Short, Simple transactionShort, Simple transaction Read/writeRead/write Index/hash on prim. KeyIndex/hash on prim. Key TensTens ThousandsThousands 100 MB-GB100 MB-GB Trans. throughputTrans. throughput

Knowledge workerKnowledge worker Decision supportDecision support

Subject-oriented (Star, Subject-oriented (Star, snowflake)snowflake)

Historical, ConsolidatedHistorical, Consolidated Summarized, MultidimensionalSummarized, Multidimensional Ad hocAd hoc Complex queryComplex query Read MostlyRead Mostly Lots of ScansLots of Scans MillionsMillions HundredsHundreds 100GB-TB100GB-TB Query throughput, responseQuery throughput, response

User

Function

DB Design

Data

View

Usage

Unit of work

Access

Operations

# Records accessed

#Users

Db size

Metric

OLTPOLTP OLAPOLAP

Page 6: Data Warehouse

Data WarehouseData Warehouse

A decision support database that is maintained separately A decision support database that is maintained separately from the organization’s operational databases.from the organization’s operational databases.

A data warehouse is a A data warehouse is a

• subject-oriented,subject-oriented,

• integrated,integrated,

• time-varying,time-varying,

• non-volatilenon-volatile

collection of data that is used primarily in organizational collection of data that is used primarily in organizational decision makingdecision making

Page 7: Data Warehouse

Why Separate Data Why Separate Data Warehouse?Warehouse?

PerformancePerformance• Op dbs designed & tuned for known txs & workloads.Op dbs designed & tuned for known txs & workloads.

• Complex OLAP queries would degrade perf. For op txs.Complex OLAP queries would degrade perf. For op txs.

• Special data organization, access & implementation methods Special data organization, access & implementation methods needed for multidimensional views & queries. needed for multidimensional views & queries.

FunctionFunction• Missing data: Decision support requires historical data, which op dbs do not typically Missing data: Decision support requires historical data, which op dbs do not typically

maintain.maintain.• Data consolidation: Decision support requires consolidation (aggregation, summarization) of Data consolidation: Decision support requires consolidation (aggregation, summarization) of

data from many heterogeneous sources: op dbs, external sources. data from many heterogeneous sources: op dbs, external sources. • Data quality: Different sources typically use inconsistent data representations, codes, and Data quality: Different sources typically use inconsistent data representations, codes, and

formats which have to be reconciled.formats which have to be reconciled.

Page 8: Data Warehouse

Data Warehousing MarketData Warehousing Market

Hardware: servers, storage, clientsHardware: servers, storage, clients Warehouse DBMsWarehouse DBMs ToolsTools Market growing from Market growing from

• $2B in 1995 to $8 B in 1998 (Meta Group)$2B in 1995 to $8 B in 1998 (Meta Group)

• 1.5B today to $6.9B in 1999 (Gartner Group)1.5B today to $6.9B in 1999 (Gartner Group)

Systems integration & ConsultingSystems integration & Consulting Already deployed in many industries: manufacturing, Already deployed in many industries: manufacturing,

retail, financial, insurance, transportation, telecom., retail, financial, insurance, transportation, telecom., utilities, healthcare.utilities, healthcare.

Page 9: Data Warehouse

Data Warehousing Data Warehousing ArchitectureArchitecture

Monitoring & Monitoring & AdministrationAdministration

Metadata Metadata RepositoryRepository

ExtractExtractTransformTransformLoadLoadRefreshRefresh

Data Data MartsMarts

External External SourcesSources

OperationOperational dbsal dbs

ServServee

OLAP OLAP serversservers

AnalysisAnalysis

Query/ Query/ ReportinReportingg

Data Data MiningMining

Page 10: Data Warehouse

Three-Tier ArchitectureThree-Tier Architecture

Warehouse database serverWarehouse database server• Almost always a relational DBMS; rarely flat filesAlmost always a relational DBMS; rarely flat files

OLAP serversOLAP servers• Relational OLAP (ROLAP): extended relational DBMS that maps Relational OLAP (ROLAP): extended relational DBMS that maps

operations on multidimensional data to standard relational operations.operations on multidimensional data to standard relational operations.

• Multidimensional OLAP (MOLAP): special purpose server that Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.directly implements multidimensional data and operations.

ClientsClients• Query and reporting tools.Query and reporting tools.

• Analysis toolsAnalysis tools

• Data mining tools (e.g., trend analysis, prediction) Data mining tools (e.g., trend analysis, prediction)

Page 11: Data Warehouse

Data Warehouse vs. Data Data Warehouse vs. Data MartsMarts

Enterprise warehouse: collects all information about subjects Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire (customers, products, sales, assets, personnel) that span the entire organization.organization.• Requires extensive business modelingRequires extensive business modeling

• May take years to design and buildMay take years to design and build Data Marts: Departmental subsets that focus on selected subjects: Data Marts: Departmental subsets that focus on selected subjects:

Marketing data mart: customer, products, sales.Marketing data mart: customer, products, sales.• Faster roll out, but complex integration in the long run.Faster roll out, but complex integration in the long run.

Virtual warehouse: views over operational dbsVirtual warehouse: views over operational dbs• Materialize some summary views for efficient query processingMaterialize some summary views for efficient query processing

• Easier to buildEasier to build

• Requisite excess capcaity on operational db serversRequisite excess capcaity on operational db servers

Page 12: Data Warehouse

Design & Operational Design & Operational ProcessProcess

Define architecture. Do capacity planning.Define architecture. Do capacity planning. Integrate db and OLAP servers, storage and client tools.Integrate db and OLAP servers, storage and client tools. Design warehouse schema, views.Design warehouse schema, views. Design physical warehouse organization: data placement, Design physical warehouse organization: data placement,

partitioning, access methods.partitioning, access methods. Connect sources: gateways, ODBC drivers, wrappers.Connect sources: gateways, ODBC drivers, wrappers. Design & implement scripts for data extract, load refresh.Design & implement scripts for data extract, load refresh. Define metadata and populate repository.Define metadata and populate repository. Design & implement end-user applications.Design & implement end-user applications. Roll out warehouse and applications.Roll out warehouse and applications. Monitor the warehouse.Monitor the warehouse.

Page 13: Data Warehouse

OLAP for Decision SupportOLAP for Decision Support

Goal of OLAP is to support ad-hoc querying for the Goal of OLAP is to support ad-hoc querying for the business analystbusiness analyst

Business analysts are familiar with spreadsheetsBusiness analysts are familiar with spreadsheets Extend spreadsheet analysis model to work with Extend spreadsheet analysis model to work with

warehouse datawarehouse data• Large data setLarge data set

• Semantically enriched to understand business terms (e.g., time, Semantically enriched to understand business terms (e.g., time, geography)geography)

• Combined with reporting featuresCombined with reporting features

Multidimensional Multidimensional view of data is the foundation of OLAP view of data is the foundation of OLAP

Page 14: Data Warehouse

Multidimensional Data Multidimensional Data ModelModel Database is a set ofDatabase is a set of facts facts (points) in a multidimensional space (points) in a multidimensional space A fact has a A fact has a measuremeasure dimension dimension

• quantity that is analyzed, e.g., sale, budgetquantity that is analyzed, e.g., sale, budget

A set of A set of dimensionsdimensions on which data is analyzed on which data is analyzed• e.g. , store, product, date associated with a sale amounte.g. , store, product, date associated with a sale amount

Dimensions form a sparsely populated coordinate systemDimensions form a sparsely populated coordinate system Each dimension has a set of Each dimension has a set of attributesattributes

• e.g., owner city and county of storee.g., owner city and county of store

Attributes of a dimension may be related by partial orderAttributes of a dimension may be related by partial order• HierarchyHierarchy: e.g., street > county >city: e.g., street > county >city

• LatticeLattice: e.g., date> month>year, date>week>year : e.g., date> month>year, date>week>year

Page 15: Data Warehouse

Multidimensional DataMultidimensional Data

1010

4747

3030

1212

JuiceJuice

ColaCola

Milk Milk

CreaCreamm

NYNY

LALA

SFSF

Sales Sales Volume Volume as a as a functiofunction of n of time, time, city city and and productproduct3/1 3/2 3/3 3/1 3/2 3/3

3/43/4

DateDate

Page 16: Data Warehouse

Operations in Operations in Multidimensional Data Multidimensional Data ModelModel

Aggregation (Aggregation (roll-uproll-up))• dimension reduction: e.g., total sales by citydimension reduction: e.g., total sales by city• summarization over aggregate hierarchy: e.g., total sales by summarization over aggregate hierarchy: e.g., total sales by

city and year -> total sales by region and by yearcity and year -> total sales by region and by year

Selection (Selection (sliceslice) defines a subcube) defines a subcube• e.g., sales where city = Palo Alto and date = 1/15/96e.g., sales where city = Palo Alto and date = 1/15/96

Navigation to detailed data (Navigation to detailed data (drill-downdrill-down))• e.g., (sales - expense) by city, top 3% of cities by average e.g., (sales - expense) by city, top 3% of cities by average

incomeincome

Visualization Operations (e.g., Pivot)Visualization Operations (e.g., Pivot)

Page 17: Data Warehouse

A Visual Operation: Pivot A Visual Operation: Pivot (Rotate)(Rotate)

1010

4747

3030

1212

JuiceJuice

ColaCola

Milk Milk

CreaCreamm

NYNY

LALA

SFSF

3/1 3/2 3/3 3/1 3/2 3/3 3/43/4

DateDate

Month

Month

Reg

ion

Reg

ion

ProductProduct

Page 18: Data Warehouse

Approaches to OLAP Approaches to OLAP ServersServers

Relational OLAP (ROLAP)Relational OLAP (ROLAP)• Relational and Specialized Relational DBMS to store and manage warehouse Relational and Specialized Relational DBMS to store and manage warehouse

datadata• OLAP middleware to support missing piecesOLAP middleware to support missing pieces

– Optimize for each DBMS backendOptimize for each DBMS backend

– Aggregation Navigation LogicAggregation Navigation Logic

– Additional tools and servicesAdditional tools and services

• Example: Microstrategy, MetaCube (Informix)Example: Microstrategy, MetaCube (Informix)

Multidimensional OLAP (MOLAP)Multidimensional OLAP (MOLAP)• Array-based storage structuresArray-based storage structures• Direct access to array data structuresDirect access to array data structures• Example: Essbase (Arbor), Accumate (Kenan)Example: Essbase (Arbor), Accumate (Kenan)

Domain-specific enrichmentDomain-specific enrichment

Page 19: Data Warehouse

Relational DBMS as Relational DBMS as Warehouse ServerWarehouse Server

Schema designSchema design Specialized scan, indexing and join techniquesSpecialized scan, indexing and join techniques Handling of aggregate views (querying and Handling of aggregate views (querying and

materialization)materialization) Supporting query language extensions beyond Supporting query language extensions beyond

SQLSQL Complex query processing and optimizationComplex query processing and optimization Data partitioning and parallelismData partitioning and parallelism

Page 20: Data Warehouse

Warehouse Database Warehouse Database SchemaSchema

ER design techniques not appropriateER design techniques not appropriate Design should reflect multidimensional Design should reflect multidimensional

viewview• Star SchemaStar Schema• Snowflake SchemaSnowflake Schema• Fact Constellation SchemaFact Constellation Schema

Page 21: Data Warehouse

Example of a Star SchemaExample of a Star Schema

Order NoOrder No

Order DateOrder Date

Customer NoCustomer No

Customer Customer NameName

Customer Customer AddressAddress

CityCity

SalespersonIDSalespersonID

SalespersonNaSalespersonNameme

CityCity

QuotaQuota

OrderNOOrderNO

SalespersonIDSalespersonID

CustomerNOCustomerNO

ProdNoProdNo

DateKeyDateKey

CityNameCityName

QuantityQuantity

Total Price

ProductNOProductNO

ProdNameProdName

ProdDescrProdDescr

CategoryCategory

CategoryDescriptionCategoryDescription

UnitPriceUnitPrice

DateKeyDateKey

DateDate

CityNameCityName

StateState

CountryCountry

OrderOrder

CustomerCustomer

SalespersoSalespersonn

CityCity

DateDate

ProductProduct

Fact TableFact Table

Page 22: Data Warehouse

Star SchemaStar Schema

A single fact table and a single table for each dimensionA single fact table and a single table for each dimension Every fact points to one tuple in each of the dimensions Every fact points to one tuple in each of the dimensions

and has additional attributesand has additional attributes Does not capture hierarchies directlyDoes not capture hierarchies directly Generated keys are used for performance and maintenance Generated keys are used for performance and maintenance

reasonsreasons Fact constellation: Multiple Fact tables that share many Fact constellation: Multiple Fact tables that share many

dimension tablesdimension tables• Example: Projected expense and the actual expense may share Example: Projected expense and the actual expense may share

dimensional tablesdimensional tables

Page 23: Data Warehouse

Example of a Snowflake Example of a Snowflake SchemaSchema

Order NoOrder No

Order DateOrder Date

Customer NoCustomer No

Customer Customer NameName

Customer Customer AddressAddress

CityCity

SalespersonIDSalespersonID

SalespersonNaSalespersonNameme

CityCity

QuotaQuota

OrderNOOrderNO

SalespersonIDSalespersonID

CustomerNOCustomerNO

ProdNoProdNo

DateKeyDateKey

CityNameCityName

QuantityQuantity

Total Price

ProductNOProductNO

ProdNameProdName

ProdDescrProdDescr

CategoryCategory

CategoryCategory

UnitPriceUnitPrice

DateKeyDateKey

DateDate

MonthMonth

CityNameCityName

StateState

CountryCountry

OrderOrder

CustomerCustomer

SalespersoSalespersonn

CityCity

DateDate

ProductProduct

Fact TableFact Table

CategoryNaCategoryNameme

CategoryDeCategoryDescrscr

MontMonthh

YearYear YearYear

StateNameStateName

CountryCountry

CategoryCategory

StateState

MonthMonthYearYear

Page 24: Data Warehouse

Snowflake SchemaSnowflake Schema

Represent dimensional hierarchy directly by Represent dimensional hierarchy directly by normalizing the dimension tablesnormalizing the dimension tables

Easy to maintainEasy to maintain Saves storage, but is alleged that it reduces Saves storage, but is alleged that it reduces

effectiveness of browsing (Kimball)effectiveness of browsing (Kimball)

Page 25: Data Warehouse

Indexing TechniquesIndexing Techniques

Exploiting indexes to reduce scanning of Exploiting indexes to reduce scanning of data is of crucial importancedata is of crucial importance

Bitmap IndexesBitmap Indexes Join IndexesJoin Indexes Other IssuesOther Issues

• Text indexingText indexing• Parallelizing and sequencing of index builds and Parallelizing and sequencing of index builds and

incremental updatesincremental updates

Page 26: Data Warehouse

BitMap IndexesBitMap Indexes

An alternative representation of RID-listAn alternative representation of RID-list Specially advantageous for low-cardinality domainsSpecially advantageous for low-cardinality domains Represent each row of a table by a bit and the table Represent each row of a table by a bit and the table

as a bit vectoras a bit vector There is a distinct bit vector Bv for each value v for There is a distinct bit vector Bv for each value v for

the domainthe domain Example: the attribute sex has values M and F. A Example: the attribute sex has values M and F. A

table of 100 million people needs 2 lists of 100 table of 100 million people needs 2 lists of 100 million bitsmillion bits

Page 27: Data Warehouse

Bit Map IndexBit Map Index

Cust Region RatingC1 N HC2 S MC3 W LC4 W HC5 S LC6 W LC7 N H

Base Base TableTable

Row ID N S E W1 1 0 0 02 0 1 0 03 0 0 0 14 0 0 0 15 0 1 0 06 0 0 0 17 1 0 0 0

Row ID H M L1 1 0 02 0 1 03 0 0 04 0 0 05 0 1 06 0 0 07 1 0 0

Rating IndexRating Index

Region Region IndexIndex

Customers whereCustomers where Region = Region = WW

Rating = Rating = 11AndAnd

Page 28: Data Warehouse

BitMap IndexesBitMap Indexes

Comparison, join and aggregation operations are Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic reduced to bit arithmetic with dramatic improvement in processing timeimprovement in processing time

Significant reduction in space and I/O (30:1)Significant reduction in space and I/O (30:1) Adapted for higher cardinality domains as well.Adapted for higher cardinality domains as well. Compression (e.g., run-length encoding) exploitedCompression (e.g., run-length encoding) exploited Products that support bitmaps: Model 204, Products that support bitmaps: Model 204,

TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3

Page 29: Data Warehouse

Issues in Handling of Issues in Handling of Aggregate ViewsAggregate Views

Important component for ROLAP ServersImportant component for ROLAP Servers Representation in the context of star schemaRepresentation in the context of star schema

• Query ExpressionsQuery Expressions• Materialized ViewsMaterialized Views

Logic for Aggregation NavigationLogic for Aggregation Navigation• make optimum use of materialized aggregates to answer a make optimum use of materialized aggregates to answer a

queryquery

Choice of aggregate views to materializeChoice of aggregate views to materialize HP Intelligent Warehouse pioneered some of the HP Intelligent Warehouse pioneered some of the

techniquestechniques

Page 30: Data Warehouse

SQL Extensions for Front SQL Extensions for Front End ToolsEnd Tools

Extended Family of Aggregate functionsExtended Family of Aggregate functions• rank (top 10) and N-Tile (“top 30%” of all products)rank (top 10) and N-Tile (“top 30%” of all products)

• Median, mode…..Median, mode…..

Reporting FeaturesReporting Features• running total, cumulative totalsrunning total, cumulative totals

Results of multiple group by:Results of multiple group by:• total sales by month and total sales by producttotal sales by month and total sales by product

SQL comes in the way of sequential processing and SQL comes in the way of sequential processing and columnar aggregationscolumnar aggregations• changes in total sale from 1994 to 1996, aggregated by brandchanges in total sale from 1994 to 1996, aggregated by brand

Page 31: Data Warehouse

Query Processing in Query Processing in MOLAP ServersMOLAP Servers

The storage model is an n-dimensional arrayThe storage model is an n-dimensional array Front end multidimensional queries map to Front end multidimensional queries map to

server capabilities in a straightforward wayserver capabilities in a straightforward way Direct Addressing abilitiesDirect Addressing abilities A straightforward array representation has A straightforward array representation has

good indexing properties but very poor good indexing properties but very poor storage utilization when the data is sparsestorage utilization when the data is sparse

Page 32: Data Warehouse

Query Processing in Query Processing in MOLAP ServersMOLAP Servers

2-dimensional dense arrays indexed by 2-dimensional dense arrays indexed by B-TreesB-Trees

Traditional Traditional indexing indexing structurestructure

2-dimensional 2-dimensional dense arraysdense arrays

Page 33: Data Warehouse

Population & Refreshing Population & Refreshing the Warehousethe Warehouse

Data extractionData extraction Data cleaningData cleaning Data transformationData transformation

• Convert from legacy/host format to warehouse formatConvert from legacy/host format to warehouse format

Load Load • Sort, summarize, consolidate, compute views, check Sort, summarize, consolidate, compute views, check

integrity, build indexes, partitionintegrity, build indexes, partition

RefreshRefresh• Propogate updates from sources to the warehousePropogate updates from sources to the warehouse

Page 34: Data Warehouse

Data CleaningData Cleaning

Why?Why?• Data warehouse contains data that is analyzed for business decisionsData warehouse contains data that is analyzed for business decisions

• More data and multiple sources could mean more errors in the data More data and multiple sources could mean more errors in the data and harder to trace such errorsand harder to trace such errors

• Results in incorrect analysisResults in incorrect analysis

Detecting data anomalies and rectifying them early has huge Detecting data anomalies and rectifying them early has huge payoffspayoffs

Important to identify tools that work together wellImportant to identify tools that work together well Long Term SolutionLong Term Solution

• Change business practices and data entry toolsChange business practices and data entry tools

• Repository for meta-dataRepository for meta-data

Page 35: Data Warehouse

Data Cleaning TechniquesData Cleaning Techniques

Transformation RulesTransformation Rules• Example: translate “gender” to “sex”Example: translate “gender” to “sex”

• Products: Warehouse Manger (Prism), Extract (ETI)Products: Warehouse Manger (Prism), Extract (ETI)

Uses domain-specific knowledge to do scrubbingUses domain-specific knowledge to do scrubbing Parsing and fuzzy matchingParsing and fuzzy matching

• Multiple data sources (can designate a preferred source)Multiple data sources (can designate a preferred source)

• Products: Integrity (Vality), TrillumProducts: Integrity (Vality), Trillum

Discover facts that flag unusual patterns (auditing)Discover facts that flag unusual patterns (auditing)• Some dealer has never received a single complaintSome dealer has never received a single complaint

• Products: QDB, SBStar, WizRuleProducts: QDB, SBStar, WizRule

Page 36: Data Warehouse

LoadLoad

Issues:Issues:• huge volumes of data to be loadedhuge volumes of data to be loaded• small time window (usually at night) when the warehouse can be taken off-small time window (usually at night) when the warehouse can be taken off-

lineline• When to build indexes and summary tablesWhen to build indexes and summary tables• allow system administrator to monitor status, cancel suspend, resume load, allow system administrator to monitor status, cancel suspend, resume load,

or change load rateor change load rate• restart after failure with no loss of data integrityrestart after failure with no loss of data integrity

Techniques:Techniques:• batch load utility: sort input records on clustering key and use sequential batch load utility: sort input records on clustering key and use sequential

I/O; build indexes and derived tablesI/O; build indexes and derived tables• sequential loads still too long (~100 days for TB)sequential loads still too long (~100 days for TB)• use parallelism and incremental techniquesuse parallelism and incremental techniques

Page 37: Data Warehouse

RefreshRefresh

Issues:Issues:• when to refreshwhen to refresh

– on every update: too expensive, only necessary if OLAP on every update: too expensive, only necessary if OLAP queries need current data queries need current data (e.g., up-the-minute stock quotes)(e.g., up-the-minute stock quotes)

– periodically (e.g., every 24 hours, every week) or after periodically (e.g., every 24 hours, every week) or after “significant” events“significant” events

– refresh policy set by administrator based on user needs and refresh policy set by administrator based on user needs and traffictraffic

– possibly different policies for different sourcespossibly different policies for different sources

• how to refreshhow to refresh

Page 38: Data Warehouse

Refresh TechniquesRefresh Techniques

Full extract from base tablesFull extract from base tables• read entire source table or database: expensiveread entire source table or database: expensive• may be the only choice for legacy databases or files.may be the only choice for legacy databases or files.

Incremental techniques (related to work on active dbs)Incremental techniques (related to work on active dbs)• detect & propagate changes on base tables: replication servers (e.g., detect & propagate changes on base tables: replication servers (e.g.,

Sybase, Oracle, IBM Data Propagator)Sybase, Oracle, IBM Data Propagator)– snapshots & triggers (Oracle)snapshots & triggers (Oracle)– transaction shipping (Sybase)transaction shipping (Sybase)

• Logical correctnessLogical correctness– computing changes to star tablescomputing changes to star tables– computing changes to derived and summary tablescomputing changes to derived and summary tables– optimization: only significant changesoptimization: only significant changes

• transactional correctness: incremental loadtransactional correctness: incremental load

Page 39: Data Warehouse

Metadata RepositoryMetadata Repository

Administrative metadataAdministrative metadata• source databases and their contentssource databases and their contents• gateway descriptionsgateway descriptions• warehouse schema, view & derived data definitionswarehouse schema, view & derived data definitions• dimensions, hierarchiesdimensions, hierarchies• pre-defined queries and reportspre-defined queries and reports• data mart locations and contentsdata mart locations and contents• data partitionsdata partitions• data extraction, cleansing, transformation rules, defaultsdata extraction, cleansing, transformation rules, defaults• data refresh and purging rulesdata refresh and purging rules• user profiles, user groupsuser profiles, user groups• security: user authorization, access controlsecurity: user authorization, access control

Page 40: Data Warehouse

Metdata Repository .. 2Metdata Repository .. 2

Business dataBusiness data• business terms and definitionsbusiness terms and definitions• ownership of dataownership of data• charging policiescharging policies

operational metadataoperational metadata• data lineage: history of migrated data and sequence of data lineage: history of migrated data and sequence of

transformations appliedtransformations applied• currency of data: active, archived, purgedcurrency of data: active, archived, purged• monitoring information: warehouse usage statistics, error monitoring information: warehouse usage statistics, error

reports, audit trails.reports, audit trails.

Page 41: Data Warehouse

Warehouse Design ToolsWarehouse Design Tools

Creating and managing a warehouse is hard.Creating and managing a warehouse is hard. Development toolsDevelopment tools

• defining & editing metadata repository contents (schemas, scripts, defining & editing metadata repository contents (schemas, scripts, rules).rules).

• Queries and reportsQueries and reports

• Shipping metadata to and from RDBMS catalogue (e.g., Prism Shipping metadata to and from RDBMS catalogue (e.g., Prism Warehouse Manager).Warehouse Manager).

Planning & analysis toolsPlanning & analysis tools• impact of schema changesimpact of schema changes

• capacity planningcapacity planning

• refresh performance: changing refresh rates or time windowsrefresh performance: changing refresh rates or time windows

Page 42: Data Warehouse

Warehouse Management Warehouse Management ToolsTools

Monitoring and reporting tools (e.g., HP Intelligent Warehouse Monitoring and reporting tools (e.g., HP Intelligent Warehouse Advisor)Advisor)• which partitions, summary tables, columns are used which partitions, summary tables, columns are used • query execution timesquery execution times• for summary tables, types & frequencies of roll downsfor summary tables, types & frequencies of roll downs• warehouse usage over time (detect peak periods)warehouse usage over time (detect peak periods)

Systems and network management tools (e.g., HP OpenView, Systems and network management tools (e.g., HP OpenView, IBM NetView, Tivoli): traffic, utilizationIBM NetView, Tivoli): traffic, utilization

Exception reporting/alerting tools 9e.g., DB2 Event Alerters, Exception reporting/alerting tools 9e.g., DB2 Event Alerters, Information Advantage InfoAgents & InfoAlert)Information Advantage InfoAgents & InfoAlert)• runaway queriesrunaway queries

Analysis/Visualization tools: OLAP on metadataAnalysis/Visualization tools: OLAP on metadata

Page 43: Data Warehouse

State of Commercial State of Commercial PracticePractice

Products and Vendors [Datamation, May 15, 1996; R.C. Barquin, H.A. Edelstein: Planning and Designin Products and Vendors [Datamation, May 15, 1996; R.C. Barquin, H.A. Edelstein: Planning and Designin gthe Data Warehous. Prentice Hall. 1997]gthe Data Warehous. Prentice Hall. 1997]

• Connectivity to sourcesConnectivity to sources– ApertusApertus CA-Ingres GatewayCA-Ingres Gateway– Information Builders EDA/SQLInformation Builders EDA/SQL IBM Data JionerIBM Data Jioner– Informix Enterprise GatewayInformix Enterprise Gateway Microsoft ODBCMicrosoft ODBC– Oracle Open ConnectOracle Open Connect Platinum InfohubPlatinum Infohub– SAS ConnectSAS Connect Software AG EntireSoftware AG Entire– Sybase Enterprise ConnectSybase Enterprise Connect Trinzic InfoHubTrinzic InfoHub

• Data extract, clean, transfomr, refreshData extract, clean, transfomr, refresh– CA-Ingres ReplicatorCA-Ingres Replicator Carleton PassportCarleton Passport– Evolutionary Tech Inc. ETI-ExtractEvolutionary Tech Inc. ETI-Extract Harte-Hanks TrilliumHarte-Hanks Trillium– IBM Data Joiner, Data PropagatorIBM Data Joiner, Data Propagator Oracle 7Oracle 7– Platinum InfoRefiner, InfroPumpPlatinum InfoRefiner, InfroPump Praxis OmniReplicatorPraxis OmniReplicator– Prism Warehouse ManagerPrism Warehouse Manager Redbrick TMURedbrick TMU– SAS AccessSAS Access Software AG SouorcepointSoftware AG Souorcepoint– Sybase Replication ServerSybase Replication Server Trinzic InfoPumpTrinzic InfoPump

Page 44: Data Warehouse

State of Commercial State of Commercial Practice..2Practice..2

Multidimensional Database EnginesMultidimensional Database EnginesArbor EssbaseArbor Essbase Comshare Commander OLAPComshare Commander OLAP

Oracle IRI ExpressOracle IRI Express SAS SystemSAS System Warehouse Data ServersWarehouse Data Servers

• CA-IngresCA-Ingres IBM DB2IBM DB2

• Information Builders FocusInformation Builders Focus InformixInformix

• OracleOracle Praxiz Model 204Praxiz Model 204

• RedbrickRedbrick Software AG ADABASSoftware AG ADABAS

• Sybase MPPSybase MPP TandemTandem

• TerdataTerdata ROLAP ServersROLAP Servers

• HP Intelligent WarehouseHP Intelligent Warehouse Information Advantage AsxysInformation Advantage Asxys

• Informix MetacubeInformix Metacube MicroStrategy DSS ServerMicroStrategy DSS Server

Page 45: Data Warehouse

State of Commercial State of Commercial Practice..3Practice..3

Query/Reporting EnvironmentsQuery/Reporting EnvironmentsBrio/QueryBrio/Query Business ObjectsBusiness Objects

Cognos ImpromptuCognos Impromptu CA Visual ExpressCA Visual Express

IBM DataGuideIBM DataGuide Information Builders Focus SixInformation Builders Focus Six

Informix ViewPointInformix ViewPoint Platinum Forest & TreesPlatinum Forest & Trees

SAS AccessSAS Access Software AG EsperantSoftware AG Esperant Multidimensional AnalysisMultidimensional Analysis

Andydne PabloAndydne Pablo Arbor Essbase Analysis ServerArbor Essbase Analysis Server

Business ObjectsBusiness Objects Cognos PowerPlayCognos PowerPlay

Dimensional Insight Cross TargetDimensional Insight Cross Target Holistic Systems HOLOSHolistic Systems HOLOS

Information Advantage Decision SuiteInformation Advantage Decision Suite IQ Software IQ/VisionIQ Software IQ/Vision

Kenan System AcumateKenan System Acumate Lotus 123Lotus 123

Microsoft ExcelMicrosoft Excel Microstrategy DSSMicrostrategy DSS

Pilot LightshipPilot Lightship Platinum Forest & TreesPlatinum Forest & Trees

Prodea BeaconProdea Beacon SAS OLAP ++SAS OLAP ++

Stanford Technology Group MetacubeStanford Technology Group Metacube

Page 46: Data Warehouse

State of Commercial State of Commercial Practice..4Practice..4

Metadata ManagementMetadata Management• HP Intelligent WarehouseHP Intelligent Warehouse IBM Data GuideIBM Data Guide

• Platinum Repository Platinum Repository Prism Directory ManagerPrism Directory Manager System ManagementSystem Management

• CA UnicenterCA Unicenter HP OpenViewHP OpenView

• IBM DataHub, NetViewIBM DataHub, NetView Information Builder Site AnalyzerInformation Builder Site Analyzer

• Prism Warehouse ManagerPrism Warehouse Manager SAS CPESAS CPE

• TivoliTivoli Software AG Source PointSoftware AG Source Point

• Redbrick Enterprise Control and CoordinationRedbrick Enterprise Control and Coordination Process Management Process Management

• At& T TOPENDAt& T TOPEND HP Intelligent WarehouseHP Intelligent Warehouse

• IBM FlowMarkIBM FlowMark Platinum RepositoryPlatinum Repository

• Prism Warehouse ManagerPrism Warehouse Manager Software AG Source PointSoftware AG Source Point Systems integration and consultingSystems integration and consulting

Page 47: Data Warehouse

Research IssuesResearch Issues

Data cleaningData cleaning• focus on data inconsistencies, not schema differencesfocus on data inconsistencies, not schema differences

• data mining techniquesdata mining techniques Physical DesignPhysical Design

• design of summary tables, partitions, indexesdesign of summary tables, partitions, indexes

• tradeoffs in use of different indexestradeoffs in use of different indexes Query processingQuery processing

• selecting appropriate summary tablesselecting appropriate summary tables

• dynamic optimization with feedbackdynamic optimization with feedback

• acid test for query optimization: cost estimation, use of transformations, acid test for query optimization: cost estimation, use of transformations, search strategiessearch strategies

• partitioning query processing between OLAP server and backend server.partitioning query processing between OLAP server and backend server.

Page 48: Data Warehouse

Research Issues .. 2Research Issues .. 2

Warehouse ManagementWarehouse Management• detecting runaway queriesdetecting runaway queries

• resource managementresource management

• incremental refresh techniquesincremental refresh techniques

• computing summary tables during loadcomputing summary tables during load

• failure recovery during load and refreshfailure recovery during load and refresh

• process management: scheduling queries, load and process management: scheduling queries, load and refreshrefresh

• use of workflow technology for process managementuse of workflow technology for process management