data warehouse
DESCRIPTION
TRANSCRIPT
Decision Support, Data Decision Support, Data Warehousing, and OLAPWarehousing, and OLAP
Anindya DattaAnindya Datta
Director, iXL Center for E-Director, iXL Center for E-CommerceCommerce
Georgia Institute of Georgia Institute of TechnologyTechnology
[email protected]@cc.gatech.edu
OutlineOutline
Terminology: OLAP vs. OLTPTerminology: OLAP vs. OLTP Data Warehousing ArchitectureData Warehousing Architecture TechnologiesTechnologies ProductsProducts Research IssuesResearch Issues ReferencesReferences
Decision Support and Decision Support and OLAPOLAP
Information technology to help the knowledge worker Information technology to help the knowledge worker (executive, manager, analyst) make faster and better (executive, manager, analyst) make faster and better decisions.decisions.• What were the sales volumes by region and product category What were the sales volumes by region and product category
for the last year?for the last year?
• How did the share price of computer manufacturers correlate How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?with quarterly profits over the past 10 years?
• Which orders should we fill to maximize revenues?Which orders should we fill to maximize revenues?
• Will a 10% discount increase sales volume sufficiently?Will a 10% discount increase sales volume sufficiently?
• Which of two new medications will result in the best Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay?outcome: higher recovery rate & shorter hospital stay?
On-Line Analytical Processing (OLAP) is an element of On-Line Analytical Processing (OLAP) is an element of decision support systmes (DSS). decision support systmes (DSS).
EvolutionEvolution
60’s: Batch reports60’s: Batch reports• hard to find and analyze informationhard to find and analyze information
• inflexible and expensive, reprogram every new requestinflexible and expensive, reprogram every new request
70’s: Terminal-based DSS and EIS (executive information 70’s: Terminal-based DSS and EIS (executive information systems)systems)• still inflexible, not integrated with desktop toolsstill inflexible, not integrated with desktop tools
80’s: Desktop data access and analysis tools80’s: Desktop data access and analysis tools• query tools, spreadsheets, GUIsquery tools, spreadsheets, GUIs
• easier to use, but only access operational databaseseasier to use, but only access operational databases
90’s: Data warehousing with integrated OLAP engines and 90’s: Data warehousing with integrated OLAP engines and toolstools
OLTP vs. OLAPOLTP vs. OLAP
Clerk, IT ProfessionalClerk, IT Professional Day to day operationsDay to day operations
Application-oriented (E-R Application-oriented (E-R based)based)
Current, IsolatedCurrent, Isolated Detailed, Flat relationalDetailed, Flat relational Structured, RepetitiveStructured, Repetitive Short, Simple transactionShort, Simple transaction Read/writeRead/write Index/hash on prim. KeyIndex/hash on prim. Key TensTens ThousandsThousands 100 MB-GB100 MB-GB Trans. throughputTrans. throughput
Knowledge workerKnowledge worker Decision supportDecision support
Subject-oriented (Star, Subject-oriented (Star, snowflake)snowflake)
Historical, ConsolidatedHistorical, Consolidated Summarized, MultidimensionalSummarized, Multidimensional Ad hocAd hoc Complex queryComplex query Read MostlyRead Mostly Lots of ScansLots of Scans MillionsMillions HundredsHundreds 100GB-TB100GB-TB Query throughput, responseQuery throughput, response
User
Function
DB Design
Data
View
Usage
Unit of work
Access
Operations
# Records accessed
#Users
Db size
Metric
OLTPOLTP OLAPOLAP
Data WarehouseData Warehouse
A decision support database that is maintained separately A decision support database that is maintained separately from the organization’s operational databases.from the organization’s operational databases.
A data warehouse is a A data warehouse is a
• subject-oriented,subject-oriented,
• integrated,integrated,
• time-varying,time-varying,
• non-volatilenon-volatile
collection of data that is used primarily in organizational collection of data that is used primarily in organizational decision makingdecision making
Why Separate Data Why Separate Data Warehouse?Warehouse?
PerformancePerformance• Op dbs designed & tuned for known txs & workloads.Op dbs designed & tuned for known txs & workloads.
• Complex OLAP queries would degrade perf. For op txs.Complex OLAP queries would degrade perf. For op txs.
• Special data organization, access & implementation methods Special data organization, access & implementation methods needed for multidimensional views & queries. needed for multidimensional views & queries.
FunctionFunction• Missing data: Decision support requires historical data, which op dbs do not typically Missing data: Decision support requires historical data, which op dbs do not typically
maintain.maintain.• Data consolidation: Decision support requires consolidation (aggregation, summarization) of Data consolidation: Decision support requires consolidation (aggregation, summarization) of
data from many heterogeneous sources: op dbs, external sources. data from many heterogeneous sources: op dbs, external sources. • Data quality: Different sources typically use inconsistent data representations, codes, and Data quality: Different sources typically use inconsistent data representations, codes, and
formats which have to be reconciled.formats which have to be reconciled.
Data Warehousing MarketData Warehousing Market
Hardware: servers, storage, clientsHardware: servers, storage, clients Warehouse DBMsWarehouse DBMs ToolsTools Market growing from Market growing from
• $2B in 1995 to $8 B in 1998 (Meta Group)$2B in 1995 to $8 B in 1998 (Meta Group)
• 1.5B today to $6.9B in 1999 (Gartner Group)1.5B today to $6.9B in 1999 (Gartner Group)
Systems integration & ConsultingSystems integration & Consulting Already deployed in many industries: manufacturing, Already deployed in many industries: manufacturing,
retail, financial, insurance, transportation, telecom., retail, financial, insurance, transportation, telecom., utilities, healthcare.utilities, healthcare.
Data Warehousing Data Warehousing ArchitectureArchitecture
Monitoring & Monitoring & AdministrationAdministration
Metadata Metadata RepositoryRepository
ExtractExtractTransformTransformLoadLoadRefreshRefresh
Data Data MartsMarts
External External SourcesSources
OperationOperational dbsal dbs
ServServee
OLAP OLAP serversservers
AnalysisAnalysis
Query/ Query/ ReportinReportingg
Data Data MiningMining
Three-Tier ArchitectureThree-Tier Architecture
Warehouse database serverWarehouse database server• Almost always a relational DBMS; rarely flat filesAlmost always a relational DBMS; rarely flat files
OLAP serversOLAP servers• Relational OLAP (ROLAP): extended relational DBMS that maps Relational OLAP (ROLAP): extended relational DBMS that maps
operations on multidimensional data to standard relational operations.operations on multidimensional data to standard relational operations.
• Multidimensional OLAP (MOLAP): special purpose server that Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.directly implements multidimensional data and operations.
ClientsClients• Query and reporting tools.Query and reporting tools.
• Analysis toolsAnalysis tools
• Data mining tools (e.g., trend analysis, prediction) Data mining tools (e.g., trend analysis, prediction)
Data Warehouse vs. Data Data Warehouse vs. Data MartsMarts
Enterprise warehouse: collects all information about subjects Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire (customers, products, sales, assets, personnel) that span the entire organization.organization.• Requires extensive business modelingRequires extensive business modeling
• May take years to design and buildMay take years to design and build Data Marts: Departmental subsets that focus on selected subjects: Data Marts: Departmental subsets that focus on selected subjects:
Marketing data mart: customer, products, sales.Marketing data mart: customer, products, sales.• Faster roll out, but complex integration in the long run.Faster roll out, but complex integration in the long run.
Virtual warehouse: views over operational dbsVirtual warehouse: views over operational dbs• Materialize some summary views for efficient query processingMaterialize some summary views for efficient query processing
• Easier to buildEasier to build
• Requisite excess capcaity on operational db serversRequisite excess capcaity on operational db servers
Design & Operational Design & Operational ProcessProcess
Define architecture. Do capacity planning.Define architecture. Do capacity planning. Integrate db and OLAP servers, storage and client tools.Integrate db and OLAP servers, storage and client tools. Design warehouse schema, views.Design warehouse schema, views. Design physical warehouse organization: data placement, Design physical warehouse organization: data placement,
partitioning, access methods.partitioning, access methods. Connect sources: gateways, ODBC drivers, wrappers.Connect sources: gateways, ODBC drivers, wrappers. Design & implement scripts for data extract, load refresh.Design & implement scripts for data extract, load refresh. Define metadata and populate repository.Define metadata and populate repository. Design & implement end-user applications.Design & implement end-user applications. Roll out warehouse and applications.Roll out warehouse and applications. Monitor the warehouse.Monitor the warehouse.
OLAP for Decision SupportOLAP for Decision Support
Goal of OLAP is to support ad-hoc querying for the Goal of OLAP is to support ad-hoc querying for the business analystbusiness analyst
Business analysts are familiar with spreadsheetsBusiness analysts are familiar with spreadsheets Extend spreadsheet analysis model to work with Extend spreadsheet analysis model to work with
warehouse datawarehouse data• Large data setLarge data set
• Semantically enriched to understand business terms (e.g., time, Semantically enriched to understand business terms (e.g., time, geography)geography)
• Combined with reporting featuresCombined with reporting features
Multidimensional Multidimensional view of data is the foundation of OLAP view of data is the foundation of OLAP
Multidimensional Data Multidimensional Data ModelModel Database is a set ofDatabase is a set of facts facts (points) in a multidimensional space (points) in a multidimensional space A fact has a A fact has a measuremeasure dimension dimension
• quantity that is analyzed, e.g., sale, budgetquantity that is analyzed, e.g., sale, budget
A set of A set of dimensionsdimensions on which data is analyzed on which data is analyzed• e.g. , store, product, date associated with a sale amounte.g. , store, product, date associated with a sale amount
Dimensions form a sparsely populated coordinate systemDimensions form a sparsely populated coordinate system Each dimension has a set of Each dimension has a set of attributesattributes
• e.g., owner city and county of storee.g., owner city and county of store
Attributes of a dimension may be related by partial orderAttributes of a dimension may be related by partial order• HierarchyHierarchy: e.g., street > county >city: e.g., street > county >city
• LatticeLattice: e.g., date> month>year, date>week>year : e.g., date> month>year, date>week>year
Multidimensional DataMultidimensional Data
1010
4747
3030
1212
JuiceJuice
ColaCola
Milk Milk
CreaCreamm
NYNY
LALA
SFSF
Sales Sales Volume Volume as a as a functiofunction of n of time, time, city city and and productproduct3/1 3/2 3/3 3/1 3/2 3/3
3/43/4
DateDate
Operations in Operations in Multidimensional Data Multidimensional Data ModelModel
Aggregation (Aggregation (roll-uproll-up))• dimension reduction: e.g., total sales by citydimension reduction: e.g., total sales by city• summarization over aggregate hierarchy: e.g., total sales by summarization over aggregate hierarchy: e.g., total sales by
city and year -> total sales by region and by yearcity and year -> total sales by region and by year
Selection (Selection (sliceslice) defines a subcube) defines a subcube• e.g., sales where city = Palo Alto and date = 1/15/96e.g., sales where city = Palo Alto and date = 1/15/96
Navigation to detailed data (Navigation to detailed data (drill-downdrill-down))• e.g., (sales - expense) by city, top 3% of cities by average e.g., (sales - expense) by city, top 3% of cities by average
incomeincome
Visualization Operations (e.g., Pivot)Visualization Operations (e.g., Pivot)
A Visual Operation: Pivot A Visual Operation: Pivot (Rotate)(Rotate)
1010
4747
3030
1212
JuiceJuice
ColaCola
Milk Milk
CreaCreamm
NYNY
LALA
SFSF
3/1 3/2 3/3 3/1 3/2 3/3 3/43/4
DateDate
Month
Month
Reg
ion
Reg
ion
ProductProduct
Approaches to OLAP Approaches to OLAP ServersServers
Relational OLAP (ROLAP)Relational OLAP (ROLAP)• Relational and Specialized Relational DBMS to store and manage warehouse Relational and Specialized Relational DBMS to store and manage warehouse
datadata• OLAP middleware to support missing piecesOLAP middleware to support missing pieces
– Optimize for each DBMS backendOptimize for each DBMS backend
– Aggregation Navigation LogicAggregation Navigation Logic
– Additional tools and servicesAdditional tools and services
• Example: Microstrategy, MetaCube (Informix)Example: Microstrategy, MetaCube (Informix)
Multidimensional OLAP (MOLAP)Multidimensional OLAP (MOLAP)• Array-based storage structuresArray-based storage structures• Direct access to array data structuresDirect access to array data structures• Example: Essbase (Arbor), Accumate (Kenan)Example: Essbase (Arbor), Accumate (Kenan)
Domain-specific enrichmentDomain-specific enrichment
Relational DBMS as Relational DBMS as Warehouse ServerWarehouse Server
Schema designSchema design Specialized scan, indexing and join techniquesSpecialized scan, indexing and join techniques Handling of aggregate views (querying and Handling of aggregate views (querying and
materialization)materialization) Supporting query language extensions beyond Supporting query language extensions beyond
SQLSQL Complex query processing and optimizationComplex query processing and optimization Data partitioning and parallelismData partitioning and parallelism
Warehouse Database Warehouse Database SchemaSchema
ER design techniques not appropriateER design techniques not appropriate Design should reflect multidimensional Design should reflect multidimensional
viewview• Star SchemaStar Schema• Snowflake SchemaSnowflake Schema• Fact Constellation SchemaFact Constellation Schema
Example of a Star SchemaExample of a Star Schema
Order NoOrder No
Order DateOrder Date
Customer NoCustomer No
Customer Customer NameName
Customer Customer AddressAddress
CityCity
SalespersonIDSalespersonID
SalespersonNaSalespersonNameme
CityCity
QuotaQuota
OrderNOOrderNO
SalespersonIDSalespersonID
CustomerNOCustomerNO
ProdNoProdNo
DateKeyDateKey
CityNameCityName
QuantityQuantity
Total Price
ProductNOProductNO
ProdNameProdName
ProdDescrProdDescr
CategoryCategory
CategoryDescriptionCategoryDescription
UnitPriceUnitPrice
DateKeyDateKey
DateDate
CityNameCityName
StateState
CountryCountry
OrderOrder
CustomerCustomer
SalespersoSalespersonn
CityCity
DateDate
ProductProduct
Fact TableFact Table
Star SchemaStar Schema
A single fact table and a single table for each dimensionA single fact table and a single table for each dimension Every fact points to one tuple in each of the dimensions Every fact points to one tuple in each of the dimensions
and has additional attributesand has additional attributes Does not capture hierarchies directlyDoes not capture hierarchies directly Generated keys are used for performance and maintenance Generated keys are used for performance and maintenance
reasonsreasons Fact constellation: Multiple Fact tables that share many Fact constellation: Multiple Fact tables that share many
dimension tablesdimension tables• Example: Projected expense and the actual expense may share Example: Projected expense and the actual expense may share
dimensional tablesdimensional tables
Example of a Snowflake Example of a Snowflake SchemaSchema
Order NoOrder No
Order DateOrder Date
Customer NoCustomer No
Customer Customer NameName
Customer Customer AddressAddress
CityCity
SalespersonIDSalespersonID
SalespersonNaSalespersonNameme
CityCity
QuotaQuota
OrderNOOrderNO
SalespersonIDSalespersonID
CustomerNOCustomerNO
ProdNoProdNo
DateKeyDateKey
CityNameCityName
QuantityQuantity
Total Price
ProductNOProductNO
ProdNameProdName
ProdDescrProdDescr
CategoryCategory
CategoryCategory
UnitPriceUnitPrice
DateKeyDateKey
DateDate
MonthMonth
CityNameCityName
StateState
CountryCountry
OrderOrder
CustomerCustomer
SalespersoSalespersonn
CityCity
DateDate
ProductProduct
Fact TableFact Table
CategoryNaCategoryNameme
CategoryDeCategoryDescrscr
MontMonthh
YearYear YearYear
StateNameStateName
CountryCountry
CategoryCategory
StateState
MonthMonthYearYear
Snowflake SchemaSnowflake Schema
Represent dimensional hierarchy directly by Represent dimensional hierarchy directly by normalizing the dimension tablesnormalizing the dimension tables
Easy to maintainEasy to maintain Saves storage, but is alleged that it reduces Saves storage, but is alleged that it reduces
effectiveness of browsing (Kimball)effectiveness of browsing (Kimball)
Indexing TechniquesIndexing Techniques
Exploiting indexes to reduce scanning of Exploiting indexes to reduce scanning of data is of crucial importancedata is of crucial importance
Bitmap IndexesBitmap Indexes Join IndexesJoin Indexes Other IssuesOther Issues
• Text indexingText indexing• Parallelizing and sequencing of index builds and Parallelizing and sequencing of index builds and
incremental updatesincremental updates
BitMap IndexesBitMap Indexes
An alternative representation of RID-listAn alternative representation of RID-list Specially advantageous for low-cardinality domainsSpecially advantageous for low-cardinality domains Represent each row of a table by a bit and the table Represent each row of a table by a bit and the table
as a bit vectoras a bit vector There is a distinct bit vector Bv for each value v for There is a distinct bit vector Bv for each value v for
the domainthe domain Example: the attribute sex has values M and F. A Example: the attribute sex has values M and F. A
table of 100 million people needs 2 lists of 100 table of 100 million people needs 2 lists of 100 million bitsmillion bits
Bit Map IndexBit Map Index
Cust Region RatingC1 N HC2 S MC3 W LC4 W HC5 S LC6 W LC7 N H
Base Base TableTable
Row ID N S E W1 1 0 0 02 0 1 0 03 0 0 0 14 0 0 0 15 0 1 0 06 0 0 0 17 1 0 0 0
Row ID H M L1 1 0 02 0 1 03 0 0 04 0 0 05 0 1 06 0 0 07 1 0 0
Rating IndexRating Index
Region Region IndexIndex
Customers whereCustomers where Region = Region = WW
Rating = Rating = 11AndAnd
BitMap IndexesBitMap Indexes
Comparison, join and aggregation operations are Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic reduced to bit arithmetic with dramatic improvement in processing timeimprovement in processing time
Significant reduction in space and I/O (30:1)Significant reduction in space and I/O (30:1) Adapted for higher cardinality domains as well.Adapted for higher cardinality domains as well. Compression (e.g., run-length encoding) exploitedCompression (e.g., run-length encoding) exploited Products that support bitmaps: Model 204, Products that support bitmaps: Model 204,
TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3
Issues in Handling of Issues in Handling of Aggregate ViewsAggregate Views
Important component for ROLAP ServersImportant component for ROLAP Servers Representation in the context of star schemaRepresentation in the context of star schema
• Query ExpressionsQuery Expressions• Materialized ViewsMaterialized Views
Logic for Aggregation NavigationLogic for Aggregation Navigation• make optimum use of materialized aggregates to answer a make optimum use of materialized aggregates to answer a
queryquery
Choice of aggregate views to materializeChoice of aggregate views to materialize HP Intelligent Warehouse pioneered some of the HP Intelligent Warehouse pioneered some of the
techniquestechniques
SQL Extensions for Front SQL Extensions for Front End ToolsEnd Tools
Extended Family of Aggregate functionsExtended Family of Aggregate functions• rank (top 10) and N-Tile (“top 30%” of all products)rank (top 10) and N-Tile (“top 30%” of all products)
• Median, mode…..Median, mode…..
Reporting FeaturesReporting Features• running total, cumulative totalsrunning total, cumulative totals
Results of multiple group by:Results of multiple group by:• total sales by month and total sales by producttotal sales by month and total sales by product
SQL comes in the way of sequential processing and SQL comes in the way of sequential processing and columnar aggregationscolumnar aggregations• changes in total sale from 1994 to 1996, aggregated by brandchanges in total sale from 1994 to 1996, aggregated by brand
Query Processing in Query Processing in MOLAP ServersMOLAP Servers
The storage model is an n-dimensional arrayThe storage model is an n-dimensional array Front end multidimensional queries map to Front end multidimensional queries map to
server capabilities in a straightforward wayserver capabilities in a straightforward way Direct Addressing abilitiesDirect Addressing abilities A straightforward array representation has A straightforward array representation has
good indexing properties but very poor good indexing properties but very poor storage utilization when the data is sparsestorage utilization when the data is sparse
Query Processing in Query Processing in MOLAP ServersMOLAP Servers
2-dimensional dense arrays indexed by 2-dimensional dense arrays indexed by B-TreesB-Trees
Traditional Traditional indexing indexing structurestructure
2-dimensional 2-dimensional dense arraysdense arrays
Population & Refreshing Population & Refreshing the Warehousethe Warehouse
Data extractionData extraction Data cleaningData cleaning Data transformationData transformation
• Convert from legacy/host format to warehouse formatConvert from legacy/host format to warehouse format
Load Load • Sort, summarize, consolidate, compute views, check Sort, summarize, consolidate, compute views, check
integrity, build indexes, partitionintegrity, build indexes, partition
RefreshRefresh• Propogate updates from sources to the warehousePropogate updates from sources to the warehouse
Data CleaningData Cleaning
Why?Why?• Data warehouse contains data that is analyzed for business decisionsData warehouse contains data that is analyzed for business decisions
• More data and multiple sources could mean more errors in the data More data and multiple sources could mean more errors in the data and harder to trace such errorsand harder to trace such errors
• Results in incorrect analysisResults in incorrect analysis
Detecting data anomalies and rectifying them early has huge Detecting data anomalies and rectifying them early has huge payoffspayoffs
Important to identify tools that work together wellImportant to identify tools that work together well Long Term SolutionLong Term Solution
• Change business practices and data entry toolsChange business practices and data entry tools
• Repository for meta-dataRepository for meta-data
Data Cleaning TechniquesData Cleaning Techniques
Transformation RulesTransformation Rules• Example: translate “gender” to “sex”Example: translate “gender” to “sex”
• Products: Warehouse Manger (Prism), Extract (ETI)Products: Warehouse Manger (Prism), Extract (ETI)
Uses domain-specific knowledge to do scrubbingUses domain-specific knowledge to do scrubbing Parsing and fuzzy matchingParsing and fuzzy matching
• Multiple data sources (can designate a preferred source)Multiple data sources (can designate a preferred source)
• Products: Integrity (Vality), TrillumProducts: Integrity (Vality), Trillum
Discover facts that flag unusual patterns (auditing)Discover facts that flag unusual patterns (auditing)• Some dealer has never received a single complaintSome dealer has never received a single complaint
• Products: QDB, SBStar, WizRuleProducts: QDB, SBStar, WizRule
LoadLoad
Issues:Issues:• huge volumes of data to be loadedhuge volumes of data to be loaded• small time window (usually at night) when the warehouse can be taken off-small time window (usually at night) when the warehouse can be taken off-
lineline• When to build indexes and summary tablesWhen to build indexes and summary tables• allow system administrator to monitor status, cancel suspend, resume load, allow system administrator to monitor status, cancel suspend, resume load,
or change load rateor change load rate• restart after failure with no loss of data integrityrestart after failure with no loss of data integrity
Techniques:Techniques:• batch load utility: sort input records on clustering key and use sequential batch load utility: sort input records on clustering key and use sequential
I/O; build indexes and derived tablesI/O; build indexes and derived tables• sequential loads still too long (~100 days for TB)sequential loads still too long (~100 days for TB)• use parallelism and incremental techniquesuse parallelism and incremental techniques
RefreshRefresh
Issues:Issues:• when to refreshwhen to refresh
– on every update: too expensive, only necessary if OLAP on every update: too expensive, only necessary if OLAP queries need current data queries need current data (e.g., up-the-minute stock quotes)(e.g., up-the-minute stock quotes)
– periodically (e.g., every 24 hours, every week) or after periodically (e.g., every 24 hours, every week) or after “significant” events“significant” events
– refresh policy set by administrator based on user needs and refresh policy set by administrator based on user needs and traffictraffic
– possibly different policies for different sourcespossibly different policies for different sources
• how to refreshhow to refresh
Refresh TechniquesRefresh Techniques
Full extract from base tablesFull extract from base tables• read entire source table or database: expensiveread entire source table or database: expensive• may be the only choice for legacy databases or files.may be the only choice for legacy databases or files.
Incremental techniques (related to work on active dbs)Incremental techniques (related to work on active dbs)• detect & propagate changes on base tables: replication servers (e.g., detect & propagate changes on base tables: replication servers (e.g.,
Sybase, Oracle, IBM Data Propagator)Sybase, Oracle, IBM Data Propagator)– snapshots & triggers (Oracle)snapshots & triggers (Oracle)– transaction shipping (Sybase)transaction shipping (Sybase)
• Logical correctnessLogical correctness– computing changes to star tablescomputing changes to star tables– computing changes to derived and summary tablescomputing changes to derived and summary tables– optimization: only significant changesoptimization: only significant changes
• transactional correctness: incremental loadtransactional correctness: incremental load
Metadata RepositoryMetadata Repository
Administrative metadataAdministrative metadata• source databases and their contentssource databases and their contents• gateway descriptionsgateway descriptions• warehouse schema, view & derived data definitionswarehouse schema, view & derived data definitions• dimensions, hierarchiesdimensions, hierarchies• pre-defined queries and reportspre-defined queries and reports• data mart locations and contentsdata mart locations and contents• data partitionsdata partitions• data extraction, cleansing, transformation rules, defaultsdata extraction, cleansing, transformation rules, defaults• data refresh and purging rulesdata refresh and purging rules• user profiles, user groupsuser profiles, user groups• security: user authorization, access controlsecurity: user authorization, access control
Metdata Repository .. 2Metdata Repository .. 2
Business dataBusiness data• business terms and definitionsbusiness terms and definitions• ownership of dataownership of data• charging policiescharging policies
operational metadataoperational metadata• data lineage: history of migrated data and sequence of data lineage: history of migrated data and sequence of
transformations appliedtransformations applied• currency of data: active, archived, purgedcurrency of data: active, archived, purged• monitoring information: warehouse usage statistics, error monitoring information: warehouse usage statistics, error
reports, audit trails.reports, audit trails.
Warehouse Design ToolsWarehouse Design Tools
Creating and managing a warehouse is hard.Creating and managing a warehouse is hard. Development toolsDevelopment tools
• defining & editing metadata repository contents (schemas, scripts, defining & editing metadata repository contents (schemas, scripts, rules).rules).
• Queries and reportsQueries and reports
• Shipping metadata to and from RDBMS catalogue (e.g., Prism Shipping metadata to and from RDBMS catalogue (e.g., Prism Warehouse Manager).Warehouse Manager).
Planning & analysis toolsPlanning & analysis tools• impact of schema changesimpact of schema changes
• capacity planningcapacity planning
• refresh performance: changing refresh rates or time windowsrefresh performance: changing refresh rates or time windows
Warehouse Management Warehouse Management ToolsTools
Monitoring and reporting tools (e.g., HP Intelligent Warehouse Monitoring and reporting tools (e.g., HP Intelligent Warehouse Advisor)Advisor)• which partitions, summary tables, columns are used which partitions, summary tables, columns are used • query execution timesquery execution times• for summary tables, types & frequencies of roll downsfor summary tables, types & frequencies of roll downs• warehouse usage over time (detect peak periods)warehouse usage over time (detect peak periods)
Systems and network management tools (e.g., HP OpenView, Systems and network management tools (e.g., HP OpenView, IBM NetView, Tivoli): traffic, utilizationIBM NetView, Tivoli): traffic, utilization
Exception reporting/alerting tools 9e.g., DB2 Event Alerters, Exception reporting/alerting tools 9e.g., DB2 Event Alerters, Information Advantage InfoAgents & InfoAlert)Information Advantage InfoAgents & InfoAlert)• runaway queriesrunaway queries
Analysis/Visualization tools: OLAP on metadataAnalysis/Visualization tools: OLAP on metadata
State of Commercial State of Commercial PracticePractice
Products and Vendors [Datamation, May 15, 1996; R.C. Barquin, H.A. Edelstein: Planning and Designin Products and Vendors [Datamation, May 15, 1996; R.C. Barquin, H.A. Edelstein: Planning and Designin gthe Data Warehous. Prentice Hall. 1997]gthe Data Warehous. Prentice Hall. 1997]
• Connectivity to sourcesConnectivity to sources– ApertusApertus CA-Ingres GatewayCA-Ingres Gateway– Information Builders EDA/SQLInformation Builders EDA/SQL IBM Data JionerIBM Data Jioner– Informix Enterprise GatewayInformix Enterprise Gateway Microsoft ODBCMicrosoft ODBC– Oracle Open ConnectOracle Open Connect Platinum InfohubPlatinum Infohub– SAS ConnectSAS Connect Software AG EntireSoftware AG Entire– Sybase Enterprise ConnectSybase Enterprise Connect Trinzic InfoHubTrinzic InfoHub
• Data extract, clean, transfomr, refreshData extract, clean, transfomr, refresh– CA-Ingres ReplicatorCA-Ingres Replicator Carleton PassportCarleton Passport– Evolutionary Tech Inc. ETI-ExtractEvolutionary Tech Inc. ETI-Extract Harte-Hanks TrilliumHarte-Hanks Trillium– IBM Data Joiner, Data PropagatorIBM Data Joiner, Data Propagator Oracle 7Oracle 7– Platinum InfoRefiner, InfroPumpPlatinum InfoRefiner, InfroPump Praxis OmniReplicatorPraxis OmniReplicator– Prism Warehouse ManagerPrism Warehouse Manager Redbrick TMURedbrick TMU– SAS AccessSAS Access Software AG SouorcepointSoftware AG Souorcepoint– Sybase Replication ServerSybase Replication Server Trinzic InfoPumpTrinzic InfoPump
State of Commercial State of Commercial Practice..2Practice..2
Multidimensional Database EnginesMultidimensional Database EnginesArbor EssbaseArbor Essbase Comshare Commander OLAPComshare Commander OLAP
Oracle IRI ExpressOracle IRI Express SAS SystemSAS System Warehouse Data ServersWarehouse Data Servers
• CA-IngresCA-Ingres IBM DB2IBM DB2
• Information Builders FocusInformation Builders Focus InformixInformix
• OracleOracle Praxiz Model 204Praxiz Model 204
• RedbrickRedbrick Software AG ADABASSoftware AG ADABAS
• Sybase MPPSybase MPP TandemTandem
• TerdataTerdata ROLAP ServersROLAP Servers
• HP Intelligent WarehouseHP Intelligent Warehouse Information Advantage AsxysInformation Advantage Asxys
• Informix MetacubeInformix Metacube MicroStrategy DSS ServerMicroStrategy DSS Server
State of Commercial State of Commercial Practice..3Practice..3
Query/Reporting EnvironmentsQuery/Reporting EnvironmentsBrio/QueryBrio/Query Business ObjectsBusiness Objects
Cognos ImpromptuCognos Impromptu CA Visual ExpressCA Visual Express
IBM DataGuideIBM DataGuide Information Builders Focus SixInformation Builders Focus Six
Informix ViewPointInformix ViewPoint Platinum Forest & TreesPlatinum Forest & Trees
SAS AccessSAS Access Software AG EsperantSoftware AG Esperant Multidimensional AnalysisMultidimensional Analysis
Andydne PabloAndydne Pablo Arbor Essbase Analysis ServerArbor Essbase Analysis Server
Business ObjectsBusiness Objects Cognos PowerPlayCognos PowerPlay
Dimensional Insight Cross TargetDimensional Insight Cross Target Holistic Systems HOLOSHolistic Systems HOLOS
Information Advantage Decision SuiteInformation Advantage Decision Suite IQ Software IQ/VisionIQ Software IQ/Vision
Kenan System AcumateKenan System Acumate Lotus 123Lotus 123
Microsoft ExcelMicrosoft Excel Microstrategy DSSMicrostrategy DSS
Pilot LightshipPilot Lightship Platinum Forest & TreesPlatinum Forest & Trees
Prodea BeaconProdea Beacon SAS OLAP ++SAS OLAP ++
Stanford Technology Group MetacubeStanford Technology Group Metacube
State of Commercial State of Commercial Practice..4Practice..4
Metadata ManagementMetadata Management• HP Intelligent WarehouseHP Intelligent Warehouse IBM Data GuideIBM Data Guide
• Platinum Repository Platinum Repository Prism Directory ManagerPrism Directory Manager System ManagementSystem Management
• CA UnicenterCA Unicenter HP OpenViewHP OpenView
• IBM DataHub, NetViewIBM DataHub, NetView Information Builder Site AnalyzerInformation Builder Site Analyzer
• Prism Warehouse ManagerPrism Warehouse Manager SAS CPESAS CPE
• TivoliTivoli Software AG Source PointSoftware AG Source Point
• Redbrick Enterprise Control and CoordinationRedbrick Enterprise Control and Coordination Process Management Process Management
• At& T TOPENDAt& T TOPEND HP Intelligent WarehouseHP Intelligent Warehouse
• IBM FlowMarkIBM FlowMark Platinum RepositoryPlatinum Repository
• Prism Warehouse ManagerPrism Warehouse Manager Software AG Source PointSoftware AG Source Point Systems integration and consultingSystems integration and consulting
Research IssuesResearch Issues
Data cleaningData cleaning• focus on data inconsistencies, not schema differencesfocus on data inconsistencies, not schema differences
• data mining techniquesdata mining techniques Physical DesignPhysical Design
• design of summary tables, partitions, indexesdesign of summary tables, partitions, indexes
• tradeoffs in use of different indexestradeoffs in use of different indexes Query processingQuery processing
• selecting appropriate summary tablesselecting appropriate summary tables
• dynamic optimization with feedbackdynamic optimization with feedback
• acid test for query optimization: cost estimation, use of transformations, acid test for query optimization: cost estimation, use of transformations, search strategiessearch strategies
• partitioning query processing between OLAP server and backend server.partitioning query processing between OLAP server and backend server.
Research Issues .. 2Research Issues .. 2
Warehouse ManagementWarehouse Management• detecting runaway queriesdetecting runaway queries
• resource managementresource management
• incremental refresh techniquesincremental refresh techniques
• computing summary tables during loadcomputing summary tables during load
• failure recovery during load and refreshfailure recovery during load and refresh
• process management: scheduling queries, load and process management: scheduling queries, load and refreshrefresh
• use of workflow technology for process managementuse of workflow technology for process management