decision support, data warehousing, and olap by prof. sham navathe georgia institute of technology...
Post on 19-Dec-2015
Decision Support, Data Warehousing, and OLAP
By Prof. Sham Navathe, Georgia Institute of Technology
(Courtesy: Prof. Anindya Datta)
Extensions by Svetlana Mansmann, University of Konstanz

Outline
• Terminology: OLAP vs. OLTP
• Data Warehousing Architecture
• Technologies
• Products
• References

Decision Support and OLAP
• Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions
  • What were the sales volumes by region and product category for the last year?
  • How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
  • Which orders should we fill to maximize revenues?
  • Will a 10% discount increase sales volume sufficiently?
  • Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay?
• On-Line Analytical Processing (OLAP) is an element of decision support systems (DSS)
• Business Intelligence

Evolution
• 60’s: Batch reports
  • hard to find and analyze information
  • inflexible and expensive, reprogram every new request
• 70’s: Terminal-based DSS and EIS (executive information systems)
  • still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
  • query tools, spreadsheets, GUIs
  • easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP engines and tools
• 2000’s: Personalization engines and e-commerce

OLTP vs. OLAP

                     OLTP                               OLAP
User                 Clerk, IT professional             Knowledge worker
Function             Day-to-day operations              Decision support
DB design            Application-oriented (E-R based)   Subject-oriented (star, snowflake)
Data                 Current, isolated                  Historical, consolidated
View                 Detailed, flat relational          Summarized, multidimensional
Usage                Structured, repetitive             Ad hoc
Unit of work         Short, simple transaction          Complex query
Access               Read/write                         Read mostly
Operations           Index/hash on primary key          Lots of scans
# Records accessed   Tens                               Millions
# Users              Thousands                          Hundreds
DB size              100 MB-GB                          100 GB-TB
Metric               Transaction throughput             Query throughput, response
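
The contrast in unit of work and access pattern between OLTP and OLAP can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely for illustration; the orders table, its columns, and the sample rows are assumptions, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_no INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
# Hypothetical sample rows.
rows = [(1, "East", 100.0), (2, "West", 250.0), (3, "East", 50.0), (4, "West", 75.0)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# OLTP-style unit of work: a short read/write transaction that touches
# one record, located through the primary key.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE orders SET amount = amount + 10 WHERE order_no = ?", (3,))
oltp_row = conn.execute(
    "SELECT order_no, region, amount FROM orders WHERE order_no = ?", (3,)
).fetchone()

# OLAP-style query: a read-only scan over many records, summarized by a dimension.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
```

On a real system the two workloads differ by orders of magnitude in records touched, which is what the "# Records accessed" row of the comparison expresses.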

Data Warehouse
• A decision support database that is maintained separately from the organization’s operational databases.
• A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

Why Separate Data Warehouse?
• Performance
  • Operational databases are designed & tuned for known transactions & workloads
  • Complex OLAP queries would degrade performance, taxing operations
  • Special data organization, access & implementation methods are needed for multidimensional views & queries

Why Separate Data Warehouse?
• Function
  • Missing data: Decision support requires historical data, which operational databases do not typically maintain
  • Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
  • Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
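
As a rough illustration of the consolidation and data-quality points, the sketch below reconciles two sources that encode states and dates differently before aggregating them. The record layouts, the state-code mapping, and the date formats are all hypothetical.

```python
from datetime import datetime

# Hypothetical source records with inconsistent codes and date formats.
source_a = [{"cust": "C1", "state": "GA", "date": "1996-01-15", "amount": 120.0}]
source_b = [{"cust": "C2", "state": "Georgia", "date": "15/01/1996", "amount": 80.0}]

# Assumed mapping from spelled-out names to the warehouse's canonical codes.
STATE_CODES = {"Georgia": "GA", "California": "CA"}


def reconcile(record, date_format):
    """Return a copy of the record in one canonical representation."""
    state = STATE_CODES.get(record["state"], record["state"])
    date = datetime.strptime(record["date"], date_format).date()
    return {"cust": record["cust"], "state": state, "date": date,
            "amount": record["amount"]}


# Reconcile each source with its own date format, then consolidate.
consolidated = [reconcile(r, "%Y-%m-%d") for r in source_a] + [
    reconcile(r, "%d/%m/%Y") for r in source_b
]

# Aggregation only works once the codes agree.
total_by_state = {}
for r in consolidated:
    total_by_state[r["state"]] = total_by_state.get(r["state"], 0.0) + r["amount"]

print(total_by_state)
```

Without the reconciliation step the two records would land in separate groups ("GA" and "Georgia"), and the summary would be wrong.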

Data Warehousing / OLAP Market

Data Warehousing Market

Data Warehousing Architecture

Three-Tier Architecture
• Warehouse database server
  • Almost always a relational DBMS; rarely flat files
• OLAP servers
  • Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.
  • Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.
• Clients
  • Query and reporting tools
  • Analysis tools
  • Data mining tools (e.g., trend analysis, prediction)

Data Warehouse vs. Data Marts
• Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
  • Requires extensive business modeling
  • May take years to design and build
• Data Marts: Departmental subsets that focus on selected subjects
  • Marketing data mart: customer, products, sales.
  • Faster roll out, but complex integration in the long run
• Virtual warehouse: views over operational DBs
  • Materialize some summary views for efficient query processing
  • Easier to build
  • Requires excess capacity on operational DB servers

Design & Operational Process
• Define architecture. Do capacity planning.
• Integrate DB and OLAP servers, storage and client tools.
• Design warehouse schema, views.
• Design physical warehouse organization: data placement, partitioning, access methods.
• Connect sources: gateways, ODBC drivers, wrappers.
• Design & implement scripts for data extract, load, refresh.
• Define metadata and populate repository.
• Design & implement end-user applications.
• Roll out warehouse and applications.
• Monitor the warehouse.

OLAP for Decision Support
• Goal of OLAP is to support ad-hoc querying for the business analyst
• Business analysts are familiar with spreadsheets
• Extend spreadsheet analysis model to work with warehouse data
  • Large data set
  • Semantically enriched to understand business terms (e.g., time, geography)
  • Combined with reporting features
• Multidimensional view of data is the foundation of OLAP
• Pivot table: a multidimensional spreadsheet

Multidimensional Data Model
• Database is a set of facts (points) in a multidimensional space
• A fact has
  • a measure: the quantity that is analyzed, e.g., sale, budget
  • a set of dimensions on which data is analyzed, e.g., store, product, date associated with a sale amount
• Dimensions form a sparsely populated coordinate system
• Each dimension has a set of attributes
  • e.g., owner, city and county of store
• Attributes of a dimension may be related by partial order
  • Hierarchy: e.g., street > county > city
  • Lattice: e.g., date > month > year, date > week > year
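
A minimal sketch of this model, assuming invented store names, products, and amounts: each fact is a point keyed by its dimension values, most coordinate combinations are empty, and the date dimension can be summarized along two different paths of the lattice (by month or by week).

```python
from datetime import date

# Each fact is a point (store, product, date) carrying a measure (sale amount).
# All values here are hypothetical.
facts = {
    ("S1", "Cola", date(1996, 1, 15)): 47.0,
    ("S1", "Milk", date(1996, 1, 15)): 30.0,
    ("S2", "Cola", date(1996, 2, 3)): 12.0,
}

# The coordinate system is sparsely populated: count filled vs. possible cells.
stores = {k[0] for k in facts}
products = {k[1] for k in facts}
dates = {k[2] for k in facts}
populated_fraction = len(facts) / (len(stores) * len(products) * len(dates))


# Two paths through the date lattice: date > month > year and date > week > year.
def month_of(d):
    return (d.year, d.month)


def week_of(d):
    iso = d.isocalendar()
    return (iso[0], iso[1])  # (ISO year, ISO week number)


def summarize(facts, date_level):
    """Sum the measure after mapping each date to a coarser level."""
    totals = {}
    for (_store, _product, d), amount in facts.items():
        key = date_level(d)
        totals[key] = totals.get(key, 0.0) + amount
    return totals


sales_by_month = summarize(facts, month_of)
sales_by_week = summarize(facts, week_of)
print(populated_fraction, sales_by_month, sales_by_week)
```

Weeks do not nest inside months, which is why the date attributes form a lattice rather than a single hierarchy.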

Multidimensional Data
[Figure: sales volume as a function of date, city and product. Axes: Date (3/1, 3/2, 3/3, 3/4), Product (Juice, Cola, Milk, Cream), City (NY, LA, SF). Sample cell values: 10, 47, 30, 12.]

Sample Data Cube
[Figure: data cube with dimensions Degree (Diploma, B.Sc., M.Sc.), Term (1st, 2nd, 3rd, 4th) and Country (Germany, Switzerland, U.S.A.), with ∑ totals along each dimension. Highlighted cell: German students in the 4th term pursuing a diploma.]
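
The ∑ cells of such a cube hold totals over one or more dimensions. The sketch below shows one simple way to compute them, assuming made-up student counts; it is an illustration of the idea, not the method of any particular OLAP product.

```python
from itertools import product as cartesian

# Hypothetical student counts keyed by (degree, term, country).
counts = {
    ("Diploma", "4th", "Germany"): 5,
    ("Diploma", "1st", "Germany"): 8,
    ("B.Sc.", "1st", "Switzerland"): 3,
    ("M.Sc.", "2nd", "U.S.A."): 2,
}
ALL = "∑"  # marker for "summed over this dimension"


def build_cube(cells):
    """Add every base cell into all 2^n cells it contributes to."""
    cube = {}
    for key, value in cells.items():
        # Each mask says, per dimension, whether to keep the coordinate
        # or replace it by the ALL marker.
        for mask in cartesian([False, True], repeat=len(key)):
            agg_key = tuple(ALL if summed else k for k, summed in zip(key, mask))
            cube[agg_key] = cube.get(agg_key, 0) + value
    return cube


cube = build_cube(counts)
print(cube[("Diploma", "4th", "Germany")])  # base cell
print(cube[("Diploma", ALL, "Germany")])    # total over all terms
print(cube[(ALL, ALL, ALL)])                # grand total
```

The number of aggregate cells grows quickly with the number of dimensions, which is one reason warehouses choose which summaries to materialize.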

Operations in Multidimensional Data Model
• Aggregation (roll-up)
  • dimension reduction: e.g., total sales by city
  • summarization over aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year
• Navigation to detailed data (drill-down)
  • e.g., (sales - expense) by city, top 3% of cities by average income
• Selection (slice) defines a subcube
  • e.g., sales where city = Palo Alto and date = 1/15/96
• Visualization Operations (e.g., Pivot)
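
The roll-up, drill-down, and slice operations listed above can be sketched on a small in-memory fact set. The cities, the city-to-region mapping, and the sales figures are assumptions made for the example.

```python
from collections import defaultdict

# Facts: (city, year, product) -> sales. All values are hypothetical.
sales = {
    ("Palo Alto", 1995, "Cola"): 10,
    ("Palo Alto", 1996, "Cola"): 47,
    ("San Jose", 1996, "Milk"): 30,
    ("New York", 1996, "Cola"): 12,
}
region_of = {"Palo Alto": "West", "San Jose": "West", "New York": "East"}


def roll_up(facts, key_fn):
    """Aggregate facts after mapping each key to a coarser key."""
    out = defaultdict(int)
    for key, value in facts.items():
        out[key_fn(key)] += value
    return dict(out)


# Roll-up by dimension reduction: drop year and product, keep city.
by_city = roll_up(sales, lambda k: k[0])

# Roll-up along a hierarchy: (city, year) -> (region, year).
by_city_year = roll_up(sales, lambda k: (k[0], k[1]))
by_region_year = roll_up(by_city_year, lambda k: (region_of[k[0]], k[1]))

# Drill-down: from the West region totals back to the city-level cells.
west_detail = {k: v for k, v in by_city_year.items() if region_of[k[0]] == "West"}

# Slice: fix dimension values to select a subcube.
palo_alto_1996 = {
    k: v for k, v in sales.items() if k[0] == "Palo Alto" and k[1] == 1996
}

print(by_city, by_region_year, west_detail, palo_alto_1996)
```

Drill-down is only possible because the finer-grained cells are still available; a warehouse that stored only region totals could not answer it.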

A Visual Operation: Pivot (Rotate)
[Figure: the sales cube (products Juice, Cola, Milk, Cream; cities NY, LA, SF; dates 3/1 to 3/4; sample values 10, 47, 30, 12) rotated so that its axes read Month, Region and Product.]
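
A pivot changes which dimensions are laid out as rows and columns without changing the underlying facts. The following sketch builds two cross-tabulations of the same data; the assignment of the sample values to particular cities and dates is assumed for illustration.

```python
# Facts keyed by (product, city, date); cell placement is hypothetical.
facts = {
    ("Juice", "NY", "3/1"): 10,
    ("Cola", "NY", "3/1"): 47,
    ("Milk", "LA", "3/2"): 30,
    ("Cream", "SF", "3/2"): 12,
}
PRODUCT, CITY, DATE = 0, 1, 2  # positions of the dimensions in each key


def cross_tab(facts, row_axis, col_axis):
    """Lay two dimensions out as rows and columns, summing over the rest."""
    table = {}
    for key, value in facts.items():
        row, col = key[row_axis], key[col_axis]
        cells = table.setdefault(row, {})
        cells[col] = cells.get(col, 0) + value
    return table


by_product_city = cross_tab(facts, PRODUCT, CITY)
by_city_product = cross_tab(facts, CITY, PRODUCT)  # the same data, rotated
by_date_product = cross_tab(facts, DATE, PRODUCT)  # a different pair of axes

print(by_product_city)
print(by_city_product)
print(by_date_product)
```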

Approaches to OLAP Servers
• Relational OLAP (ROLAP)
  • Relational and specialized relational DBMS to store and manage warehouse data
  • OLAP middleware to support missing pieces
    • Optimize for each DBMS backend
    • Aggregation navigation logic
    • Additional tools and services
• Multidimensional OLAP (MOLAP)
  • Array-based storage structures
  • Direct access to array data structures
• Domain-specific enrichment
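
To make "array-based storage" and "direct access" concrete, here is a minimal sketch of a dense cube stored in one flat array, where a cell's position is computed from its coordinates. The class name ArrayCube and its methods are invented for this example and do not describe any specific MOLAP server.

```python
class ArrayCube:
    """Dense, array-based cube: one slot per combination of dimension members."""

    def __init__(self, dimensions):
        # dimensions: list of member lists, e.g. [products, cities, dates]
        self.positions = [
            {member: i for i, member in enumerate(members)} for members in dimensions
        ]
        size = 1
        for members in dimensions:
            size *= len(members)
        self.cells = [0.0] * size  # empty cells still occupy space

    def offset(self, coords):
        """Row-major linearization of a coordinate tuple."""
        result = 0
        for position, member in zip(self.positions, coords):
            result = result * len(position) + position[member]
        return result

    def set(self, coords, value):
        self.cells[self.offset(coords)] = value

    def get(self, coords):
        # Direct access: no search or join, only index arithmetic.
        return self.cells[self.offset(coords)]


products = ["Juice", "Cola", "Milk", "Cream"]
cities = ["NY", "LA", "SF"]
dates = ["3/1", "3/2", "3/3", "3/4"]

cube = ArrayCube([products, cities, dates])
cube.set(("Cola", "LA", "3/2"), 47.0)
print(cube.offset(("Cola", "LA", "3/2")), cube.get(("Cola", "LA", "3/2")), len(cube.cells))
```

The trade-off is visible in the sketch: lookups are cheap, but a sparsely populated cube wastes space on empty slots, whereas a ROLAP server stores only the rows that exist and maps the same lookup to relational operations.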

Relational DBMS as Warehouse Server
• Schema design
• Specialized scan, indexing and join techniques
• Handling of aggregate views (querying and materialization)
• Supporting query language extensions beyond SQL
• Complex query processing and optimization
• Data partitioning and parallelism

Warehouse Database Schema
• ER design techniques not appropriate
• Design should reflect multidimensional view
  • Star Schema
  • Snowflake Schema
  • Fact Constellation Schema

Example of a Star Schema
• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Dimension tables
  • Order: Order No, Order Date
  • Customer: Customer No, Customer Name, Customer Address, City
  • Salesperson: SalespersonID, SalespersonName, City, Quota
  • Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
  • Date: DateKey, Date
  • City: CityName, State, Country

Star Schema
• A single fact table and a single table for each dimension
• Every fact points to one tuple in each of the dimensions and has additional attributes
• Does not capture hierarchies directly
• Generated keys are used for performance and maintenance reasons
• Fact constellation: Multiple fact tables that share many dimension tables
  • Example: Projected expense and the actual expense may share dimensional tables
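
A star schema and a typical query against it can be sketched with Python's built-in sqlite3 module. The table and column names loosely follow the example schema in the slides (product, city, and date dimensions around one fact table), but the exact names, types, and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prod_no INTEGER PRIMARY KEY, prod_name TEXT,
                      category TEXT, unit_price REAL);
CREATE TABLE city (city_name TEXT PRIMARY KEY, state TEXT, country TEXT);
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, date TEXT);
-- Every fact row points to one tuple in each dimension and carries measures.
CREATE TABLE fact (
    order_no    INTEGER,
    prod_no     INTEGER REFERENCES product(prod_no),
    date_key    INTEGER REFERENCES date_dim(date_key),
    city_name   TEXT REFERENCES city(city_name),
    quantity    INTEGER,
    total_price REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)",
                 [(1, "Cola", "Beverage", 1.5), (2, "Milk", "Dairy", 2.0)])
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [("Palo Alto", "CA", "USA"), ("Atlanta", "GA", "USA")])
conn.execute("INSERT INTO date_dim VALUES (19960115, '1996-01-15')")
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?, ?)", [
    (100, 1, 19960115, "Palo Alto", 10, 15.0),
    (101, 2, 19960115, "Palo Alto", 5, 10.0),
    (102, 1, 19960115, "Atlanta", 4, 6.0),
])

# Star join: the fact table joined to two dimensions, then aggregated.
star_join = conn.execute("""
    SELECT p.category, c.state, SUM(f.total_price)
    FROM fact f
    JOIN product p ON f.prod_no = p.prod_no
    JOIN city c ON f.city_name = c.city_name
    GROUP BY p.category, c.state
    ORDER BY p.category, c.state
""").fetchall()
print(star_join)
```

Note that the state and country live in the same city table: the hierarchy city, state, country is present only as columns, which is what "does not capture hierarchies directly" refers to.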

Example of a Snowflake Schema
• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Dimension tables
  • Order: Order No, Order Date
  • Customer: Customer No, Customer Name, Customer Address, City
  • Salesperson: SalespersonID, SalespersonName, City, Quota
  • Product: ProductNO, ProdName, ProdDescr, Category, UnitPrice
  • Date: DateKey, Date, Month
  • City: CityName, State, Country
• Normalized (outer) dimension tables
  • Category: CategoryName, CategoryDescr
  • Month: Month, Year
  • Year: Year
  • State: StateName, Country

Snowflake Schema
• Represents dimensional hierarchy directly by normalizing the dimension tables
• Easy to maintain
• Saves storage, but allegedly reduces the effectiveness of browsing (Kimball)
• Galaxy schema: multiple fact tables with shared dimension categories
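
The effect of normalizing a dimension can be sketched by moving the category attributes of a product into their own table. This uses sqlite3 again; the table names, columns, and rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: category attributes live in their own table.
CREATE TABLE category (category_name TEXT PRIMARY KEY, category_descr TEXT);
CREATE TABLE product (
    prod_no       INTEGER PRIMARY KEY,
    prod_name     TEXT,
    category_name TEXT REFERENCES category(category_name),
    unit_price    REAL
);
CREATE TABLE fact (
    order_no    INTEGER,
    prod_no     INTEGER REFERENCES product(prod_no),
    quantity    INTEGER,
    total_price REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO category VALUES (?, ?)",
                 [("Beverage", "Drinks"), ("Dairy", "Milk products")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)",
                 [(1, "Cola", "Beverage", 1.5), (2, "Milk", "Dairy", 2.0)])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)",
                 [(100, 1, 10, 15.0), (101, 2, 5, 10.0), (102, 1, 4, 6.0)])

# Reaching the category description now takes one more join than in a star schema.
by_category = conn.execute("""
    SELECT c.category_name, c.category_descr, SUM(f.total_price)
    FROM fact f
    JOIN product p ON f.prod_no = p.prod_no
    JOIN category c ON p.category_name = c.category_name
    GROUP BY c.category_name, c.category_descr
    ORDER BY c.category_name
""").fetchall()
print(by_category)
```

Each category description is stored once instead of once per product, which saves storage; the price is the extra join on every query that browses by category.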

Population & Refreshing the Warehouse
• Data extraction
• Data cleaning
• Data transformation
  • Convert from legacy/host format to warehouse format
• Load
  • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
• Refresh
  • Propagate updates from sources to the warehouse
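
The stages above can be strung together in a small sketch. It assumes a legacy source that delivers dictionaries with an MM/DD/YY date string, and it implements refresh in the simplest possible way, by loading a batch of new records and rebuilding the summary table; real refresh strategies are usually incremental. All names and data are hypothetical.

```python
import sqlite3
from datetime import datetime


def clean(rows):
    """Data cleaning: drop records without an amount, normalize city names."""
    return [
        {**r, "city": r["city"].strip().title()}
        for r in rows
        if r.get("amount") is not None
    ]


def transform(rows):
    """Transformation: convert the assumed legacy MM/DD/YY dates to ISO format."""
    return [
        (r["city"],
         datetime.strptime(r["date"], "%m/%d/%y").date().isoformat(),
         r["amount"])
        for r in rows
    ]


def load(conn, tuples):
    """Load: append rows, recompute the summary view, build the index."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", tuples)
    conn.execute("DROP TABLE IF EXISTS sales_by_city")
    conn.execute(
        "CREATE TABLE sales_by_city AS "
        "SELECT city, SUM(amount) AS total FROM sales GROUP BY city"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_city ON sales(city)")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, sale_date TEXT, amount REAL)")

# Initial population from extracted source records.
extracted = [
    {"city": " palo alto ", "date": "01/15/96", "amount": 47.0},
    {"city": "New York", "date": "01/16/96", "amount": None},  # rejected by cleaning
]
load(conn, transform(clean(extracted)))

# Refresh: propagate later source updates to the warehouse.
updates = [{"city": "PALO ALTO", "date": "01/17/96", "amount": 3.0}]
load(conn, transform(clean(updates)))

summary = conn.execute("SELECT city, total FROM sales_by_city ORDER BY city").fetchall()
dates = conn.execute("SELECT sale_date FROM sales ORDER BY sale_date").fetchall()
print(summary, dates)
```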

Metadata Repository
• Administrative metadata
  • source databases and their contents
  • gateway descriptions
  • warehouse schema, view & derived data definitions
  • dimensions, hierarchies
  • pre-defined queries and reports
  • data mart locations and contents
  • data partitions
  • data extraction, cleansing, transformation rules, defaults
  • data refresh and purging rules
  • user profiles, user groups
  • security: user authorization, access control

Metadata Repository .. 2
• Business data
  • business terms and definitions
  • ownership of data
  • charging policies
• Operational metadata
  • data lineage: history of migrated data and sequence of transformations applied
  • currency of data: active, archived, purged
  • monitoring information: warehouse usage statistics, error reports, audit trails.

Warehouse Design Tools
• Creating and managing a warehouse is hard
• Development tools
  • defining & editing metadata repository contents (schemas, scripts, rules)
  • Queries and reports
  • Shipping metadata to and from RDBMS catalogue (e.g., Prism Warehouse Manager)
• Planning & analysis tools
  • impact of schema changes
  • capacity planning
  • refresh performance: changing refresh rates or time windows

Warehouse Management Tools
• Monitoring and reporting tools (e.g., HP Intelligent Warehouse Advisor)
  • which partitions, summary tables, columns are used
  • query execution times
  • for summary tables, types & frequencies of roll downs
  • warehouse usage over time (detect peak periods)
• Systems and network management tools (e.g., HP OpenView, IBM NetView, Tivoli): traffic, utilization
• Exception reporting/alerting tools (e.g., DB2 Event Alerters, Information Advantage InfoAgents & InfoAlert)
  • runaway queries
• Analysis/Visualization tools: OLAP on metadata

OLAP Tools
• Existing Tools: Seagate, Brio, Cognos
• Functionality:
  • Choice of tables
  • Allowing user to specify interrelation relationships
  • Use of filtering conditions
  • Construction of “cubes on the fly”
• Main Problems: Cost per license, poor semantics of aggregations across tables, performance for multiple dimension cubes
• Visual OLAP Tool Tableau: http://www.tableausoftware.com/ptour.htm