Data warehouse
TRANSCRIPT
Slide 1
2. Data warehousing
2/12/2013, 2. Data warehouse / D.S. Jagli
Topics to be covered
- Getting data into the data warehouse
  - Extraction
  - Transformation
  - Cleansing
  - Loading
  - Summarization
- Metadata
- Data marts
- Data warehousing & ERP
- Data warehousing & KM
- Data warehousing & CRM
A sales manager wants to know:
- Which are our lowest/highest margin customers?
- Who are my customers and what products are they buying?
- Which customers are most likely to go to the competition?
- What impact will new products/services have on revenue and margins?
- What product promotions have the biggest impact on revenue?
- What is the most effective distribution channel?
Data, data everywhere, yet...
- I can't find the data I need
  - data is scattered over the network
  - many versions, slight differences
- I can't get the data I need
  - need an expert to get the data
- I can't understand the data I found
  - available data is poorly documented
- I can't use the data I found
  - results are unexpected
  - data needs to be transformed from one format to another
What is data warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
[Forrester Research, April 1996]
Data -> Information
What is a data warehouse?
- A data warehouse is a repository (a collection of resources that can be accessed to retrieve information) of an organization's electronically stored data, designed to facilitate reporting and analysis.
- In its simplest form, a data warehouse is a collection of a large amount of data.
- A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context. [Barry Devlin]
What are the users saying...
- Data should be integrated across the enterprise
- Summary data has a real value to the organization
- Historical data holds the key to understanding data over time
Data Warehouse
A data warehouse is a
- Subject-oriented
- Integrated
- Time-variant
- Non-volatile
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse
Subject-oriented: data is organized by business topic, not by customer_id.
[Figure: Data is integrated and loaded by subject. D/W data is grouped by subject (Cust, Prod, Order, Payt) with yearly data (1996, 1997, 1998).]
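Integrated: data are stored as a single unit, not as a collection of files that may have different structures. E.g., operational systems may encode the same attribute differently (Sex = M/F, 1/2 or 0/1), while the warehouse stores it in one consistent form.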
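A minimal sketch of this kind of integration, assuming three hypothetical source systems that encode the same attribute differently. The system names and the meaning of each numeric code are assumptions made for illustration, not part of the slides.

```python
# Hypothetical per-source code tables; which number means which value
# is an assumption for this sketch.
SEX_CODES = {
    "hr": {"M": "M", "F": "F"},
    "payroll": {"1": "M", "2": "F"},
    "product_mgmt": {"0": "M", "1": "F"},
}


def integrate_sex(source, raw_value):
    """Translate a source-specific code into the single warehouse encoding."""
    return SEX_CODES[source][str(raw_value).strip().upper()]


# The same attribute arriving from three systems in three encodings.
records = [("hr", "F"), ("payroll", 1), ("product_mgmt", "0")]
integrated = [integrate_sex(src, val) for src, val in records]
print(integrated)
```

After this step every record uses the same M/F encoding, regardless of which system it came from.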
Time-variant: a time dimension is explicitly included in the data so that trends and changes over time can be studied.
Data elements won't change, i.e., the warehouse is a single, complete and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.
E.g.:
- Designated time frame (3 - 10 years)
- Key includes date
Non-volatile: the data doesn't keep changing; new data may be added on a scheduled basis, but old data isn't discarded.
[Figure: Data warehouse with no data update. Data is loaded once and then only read.]
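The load-and-read behaviour of a non-volatile store can be sketched as below. The class `AppendOnlyWarehouse` and its methods are hypothetical names for illustration; a real warehouse would enforce this in the database, not in application code.

```python
from datetime import date


class AppendOnlyWarehouse:
    """Hypothetical in-memory store that allows load and read, but no update."""

    def __init__(self):
        self._rows = []

    def load(self, load_date, rows):
        # Scheduled load: new rows are appended, old rows are never touched.
        for row in rows:
            self._rows.append({"load_date": load_date, **row})

    def read(self, **criteria):
        # Reads return copies, so callers cannot modify stored rows.
        return [dict(r) for r in self._rows
                if all(r.get(k) == v for k, v in criteria.items())]


wh = AppendOnlyWarehouse()
wh.load(date(2013, 1, 31), [{"cust_id": 1, "balance": 25000}])
wh.load(date(2013, 2, 12), [{"cust_id": 1, "balance": 1200}])
history = wh.read(cust_id=1)
print(len(history))
```

Both loads remain visible in `history`, which is what makes trend analysis over time possible.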
These definitions are quite different on the surface, yet several common threads emerge from reading them as a group:
- A DWH contains a lot of data; some of the data comes from operational sources and some from outside.
- A DWH is organized so as to facilitate the use of this data for decision-making purposes.
- The DWH provides tools by which end users can access this data.
ODS vs. DW
ODS | Data warehouse
Stores current operational data | Stores historic and current data
Performs numerous quick and simple queries on small amounts of data | Performs complex queries on large amounts of data
Contains small amounts of transactional data | Contains large amounts of static data
Day-to-day decisions, current operational results, tactical reporting | Long-term decisions, strategic reporting, trend detection
Near to normal form | Star schema
Frequency of load: twice daily, daily, weekly | Frequency of load: weekly, monthly, quarterly
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse.
Special data organization, access methods and implementation methods are needed to support data warehouse queries.
E.g., the average amount spent on phone calls between 9AM-5PM in Pune during the month of December.
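As an illustration of such an ad-hoc query, the sketch below runs it against a small in-memory SQLite table. The table name `calls`, its columns and the sample rows are assumptions for this example; "9AM-5PM" is interpreted here as calls that start at or after 09:00 and before 17:00.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", [
    ("Pune", "2012-12-03 10:15:00", 12.0),
    ("Pune", "2012-12-10 16:40:00", 8.0),
    ("Pune", "2012-12-10 20:05:00", 30.0),    # outside 9AM-5PM
    ("Mumbai", "2012-12-11 11:00:00", 50.0),  # different city
    ("Pune", "2012-11-20 09:30:00", 40.0),    # different month
])

# Filter by city, month and hour of day, then aggregate.
avg_amount = conn.execute("""
    SELECT AVG(amount) FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H', started_at) BETWEEN '09' AND '16'
""").fetchone()[0]
print(avg_amount)
```

A query like this scans and aggregates many rows on dimensions chosen at query time, which is the opposite of the short, predictable transactions an OLTP system is tuned for.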
OLTP vs. Data Warehouse
OLTP | Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)
Who uses DWHs?
The ideal DWH user is something like this:
- This person's job involves drawing conclusions from, and making decisions based on, large masses of data.
- This person also doesn't want to access a database in a highly technical fashion.
- Any person who wants to analyze data and take a decision.
E.g.:
- Tourists: browse information harvested by farmers
- Farmers: harvest information from known access paths
Data Warehouse Architecture
[Figure: Data warehouse architecture. Source data (relational databases, legacy data, purchased data, ERP systems); data staging (optimized loader: extract, transform, clean, load, refresh); data storage (data warehouse engine); metadata (metadata repository); information delivery (analyze, query, OLAP, report/user query, data mining).]
Overview of the Components
- Source data component
- Data staging component
- Data storage component
- Information delivery component
- Metadata component
- Management and control component
Overview of the Components
Source data component:
- Production data: data that comes from the operational systems of the enterprise.
- Internal data: private data that every organization keeps.
- Archived data: data that comes from backups.
Overview of the Components: Getting data into the DWH
2. Data staging component
5 steps of data staging:
- Extraction
- Transformation
- Cleansing
- Loading
- Summarization
Overview of the Components: Getting data into the DWH
Data staging component
Extraction:
- Capture of relevant data from operational sources in "as is" status.
- Sources for data are generally legacy mainframes, in IMS, DB2; more data today is in relational databases on Unix.
Transformation:
- With multiple input sources to a data warehouse, inconsistencies can sometimes make data unusable. Transformation is the process of dealing with these inconsistencies.
- E.g., use of different names/formats: Mumbai, Bombay; Cust_id, C_id; dd/mm/yy, mm/dd/yy
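The three inconsistencies named above (city names, column names, date formats) can be handled with simple lookup tables, as in this sketch. The source labels `source_a` and `source_b`, the record layout and the helper `transform` are hypothetical.

```python
from datetime import datetime

# Hypothetical lookup tables for the three kinds of inconsistency.
CITY_ALIASES = {"bombay": "Mumbai", "mumbai": "Mumbai"}
COLUMN_ALIASES = {"C_id": "Cust_id"}
DATE_FORMATS = {"source_a": "%d/%m/%y", "source_b": "%m/%d/%y"}


def transform(source, record):
    """Map one source record onto common column names, city names and dates."""
    out = {COLUMN_ALIASES.get(k, k): v for k, v in record.items()}
    out["city"] = CITY_ALIASES.get(out["city"].strip().lower(), out["city"])
    parsed = datetime.strptime(out["order_date"], DATE_FORMATS[source])
    out["order_date"] = parsed.date().isoformat()
    return out


a = transform("source_a", {"Cust_id": 7, "city": "Bombay", "order_date": "12/02/13"})
b = transform("source_b", {"C_id": 7, "city": "Mumbai", "order_date": "02/12/13"})
print(a == b)
```

Both records describe the same order, and after transformation they compare equal even though the sources disagreed on naming and date layout.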
Overview of the Components: Getting data into the DWH
Data staging component
Cleansing:
- It is necessary to go through the data entered into the DWH and make it error free; this process is called data cleansing.
- This includes missing data and incorrect data in one source, and inconsistent and conflicting data when two or more sources are involved.
- Clean data is vital for the success of the warehouse.
- E.g.: Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person.
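One simple way to flag such name variants is approximate string matching. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the function `same_person`, the reference surname and the 0.8 threshold are assumptions chosen for this example, and real cleansing would also compare other attributes before merging records.

```python
import re
from difflib import SequenceMatcher


def same_person(name, canonical_surname="seshadri", threshold=0.8):
    """Return True if any word in the name is close to the reference surname."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return any(
        SequenceMatcher(None, token, canonical_surname).ratio() >= threshold
        for token in tokens
    )


variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
matches = [same_person(v) for v in variants]
print(matches)
```

All five spellings are flagged as candidates for the same person, while an unrelated name such as "Sharma" falls below the threshold.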
Overview of the Components: Getting data into the DWH
Data staging component
Loading:
- It implies physical movement of data from the computers storing the source databases to the one that will store the data warehouse.
- The most common channel for the data movement process is a high-speed communication link.
- It is generally necessary to close off access to the DWH while loading is taking place.
Overview of the Components: Getting data into the DWH
Data staging component
Summarization: any desired summaries of the data warehouse data are precalculated for later use. Once the DWH database has been loaded, it is possible to create summaries.
E.g.:
- base customer (1985-87): custid, from date, to date, name, phone, dob
- base customer (1988-90): custid, from date, to date, name, credit rating, employer
- customer activity (1986-89): monthly summary
- customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
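A monthly summary like the one in this example can be precalculated from the detail rows by grouping on customer and month. The sample rows and the function `monthly_summary` below are hypothetical; only the column names custid, activity date and amount come from the slide.

```python
from collections import defaultdict
from datetime import date

# Hypothetical customer activity detail rows.
activity_detail = [
    {"custid": 1, "activity_date": date(1987, 3, 2), "amount": 100.0},
    {"custid": 1, "activity_date": date(1987, 3, 20), "amount": 50.0},
    {"custid": 1, "activity_date": date(1987, 4, 5), "amount": 75.0},
    {"custid": 2, "activity_date": date(1987, 3, 9), "amount": 20.0},
]


def monthly_summary(rows):
    """Total amount per (custid, year, month), computed once for later use."""
    totals = defaultdict(float)
    for row in rows:
        d = row["activity_date"]
        totals[(row["custid"], d.year, d.month)] += row["amount"]
    return dict(totals)


summary = monthly_summary(activity_detail)
print(summary[(1, 1987, 3)])
```

Four detail rows collapse into three summary rows here; on real volumes the reduction is what makes later reporting queries cheaper.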
Overview of the Components: Getting data into the DWH
Data staging component
Summarized data stored: advantages
- reduces storage costs
- reduces CPU usage
- increases performance, since a smaller number of records has to be processed
- designed around traditional high-level reporting needs
Overview of the Components
Data storage component
- The data storage for the data warehouse is a separate repository.
- Updates on source data are propagated to the warehouse periodically (e.g., every night, every week) or after significant events.
- The refresh policy is set by the administrator based on user needs and traffic.
- Different sources may have different policies.
- This function is time consuming.
Overview of the Components
Information delivery component
In order to provide information to the wide community of data warehouse users, the information delivery component includes different methods. E.g.: online, intranet, internet, email.
Overview of the Components
Metadata component
Metadata are data that describe the data in the data warehouse.
Administrative metadata:
- source databases and their contents
- gateway descriptions
- warehouse schema, view & derived data definitions
- dimensions, hierarchies
- pre-defined queries and reports
- data mart locations and contents
- data partitions
- data extraction, cleansing, transformation rules
- data refresh
- user profiles, user groups
- security: user authorization, access control
Metadata component
Business metadata:
- business terms and definitions
- ownership of data
Operational metadata:
- data lineage: history of migrated data and sequence of transformations applied
- currency of data: active, archived
- monitoring information: warehouse usage statistics, error reports, audit trails
End-user metadata
Overview of the Components
Management and control component
- This component of the data warehouse sits on top of all other components. It coordinates services and activities within the DWH.
- It interacts with the metadata component to perform the management and control functions.
Content of DWH Data
Operational data to warehouse data
4 data levels:
- Operational data
- Atomic data
- Summary data
- Query processing
Eg1: My checking account balance right now is Rs 1200/-
Eg2: My checking account balance at the end of January was Rs 25000/-
Eg3: At the end of January 2011 we had 12500 customers
Eg4: Our customers in the Mumbai zone grew by 10% during the last three months
[Figure: The four data levels stacked from top to bottom: query processing, summary data, atomic data, operational data.]
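The four levels can be traced in a small sketch. All data below is hypothetical and scaled down (10 and 11 customers instead of thousands); the functions `summarize` and `growth_pct` are illustrative names, not part of any product.

```python
from datetime import date

OCT, JAN = date(2010, 10, 31), date(2011, 1, 31)

# Level 1, operational data: the value right now, overwritten on every change.
operational = {"cust_id": 1, "balance": 1200}

# Level 2, atomic data: dated snapshots kept in the warehouse.
atomic = (
    [{"cust_id": i, "zone": "Mumbai", "as_of": OCT, "balance": 1000} for i in range(1, 11)]
    + [{"cust_id": i, "zone": "Mumbai", "as_of": JAN, "balance": 1000} for i in range(1, 12)]
)


def summarize(snapshots):
    """Level 3, summary data: number of customers per (zone, snapshot date)."""
    counts = {}
    for row in snapshots:
        key = (row["zone"], row["as_of"])
        counts[key] = counts.get(key, 0) + 1
    return counts


def growth_pct(summary, zone, start, end):
    """Level 4, query processing: answer a business question from the summary."""
    old, new = summary[(zone, start)], summary[(zone, end)]
    return 100 * (new - old) / old


summary = summarize(atomic)
print(growth_pct(summary, "Mumbai", OCT, JAN))
```

Each level is derived from the one below it, so a question such as Eg4 is answered from summary data without touching the operational system.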
Data Warehouse vs. Data Marts
What comes first?
Data mart: a subset of a data warehouse that supports the requirements of a particular department or business function.
- A data mart focuses only on the requirements of users associated with one department or business function.
- Data marts do not normally contain detailed operational data, unlike data warehouses.
- As data marts contain less data than data warehouses, they are more easily understood and navigated.
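A data mart as a departmental subset can be sketched as a filter followed by a light summary. The warehouse rows, the column names and the function `build_data_mart` are assumptions for illustration.

```python
# Hypothetical warehouse rows spanning several departments.
warehouse = [
    {"dept": "Sales", "region": "Mumbai", "product": "A", "amount": 100.0},
    {"dept": "Sales", "region": "Pune", "product": "B", "amount": 40.0},
    {"dept": "Finance", "region": "Mumbai", "product": "A", "amount": 70.0},
    {"dept": "Sales", "region": "Mumbai", "product": "B", "amount": 60.0},
]


def build_data_mart(rows, dept, group_by):
    """Keep one department's rows, then total the amounts by one column."""
    mart = {}
    for row in rows:
        if row["dept"] == dept:
            key = row[group_by]
            mart[key] = mart.get(key, 0.0) + row["amount"]
    return mart


sales_mart = build_data_mart(warehouse, "Sales", "region")
print(sales_mart)
```

The resulting mart holds two summarized rows instead of four detailed ones and contains nothing from other departments, which is why it is easier to navigate.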
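From the Data Warehouse to Data Marts
[Figure: Three levels of structure: individually structured, departmentally structured, and the organizationally structured data warehouse. A scale from less to more is labelled history, normalized, detailed; the flow runs from data to information.]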
Reasons for creating a data mart
- To give users access to the data they need to analyze most often.
- To provide data in a form that matches the collective view of the data by a group of users in a department or business function.
- To improve end-user response time due to the reduction in the volume of data to be accessed.
- To provide appropriately structured data as dictated by the requirements of end-user access tools.
- Data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler.
- The cost of implementing data marts is normally less.
- The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project.
Data Warehouse and Data Marts
[Figure: Data marts (OLAP, lightly summarized, departmentally structured) above the detailed data warehouse data (atomic, organizationally structured).]
Characteristics of the Departmental Data Mart
- OLAP
- Small
- Flexible
- Customized by department
- Source is the departmentally structured data warehouse
Techniques for Creating Departmental Data Mart
- OLAP
- Subset
- Summarized
- Superset
- Indexed
- Arrayed
[Figure labels: Sales, Mktg., Finance]
Data Mart Centric
[Figure: Data sources, data marts, data warehouse]
Problems with Data Mart Centric Solution
If you end up creating multiple warehouses, integrating them is a problem
True Warehouse
[Figure: Data sources, data warehouse, data marts]
Bill Inmon vs. Ralph Kimball
Top-down vs. bottom-up with respect to data marts
A Comparison: DM vs. DWH
Data mart | Data warehouse
Limited subject areas (one or two) | Many subject areas
One data mart for one business theme/subject area | Enterprise repository with multiple data marts
Has limited dimensions and measures, depending on the subject area | Has all dimensions and measures required
Time to build is short | Long time activity
Can be built as a smaller-scale DW | Large scale
Trends in DWH
Continued growth in
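[Figure: Integration example. Operational systems hold the same data in different forms: Order Processing (Order ID = 10), Accounts Receivable (Order ID = 12), Product Management (Order ID = 8); HR System (Sex = M/F), Payroll (Sex = 1/2), Product Management (Sex = 0/1). The D/W holds one form: Order ID = 16, Sex = M/F.]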