data ware house

46

Click here to load reader

Upload: vesituniversity-of-mumbai

Post on 16-Apr-2017

1.816 views

Category:

Documents


0 download

TRANSCRIPT

Slide 1

2.Data warehousing

2/12/20132.Data warehouse/D.S.Jagli1

2/12/20131

Topics to be coveredGetting data into the Data warehouse.ExtractionTransformationCleansingLoadingSummarization Meta dataData MartsData warehousing & ERPData warehousing & KMData warehousing & CRM

2/12/20132.Data warehouse/D.S.Jagli2

2/12/20132.Data warehouse/D.S.Jagli3

Which are our lowest/highest margin customers ?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition ?

What impact will new products/services have on revenue and margins?

What product prom--otions have the biggest impact on revenue?

What is the most effective distribution channel?

A Sales Manager wants to know.

2/12/20133

2/12/20132.Data warehouse/D.S.Jagli4

Data, Data everywhereyet ...I cant find the data I needdata is scattered over the network.many versions, slight differences.

I cant get the data I needneed an expert to get the data.I cant understand the data I foundavailable data poorly documented.

I cant use the data I foundresults are unexpected.data needs to be transformed from one format to other.

2/12/20134

2/12/20132.Data warehouse/D.S.Jagli5

What is a Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

DataInformation

2/12/20135

What is a Data warehouseA data warehouse is a repository (collection of resources that can be accessed to retrieve information) of an organization's electronically stored data, designed to facilitate reporting and analysis. In simple form data warehouse is a collection of large amount of data.A single, complete and consistent store of data obtained from a variety of different sources made available to end users in what they can understand and use in a business context.[Barry Devlin]

2/12/20132.Data warehouse/D.S.Jagli6

2/12/20136

2/12/20132.Data warehouse/D.S.Jagli7

What are the users saying...Data should be integrated across the enterpriseSummary data has a real value to the organizationHistorical data holds the key to understanding data over time

2/12/20137

2/12/20132.Data warehouse/D.S.Jagli8

Data WarehouseA data warehouse is a Subject-orientedIntegratedTime-varyingNon-volatilecollection of data that is used primarily in organizational decision making.-- Bill Inmon, Building the Data Warehouse

2/12/20138

Subject Oriented: Data is organized by business topic not by customer_id

2/12/20132.Data warehouse/D.S.Jagli9Data is Integrated and Loaded by Subject

D/WData

1996199719961998

payt

Order

CustProd

2/12/20139

Integrate: Integrated means that data are stored as a single unit not as a collection of files that may have different structures .Eg:

2/12/20132.Data warehouse/D.S.Jagli10

2/12/201310

Time Variant: it means that a time dimension is explicitly included in the data so that trends and changes over time can be studied .

Data elements won t change i.e a single ,complete and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in business context.

Eg: Designated Time Frame (3 - 10 Years)

Key Includes Date2/12/20132.Data warehouse/D.S.Jagli11

2/12/201311

Non-Volatile: It means that the data dont keep changing, new data may be added on a scheduled basic but old data arent discarded

2/12/20132.Data warehouse/D.S.Jagli12No Data UpdateData Warehouse

LoadRead

ReadReadRead

2/12/201312

These definitions are quite different on the surface .yet several common threads emerge from reading them as group:A DWH containing a lot of data some of the data comes from operational source and outside.

A DWH is organized so as to facilitate the use of this data for decision making purpose.

The DWH provides tools by which end users can access this data2/12/20132.Data warehouse/D.S.Jagli13

2/12/201313

2/12/20132.Data warehouse/D.S.Jagli1414ODS Vs DWODSDATAWARE HOUSEStores current operational dataStores historic and current dataPerform numerous quick and simple queries on small amounts of data Perform complex queries on large amounts of data Contains small amount of transactional data contains large amounts of static data Day To Day Decisions,Current operational results, Tactical reporting, Long-Term Decisions,Strategic reporting,Trend detectionNear to Normal formStar SchemaFrequency of Load: Twice Daily , Daily, WeeklyFrequency of Load: Weekly, Monthly, Quarterly

2/12/201314

2/12/20132.Data warehouse/D.S.Jagli15

OLTP vs. Data WarehouseOLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse.

Special data organization, access methods and implementation methods are needed to support data warehouse queries.

e.g. average amount spent on phone calls between 9AM-5PM in Pune during the month of December.

2/12/201315

2/12/20132.Data warehouse/D.S.Jagli16

OLTP vs. Data WarehouseOLTPApplication OrientedUsed to run businessDetailed dataCurrent up to dateIsolated DataRepetitive accessClerical UserWarehouse (DSS)Subject OrientedUsed to analyze businessSummarized and refinedSnapshot dataIntegrated DataAd-hoc accessKnowledge User (Manager)

2/12/201316

Who uses DWHs?Ideal DWH user is something like this:This persons job involves drawing conclusions from ,making decisions based on, large masses of dataThis person also doesnt want to access a database in a highly technical fashionAny person who wants to analyze data and take a decisionEg: Tourists: Browse information harvested by farmers

Farmers: Harvest information from known access paths

2/12/20132.Data warehouse/D.S.Jagli17

2/12/201317

2/12/20132.Data warehouse/D.S.Jagli18Data Warehouse Architecture

Data Warehouse EngineOptimized LoaderEXTRACTTRANSFORMCLEANLOADREFRESH

AnalyzeQueryMetadata Repository

RelationalDatabasesLegacyDataPurchased DataERPSystemsData mining

OLAPReport/User QuerySource DataData StagingData StorageMeta DataInformation Delivery

2/12/201318

Overview of the ComponentsSource Data ComponentData Staging ComponentData Storage ComponentInformation Delivery ComponentMetadata ComponentManagement and Control Component2/12/20132.Data warehouse/D.S.Jagli19

2/12/20132.Data warehouse/D.S.Jagli20Overview of the ComponentsSource Data Component:Production Data : data comes from operational systems of the Enterprise.

Internal Data: in every organization keeps some private data.

Archived Data: it comes from back ups.

Topics to be coveredGetting data into the Data warehouse.ExtractionTransformationCleansingLoadingSummarization Meta dataData MartsData warehousing & ERPData warehousing & KMData warehousing & CRM

2/12/20132.Data warehouse/D.S.Jagli21

Overview of the Components: Getting data into the DWH2.Data Staging Component5 steps of data stagingExtractionTransformationCleansingLoadingSummarization

2/12/20132.Data warehouse/D.S.Jagli22

Overview of the Components: Getting data into the DWHData staging componentExtractingCapture of relevant data from operational source in as is statusSources for data generally in legacy mainframes in, IMS, DB2; more data today in relational databases on Unix.Transformation: In the case Multiple input sources to a data warehouse, inconsistency can sometimes make data unusable. Transformation is the process of dealing with these inconsistenciesEg: Use of different names/formats Mumbai, BombayCust_id, C_iddd/mm/yy, mm/dd/yy

2/12/20132.Data warehouse/D.S.Jagli23

Overview of the Components: Getting data into the DWHData staging componentCleansingIt is necessary to go through data entered into DWH and make it error free, this process is called Data cleansing.

This include missing data , incorrect data in one source, inconsistent data and conflicting data when two or more sources are involved.

Clean data is vital for the success of the warehouseEg:Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person

2/12/20132.Data warehouse/D.S.Jagli24

Overview of the Components: Getting data into the DWHData staging componentLoadingIt implies physical movement of data from the computers storing the source databases to that which will store the data warehouse.

Most common channel for the data movement process is a high-speed communication link.

It is always necessary to close off access to the DWH when the loading is taking place.

2/12/20132.Data warehouse/D.S.Jagli25

Overview of the Components: Getting data into the DWHData staging componentSummarization : In which any desired summaries of the data warehouse data are precalculated for later useOnce the DWH database has been loaded it is possible to create summaries. Eg: base customer (1985-87)custid, from date, to date, name, phone, dobbase customer (1988-90)custid, from date, to date, name, credit rating, employercustomer activity (1986-89) -- monthly summarycustomer activity detail (1987-89)custid, activity date, amount, clerk id, order no

2/12/20132.Data warehouse/D.S.Jagli26

Overview of the Components: Getting data into the DWHData staging componentSummarized data stored : advantagesreduce storage costsreduce CPU usageincreases performance since smaller number of records to be processeddesign around traditional high level reporting needs

2/12/20132.Data warehouse/D.S.Jagli27

Overview of the ComponentsData storage ComponentThe data storage for data warehouse is a separate repositoryPropagate updates on source data to the warehouseperiodically (e.g., every night, every week) or after significant events.

refresh policy set by administrator based on user needs and traffic.

possibly different policies for different sources.

This function is time consuming.

2/12/20132.Data warehouse/D.S.Jagli28

Overview of the ComponentsInformation delivery componentIn order to provide information to the wide community of data warehouse users the information delivery component includes different methods.E g: Online ,intranet, internet, email.

2/12/20132.Data warehouse/D.S.Jagli29

Overview of the ComponentsMeta data componentMeta data are data that describe data in the data warehouseAdministrative metadatasource databases and their contentsgateway descriptionswarehouse schema, view & derived data definitionsdimensions, hierarchiespre-defined queries and reportsdata mart locations and contentsdata partitionsdata extraction, cleansing, transformation rulesdata refreshuser profiles, user groupssecurity: user authorization, access control

2/12/20132.Data warehouse/D.S.Jagli30

Meta data componentBusiness meta databusiness terms and definitionsownership of dataoperational metadatadata lineage: history of migrated data and sequence of transformations appliedcurrency of data: active, archived,monitoring information: warehouse usage statistics, error reports, audit trails.End- user meta data

2/12/20132.Data warehouse/D.S.Jagli31

Overview of the ComponentsManagement and control componentThis component of the data warehouse sits on the top of all other components. it coordinates services and activities with within the DWH

It interacts with meta data component to perform the management and control functions2/12/20132.Data warehouse/D.S.Jagli32

Content of DWH DataOperational data to warehouse data

4 Data levels:Operational dataAtomic dataSummary dataQuery processing

Eg1: My checking account balance right now is 1200/-RsEg2: My checking account balance at the end of January was 25000/-RsEg3: At the end of January 2011 we have 12500 customersEg4: Our customers in Mumbai zone grew by 10% during last three months

2/12/20132.Data warehouse/D.S.Jagli33Query processing

Summary data

Atomic dataOperational data

2/12/201333

Data Warehouse vs. Data MartsWhat comes first

Data mart Data mart Data mart a subset of a data warehouse that supports the requirements of particular department or business function.

Data mart focuses on only the requirements of users associated with one department or business function.

Data marts do not normally contain detailed operational data, unlike data warehouses.

As data marts contain less data compared with data warehouses, data marts are more easily understood and navigated.

2/12/20132.Data warehouse/D.S.Jagli35

From the Data Warehouse to Data Marts36

DepartmentallyStructured

IndividuallyStructuredData WarehouseOrganizationallyStructuredLessMore

HistoryNormalizedDetailedDataInformation

Reasons for creating a data martTo give users access to the data they need to analyze most oftenTo provide data in a form that matches the collective view of the data by a group of users in a department or business functionTo improve end-user response time due to the reduction in the volume of data to be accessedTo provide appropriately structured data as dictated by the requirements of end-user access tools data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler.The cost of implementing data marts is normally lessThe potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project2/12/20132.Data warehouse/D.S.Jagli37

Data Warehouse and Data Marts38

OLAPData MartLightly summarizedDepartmentally structuredOrganizationally structuredAtomicDetailed Data Warehouse Data

Characteristics of the Departmental Data MartOLAPSmallFlexibleCustomized by DepartmentSource is departmentally structured data warehouse39

Techniques for Creating Departmental Data MartOLAPSubsetSummarizedSupersetIndexedArrayed40

SalesMktg.Finance

Data Mart Centric41

Data MartsData SourcesData Warehouse

Problems with Data Mart Centric Solution42

If you end up creating multiple warehouses, integrating them is a problem

True Warehouse43

Data MartsData SourcesData Warehouse

Bill Inman Vs Ralph KimballTop -down Vs Bottom -up w.r.t Data Mart2/12/20132.Data warehouse/D.S.Jagli44

A Comparison DM vs DWH2/12/20132.Data warehouse/D.S.Jagli45DATA MARTDATAWARE HOUSELimited subject areas (one or two)Many subject areas One data mart for one business theme/ subject areaEnterprise repository with multiple data martsHas limited dimensions and measures depends on the subject areaHas all dimensions and measures requiredTime to build is shortLong time activityCan be built as smaller scale DWLarge scale

Trends in DWHContinued growth in2/12/20132.Data warehouse/D.S.Jagli46

OperationalSystemsOrder ProcessingOrder ID = 10

D/WAccounts ReceivableOrder ID = 12

Order ID = 16Product ManagementOrder ID = 8HR SystemSex = M/F

D/WPayrollSex = 1/2

Sex = M/FProduct ManagementSex = 0/1