data ware house in brief


Upload: jyoti-pahadwa

Post on 08-Apr-2018


TRANSCRIPT

Page 1: DATA WARE HOUSE in brief

8/7/2019 DATA WARE HOUSE in brief

http://slidepdf.com/reader/full/data-ware-house-in-brief 1/27

What is a Data Warehouse?

Shipra Varshney

Lecture - MBA

Page 2: DATA WARE HOUSE in brief


What Is a Data Warehouse?

Nobody can agree

So I'm not actually going to define a DW

Don't feel cheated, though

By the end of this talk, you'll
• Understand key concepts that underlie all warehouse implementations ("talk the talk")
• Understand the various components out of which DW architects construct real-world data warehouses
• Understand what a data warehouse project looks like

Page 3: DATA WARE HOUSE in brief


Why Are Schools Setting Up Data Warehouses?

A data warehouse makes it easier to:
• Optimize classroom, computer lab usage
• Refine admissions ratings systems
• Forecast future demand for courses, majors
• Tie private spreadsheet data into central repositories
• Correlate admissions and IR data with outcomes such as:
    - GPAs
    - Placement rates
    - Happiness, as measured by alumni surveys
• Notify advisors when extra help may be needed based on
    - Admissions data (student vitals; SAT, etc.)
    - Special events: A-student suddenly gets a C in his/her major
    - Slower trends: Student's GPA falls for > 2 semesters/terms
• (Many other examples could be given!)

Better information = better decisions
• Better admission decisions
• Better retention rates
• More effective fund raising, etc.

Page 4: DATA WARE HOUSE in brief


Talking The Talk

To think and communicate usefully about data warehouses you'll need to understand a set of common terms and concepts:
• OLTP
• ODS
• OLAP, ROLAP, MOLAP
• ETL
• Star schema
• Conformed dimension
• Data mart
• Cube
• Metadata

Even if you're not an IT person, pay heed:
• You'll have to communicate with IT people
• More importantly:
    - Evidence shows that IT will only build a successful warehouse if you are intimately involved!

Page 5: DATA WARE HOUSE in brief


OLTP

OLTP = online transaction processing

The process of moving data around to handle day-to-day affairs
• Scheduling classes
• Registering students
• Tracking benefits
• Recording payments, etc.

Systems supporting this kind of activity are called transactional systems

Page 6: DATA WARE HOUSE in brief


Transactional Systems

Transactional systems are optimized primarily for the here and now
• Can support many simultaneous users
• Can support heavy read/write access
• Allow for constant change
• Are big, ugly, and often don't give people the data they want
    - As a result a lot of data ends up in shadow databases
    - Some ends up locked away in private spreadsheets

Transactional systems don't record all previous data states

Lots of data gets thrown away or archived, e.g.:
• Admissions data
• Enrollment data
• Asset tracking data ("How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?")

Page 7: DATA WARE HOUSE in brief


Simple Transactional Database

Map of Microsoft Windows Update Service (WUS) back-end database
• Diagrammed using Sybase PowerDesigner
    - Each green box is a database "table"
    - Arrows are "joins" or foreign keys

This is simple for an OLTP back end

Page 8: DATA WARE HOUSE in brief


Page 9: DATA WARE HOUSE in brief


ODS

ODS = operational data store

ODSs were an early workaround to the "reporting problem"

To create an ODS you
• Build a separate/simplified version of an OLTP system
• Periodically copy data into it from the live OLTP system
• Hook it to operational reporting tools

An ODS can be an integration point or real-time "reporting database" for an operational system

It's not enough for full enterprise-level, cross-database analytical processing
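The three ODS steps above can be sketched roughly as follows. This is a minimal illustration, not a recommended design: the table names, the sample rows, and the `refresh_ods` helper are all hypothetical, and two in-memory SQLite databases stand in for the live OLTP system and the ODS.

```python
import sqlite3

# Hypothetical live OLTP system: normalized tables that change all day.
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE registration (student_id INTEGER, course_id INTEGER, status TEXT);
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO course VALUES (10, 'Statistics'), (11, 'Databases');
    INSERT INTO registration VALUES
        (1, 10, 'enrolled'), (2, 10, 'enrolled'), (2, 11, 'dropped');
""")

# Hypothetical ODS: one flat, simplified table for reporting tools to read.
ods = sqlite3.connect(":memory:")
ods.execute("CREATE TABLE enrollment_report (student TEXT, course TEXT, status TEXT)")

def refresh_ods(source, target):
    """Replace the ODS contents with a fresh, flattened copy of the OLTP data."""
    rows = source.execute("""
        SELECT s.name, c.title, r.status
        FROM registration r
        JOIN student s ON s.id = r.student_id
        JOIN course c ON c.id = r.course_id
        ORDER BY s.name, c.title
    """).fetchall()
    with target:  # commit on success, roll back on error
        target.execute("DELETE FROM enrollment_report")
        target.executemany("INSERT INTO enrollment_report VALUES (?, ?, ?)", rows)
    return len(rows)

# In practice a scheduler would call this periodically.
copied = refresh_ods(oltp, ods)

# An operational report reads the ODS, not the live system.
enrolled = ods.execute("""
    SELECT course, COUNT(*) FROM enrollment_report
    WHERE status = 'enrolled' GROUP BY course
""").fetchall()
```

The report query never touches the live system, which is the point of the workaround; the cost is that the ODS is only as fresh as its last refresh.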

Page 10: DATA WARE HOUSE in brief


OLAP

OLAP = online analytical processing

OLAP is the process of creating and summarizing historical, multidimensional data
• To help users understand the data better
• Provide a basis for informed decisions
• Allow users to manipulate and explore data themselves, easily and intuitively

More than just "reporting"

Reporting is just one (static) product of OLAP

Page 11: DATA WARE HOUSE in brief


OLAP Support Databases

OLAP systems require support databases

These databases typically
• Support fewer simultaneous users than OLTP back ends
• Are structured simply; i.e., denormalized
• Can grow large
    - Hold snapshots of data in OLTP systems
    - Provide history/time depth to our analyses
• Are optimized for read (not write) access
• Are updated via periodic batch (e.g., nightly) ETL processes
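The idea of holding snapshots to gain time depth can be sketched as below. Everything here is an assumption made for the example: the `asset` and `asset_snapshot` tables, the dates, and the `take_snapshot` helper, which stands in for a nightly batch job.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Live OLTP-style table: one row per asset, overwritten in place.
db.execute("CREATE TABLE asset (id INTEGER PRIMARY KEY, status TEXT)")
# Support-database table: one row per asset per snapshot date, never overwritten.
db.execute("CREATE TABLE asset_snapshot (snapshot_date TEXT, id INTEGER, status TEXT)")

def take_snapshot(conn, snapshot_date):
    """Append the current state of the live table, stamped with the batch date."""
    conn.execute(
        "INSERT INTO asset_snapshot SELECT ?, id, status FROM asset",
        (snapshot_date,),
    )

db.executemany("INSERT INTO asset VALUES (?, ?)", [(1, "in service"), (2, "in service")])
take_snapshot(db, "2006-01-01")

# The transactional system changes; the old state is gone from `asset`...
db.execute("UPDATE asset SET status = 'retired' WHERE id = 2")
take_snapshot(db, "2006-01-02")

# ...but the snapshots still show how many assets were in service on each date.
history = db.execute("""
    SELECT snapshot_date, COUNT(*) FROM asset_snapshot
    WHERE status = 'in service'
    GROUP BY snapshot_date ORDER BY snapshot_date
""").fetchall()
```

The snapshot table only ever grows, which is one reason these databases can get large.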

Page 12: DATA WARE HOUSE in brief


ETL Processes

ETL = extract, transform, load
• Extract data from various sources
• Transform and clean the data from those sources
• Load the data into databases used for analysis and reporting

ETL processes are coded in various ways
• By hand in SQL, UniBASIC, etc.
• Using more general programming languages
• In semi-automated fashion using specialized ETL tools like Cognos Decision Stream

Most institutions do hand ETL; but note well:
• Hand ETL is slow
• Requires specialized knowledge
• Becomes extremely difficult to maintain as code accumulates and databases/personnel change!
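As an illustration of hand-coded ETL in a general programming language, here is a minimal sketch using only the Python standard library. The CSV layout, the cleaning rules, and the `student_gpa` table are assumptions made for the example, not part of any real system.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: a CSV export with inconsistent formatting.
SOURCE_CSV = """student_id,major,gpa
1, biology ,3.4
2,History,
3,BIOLOGY,2.9
"""

def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim and normalize labels, convert types, drop rows without a GPA."""
    clean = []
    for row in rows:
        gpa = row["gpa"].strip()
        if not gpa:
            continue  # assumption: a missing GPA is rejected rather than guessed
        clean.append((int(row["student_id"]), row["major"].strip().title(), float(gpa)))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the analysis database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS student_gpa (student_id INTEGER, major TEXT, gpa REAL)"
    )
    with conn:  # one transaction per batch
        conn.executemany("INSERT INTO student_gpa VALUES (?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT student_id, major, gpa FROM student_gpa ORDER BY student_id"
).fetchall()
```

Even in this toy version the cleaning rules live inside `transform`, so every new source quirk means more code, which hints at why hand ETL becomes hard to maintain as it accumulates.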

Page 13: DATA WARE HOUSE in brief


Where Does the Data Go?

What sort of a database do the ETL processes dump data into?

Typically, into very simple table structures

These table structures are:
• Denormalized
• Minimally branched/hierarchized
• Structured into star schemas

Page 14: DATA WARE HOUSE in brief


So What Are Star Schemas?

Star schemas are collections of data arranged into star-like patterns
• They have fact tables in the middle, which contain amounts, measures (like counts, dollar amounts, GPAs)
• Dimension tables around the outside, which contain labels and classifications (like names, geocodes, majors)
• For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term)

Star schemas should
• Have descriptive column/field labels
• Be easy for users to understand
• Perform well on queries

Page 15: DATA WARE HOUSE in brief


A Very Simple Star Schema

Data Center UPS Power Output

Dimensions:
• Phase
• Time
• Date

Facts:
• Volts
• Amps
• Etc.
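The slide's diagram is not reproduced in this transcript, so the following is only a guess at how such a star might look as tables. The table and column names (`dim_phase`, `fact_ups_output`, and so on) and all sample readings are hypothetical; only the choice of dimensions (phase, time, date) and facts (volts, amps) comes from the slide.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension tables: labels and classifications.
    CREATE TABLE dim_phase (phase_key INTEGER PRIMARY KEY, phase_name TEXT);
    CREATE TABLE dim_date  (date_key  INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE dim_time  (time_key  INTEGER PRIMARY KEY, time_of_day TEXT);

    -- Fact table in the middle: one row per reading, keyed by the dimensions.
    CREATE TABLE fact_ups_output (
        phase_key INTEGER REFERENCES dim_phase (phase_key),
        date_key  INTEGER REFERENCES dim_date (date_key),
        time_key  INTEGER REFERENCES dim_time (time_key),
        volts REAL,
        amps  REAL
    );

    INSERT INTO dim_phase VALUES (1, 'Phase A'), (2, 'Phase B');
    INSERT INTO dim_date  VALUES (1, '2006-03-01');
    INSERT INTO dim_time  VALUES (1, '00:00'), (2, '12:00');
    INSERT INTO fact_ups_output VALUES
        (1, 1, 1, 120.0, 10.0),
        (1, 1, 2, 118.0, 14.0),
        (2, 1, 1, 121.0,  8.0),
        (2, 1, 2, 119.0, 12.0);
""")

# A typical star query: join facts to a dimension, group by its label.
avg_amps_by_phase = db.execute("""
    SELECT p.phase_name, AVG(f.amps)
    FROM fact_ups_output f
    JOIN dim_phase p ON p.phase_key = f.phase_key
    GROUP BY p.phase_name
    ORDER BY p.phase_name
""").fetchall()
```

Every query has the same shape (facts joined to one or more dimensions, then grouped), which is what makes the structure easy for users and tools to work with.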

Page 16: DATA WARE HOUSE in brief


A More Complex Star Schema

Freshman survey data (HERI/CIRP)

Dimensions:
• Questions
• Survey years
• Data about test takers

Facts:
• Answer (text)
• Answer (raw)
• Count (1)

Oops
• Not a star
• Snowflaked!

Oops, answers should have been placed in their own dimension (creating a "factless fact table"). I'll demo a better version of this star later!

Page 17: DATA WARE HOUSE in brief


Data Marts

One definition:
• One or more star schemas that present data on a single or related set of business processes

Data marts should not be built in isolation

They need to be connected via dimensional tables that are
• The same or subsets of each other
• Hierarchized the same way internally

So, e.g., if I construct data marts for...
• GPA trends, student major trends, enrollments
• Freshman survey data, senior survey data, etc.

...I connect these marts via a conformed student dimension
• Makes correlation of data across star schemas intuitive
• Makes it easier for OLAP tools to use the data
• Allows nonspecialists to do much of the work
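A rough sketch of what a conformed student dimension buys you: two separate fact tables (one per mart) that share the same dimension keys can be summarized independently and then lined up. The table names, columns, and sample values below are all hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One conformed student dimension shared by both marts.
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, major TEXT);

    -- Mart 1: GPA facts.  Mart 2: survey facts.  Both use student_key.
    CREATE TABLE fact_gpa    (student_key INTEGER, term TEXT, gpa REAL);
    CREATE TABLE fact_survey (student_key INTEGER, satisfaction INTEGER);

    INSERT INTO dim_student VALUES (1, 'Biology'), (2, 'Biology'), (3, 'History');
    INSERT INTO fact_gpa VALUES
        (1, '2006-Fall', 3.0), (2, '2006-Fall', 4.0), (3, '2006-Fall', 3.5);
    INSERT INTO fact_survey VALUES (1, 4), (2, 2), (3, 5);
""")

# Drill across: summarize each mart by the shared dimension, then line the
# two summaries up on the dimension's label.
correlated = db.execute("""
    SELECT g.major, g.avg_gpa, s.avg_satisfaction
    FROM (SELECT d.major, AVG(f.gpa) AS avg_gpa
          FROM fact_gpa f JOIN dim_student d USING (student_key)
          GROUP BY d.major) AS g
    JOIN (SELECT d.major, AVG(f.satisfaction) AS avg_satisfaction
          FROM fact_survey f JOIN dim_student d USING (student_key)
          GROUP BY d.major) AS s
      ON s.major = g.major
    ORDER BY g.major
""").fetchall()
```

If each mart had its own student table with different keys or a different definition of "major", this join would need custom matching logic for every pair of marts.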

Page 18: DATA WARE HOUSE in brief


Simple Data Mart Example

UPS

Battery star
• By battery
• Run-time
• % charged
• Current

Input star
• By phase
• Voltage
• Current

Output star
• By phase
• Voltage
• Current

Sensor star
• By sensor
• Temp
• Humidity

Note conformed date, time dimensions!

Page 19: DATA WARE HOUSE in brief


ROLAP, MOLAP

ROLAP = OLAP via direct relational query
• E.g., against a (materialized) view
• Against star schemas in a warehouse

MOLAP = OLAP via multidimensional database (MDB)
• MDB is a special kind of database
• Treats data kind of like a big, fast spreadsheet
• MDBs typically draw data in from a data warehouse
    - Built to work best with star schemas

Page 20: DATA WARE HOUSE in brief


Metadata

Metadata = data about data

In a data warehousing context it can mean many things
• Information on data in source OLTP systems
• Information on ETL jobs and what they do to the data
• Information on data in marts/star schemas
• Documentation in OLAP tools on the data they manipulate

Many institutions make metadata available via data malls or warehouse portals, e.g.:
• University of New Mexico
• UC Davis
• Rensselaer Polytechnic Institute
• University of Illinois

Good ETL tools automate the setup of malls/portals!
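As a small illustration of "information on data in marts/star schemas", the sketch below combines a database's own column catalog with hand-written descriptions. The `dim_student` table, the description texts, and the `describe_table` helper are hypothetical; `PRAGMA table_info` is SQLite's built-in way to list a table's columns.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, major TEXT, entry_term TEXT)"
)

# Hypothetical business descriptions that a data steward would maintain by hand.
DESCRIPTIONS = {
    "student_key": "Warehouse-assigned identifier for a student",
    "major": "Declared major as of the latest load",
    "entry_term": "First term in which the student enrolled",
}

def describe_table(conn, table):
    """Combine the database's own column catalog with the business descriptions."""
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, default value, pk).
    # The table name is interpolated directly, so it must come from trusted code.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"column": name, "type": col_type, "description": DESCRIPTIONS.get(name, "")}
        for _cid, name, col_type, _notnull, _default, _pk in columns
    ]

catalog = describe_table(db, "dim_student")
```

A portal or "data mall" is, in essence, a browsable presentation of records like those in `catalog`.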

Page 21: DATA WARE HOUSE in brief


The Data Warehouse

OK, now we're experts in terms like OLTP, OLAP, star schema, metadata, etc.

Let's use some of these terms to describe how a DW works:
• Provides ample metadata - data about the data
• Utilizes easy-to-understand column/field names
• Feeds multidimensional databases (MDBs)
• Is updated via periodic (mainly nightly) ETL jobs
• Presents data in a simplified, denormalized form
• Utilizes star-like fact/dimension table schemas
• Encompasses multiple, smaller data "marts"
• Supports OLAP tools (Access/Excel, Safari, Cognos BI)
• Derives data from (multiple) back-end OLTP systems
• Houses historical data, and can grow very big

Page 22: DATA WARE HOUSE in brief


A Data Warehouse is Not...

Vendor and consultant proclamations aside, a data warehouse is not:
• A project
    - With a specific end date
• A product you buy from a vendor
    - Like an ODS (such as SCT's)
    - A canned "warehouse" supplied by iStrategy
    - Cognos ReportNet
• A database schema or instance
    - Like Oracle
    - SQL Server
• A cut-down version of your live transactional database

Page 23: DATA WARE HOUSE in brief


Kimball & Caserta's Definition

According to Ralph Kimball and Joe Caserta, a data warehouse is:

A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.

Another def.: The union of all the enterprise's data marts

Aside: The Kimball model is not without some critics:
• E.g., Bill Inmon

Page 24: DATA WARE HOUSE in brief


Example Data Warehouse (1)

This one is RPI's

5 parts:
• Sources
• ETL stuff
• DW proper
• Cubes etc.
• OLAP apps

Page 25: DATA WARE HOUSE in brief


Implementing a Data Warehouse

In many organizations IT people want to huddle and work out a warehousing plan, but in fact
• The purpose of a DW is decision support
• The primary audience of a DW is therefore College decision makers
• It is College decision makers therefore who must determine
    - Scope
    - Priority
    - Resources

Decision makers can't make these determinations without an understanding of data warehouses

It is therefore imperative that key decision makers first be educated about data warehouses
• Once this occurs, it is possible to
    - Elicit requirements (a critical step that's often skipped)
    - Determine priorities/scope
    - Formulate a budget
    - Create a plan and timeline, with real milestones and deliverables!

Page 26: DATA WARE HOUSE in brief


What Takes Up the Most Time?

You may be surprised to learn what DW step takes the most time

Try guessing which:
• Hardware
• Physical database setup
• Database design
• ETL
• OLAP setup

According to Kimball & Caserta, ETL will eat up 70% of the time. Other analysts give estimates ranging from 50% to 80%.

The most often underestimated part of the warehouse project!

[Chart: share of warehouse project time by step (Hardware, Database, ETL, Schemas, OLAP tools)]

Page 27: DATA WARE HOUSE in brief


Conclusion

Information is held in transactional systems
• But transactional systems are complex
• They don't talk to each other well; each is a silo
• They require specially trained people to report off of

For normal people to explore institutional data, data in transactional systems needs to be
• Renormalized as star schemas
• Moved to a system optimized for analysis
• Merged into a unified whole in a data warehouse

Note: This process must be led by "customers"
• Yes, IT people must build the infrastructure
• But IT people aren't the main customers

So who are the customers?
• Admissions officers trying to make good admission decisions
• Student counselors trying to find/help students at risk
• Development officers raising funds that support the College
• Alumni affairs people trying to manage volunteers
• Faculty deans trying to right-size departments
• IT people managing software/hardware assets, etc.