DWH Architecture Best Practices

Upload: srini-rayankula

Post on 15-Oct-2015

TRANSCRIPT

  • Data Warehouse Architecture Best Practices

    Presenter: Bhimappa Desai

  • Agenda

    Introductions

    Business Intelligence Background

    Architecture Best Practices

    Questions & Answers

  • Business Intelligence Background

    Data Warehouse Architecture Best Practices

  • BI- Nutshell

  • What is Business Intelligence?

    A Data Warehouse is usually one component of an overall business intelligence solution

    IT people may be tempted to think in terms of products and technologies

    BUT...

  • BI Goal

    The overarching goal of business intelligence is to provide the information necessary to MANAGE a business

    This means providing information in support of management decision making, which is why BI is also called Decision Support

  • BI is about Data Abstraction

    wisdom

    knowledge

    information

    data

    audience for a data warehouse typically considers higher slices of data abstraction pyramid

    lowest level of pyramid is too detailed & unwieldy

    Stages (4)

  • It's Not Technology

    Business Intelligence is about delivering business value

    provide tangible benefit by answering important questions that can help the business to achieve its strategic focus

    Improving profitability: Who are our five most profitable clients? What are our least profitable products?

    Reducing cost: Who are our lowest cost suppliers? Which materials incur highest spoilage costs?

    Improving customer satisfaction: What factors may lead to lost customers?

  • Business of BI

    Many leading companies use BI to achieve competitive advantage, e.g. Walmart, Dell, Amazon.com, Kraft, American Express, etc.

  • Data Warehouse Architecture

    Architecture is about delivering an elegant solution that meets the solution requirements

    This means really understanding the problem

    DW architecture is part art, part science

  • Good Architecture

    It's not easy to describe a good design, but we will know it when we see it

  • BI Architecture Requirements

    Must recognize change as a constant

    Take incremental development approach

    Existing applications must continue to work

    Need to allow more data and new types of data to be added

  • End User Acceptance

    Understandability

    understandability is in the eyes of the beholder

    want to hide the complexity

    try to make it:

    intuitive, obvious

    visible, memorable

  • End User Acceptance

    Performance

    don't want to interrupt the thinking process

    provide one click, instantaneous access

    the warehouse must be available; treat it as a production system

  • Architecture Best Practices

  • High Level Architecture

    remember the different worlds

    on-line transaction processing (OLTP)

    business intelligence systems (BIS)

    users are different

    data content is different

    data structures are different

    architecture & methodology must be different

  • Two Different Worlds

    On-Line Transaction Processing

    Entity-Relationship Data Model

    created in the 1970s to address performance issues with relational database implementations

    normalized to most efficiently get data in

    divides the data into many discrete entities

    many relationships between these entities

    this approach was documented by C.J. Date in An Introduction to Database Systems

  • Two Different Worlds

    Business Intelligence Systems

    Dimensional Data Model

    also called star schema

    designed to easily get information out

    fewer relationships than an ER model; the central (fact) table is the only table with multiple joins to other tables

    developed in 1960s by data service providers, formalized by Ralph Kimball in The Data Warehouse Toolkit

  • Entity-Relationship Disadvantages

    all tables look the same

    people can't visualize/remember diagrams

    software can't navigate as schema becomes too complex

    business processes mixed together

    many artificial keys created

  • Dimensional Model Advantages

    Simplicity

    humans can navigate and remember

    software can navigate deterministically

    business process explicitly separated (Data Mart)

    fewer keys (number of keys = number of attendant tables)

  • Best Practice #1

    Use a data model that is optimized for information retrieval

    Dimensional model

    Denormalized

    Hybrid approach
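
    As a rough illustration of a model optimized for retrieval, the sketch below builds a tiny star schema in an in-memory SQLite database and runs one report query. The table names (fact_sales, dim_product, dim_date), columns, and sample rows are hypothetical and not taken from the presentation.

```python
import sqlite3

# Hypothetical star schema: one central fact table joined to two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    month         TEXT
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES
    (20150101, '2015-01-01', '2015-01'), (20150201, '2015-02-01', '2015-02');
INSERT INTO fact_sales VALUES
    (1, 20150101, 2, 20.0), (2, 20150101, 1, 15.0),
    (3, 20150201, 4, 40.0), (1, 20150201, 1, 10.0);
""")


def sales_by_category_and_month(connection):
    """Total sales amount per product category and month.

    Every join goes through the central fact table, which is what keeps
    the query shape predictable for people and for query tools.
    """
    return connection.execute("""
        SELECT p.category, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date    AS d ON d.date_key    = f.date_key
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
    """).fetchall()


report = sales_by_category_and_month(conn)
```

    The dimensions here are deliberately denormalized (category lives on the product row), so a report needs only one join per dimension.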

  • Data Acquisition Processes

    Extract, Transform, Load (ETL)

    the process of unloading or copying data from the source systems, transforming it into the format and data model required in the BI environment, and loading it to the DW

    also, a software development tool for building ETL processes (an ETL tool)

    many production DWs use COBOL or other general-purpose programming languages to implement ETL
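
    The three ETL steps can be sketched in a general-purpose language. The example below is a minimal illustration only: the CSV layout, the dw_orders table, and the function names extract, transform, and load are assumptions made for this sketch.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be a file or an
# operational database rather than an inline string.
SOURCE_CSV = """order_id,customer,amount,order_date
1001, alice ,19.99,15/10/2015
1002,BOB,5.00,16/10/2015
"""


def extract(text):
    """Extract: copy the rows out of the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and formats to what the DW model expects."""
    records = []
    for row in rows:
        day, month, year = row["order_date"].split("/")
        records.append((
            int(row["order_id"]),
            row["customer"].strip().title(),   # normalize customer names
            float(row["amount"]),
            f"{year}-{month}-{day}",           # DD/MM/YYYY -> ISO 8601
        ))
    return records


def load(connection, records):
    """Load: write the transformed records into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS dw_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, "
        "amount REAL, order_date TEXT)"
    )
    connection.executemany("INSERT INTO dw_orders VALUES (?, ?, ?, ?)", records)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM dw_orders ORDER BY order_id").fetchall()
```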

  • Data Quality Assurance

    data cleansing

    the process of validating and enriching the data as it is published to the DW

    also, a software development tool for building data cleansing processes (a data cleansing tool)

    many production DWs have only very rudimentary data quality assurance processes
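
    A rudimentary cleansing step might look like the following sketch, which validates each record and enriches it from reference data before it is published. The rules, the field names, and the COUNTRY_NAMES lookup are illustrative assumptions, not part of the presentation.

```python
# Hypothetical reference data used to enrich records.
COUNTRY_NAMES = {"US": "United States", "IN": "India"}


def cleanse(record):
    """Validate and enrich one record; return (clean_record, errors)."""
    errors = []
    clean = dict(record)

    # Validation: customer name must be present.
    clean["customer"] = (record.get("customer") or "").strip()
    if not clean["customer"]:
        errors.append("customer is missing")

    # Validation: amount must be a non-negative number.
    try:
        clean["amount"] = float(record.get("amount"))
        if clean["amount"] < 0:
            errors.append("amount is negative")
    except (TypeError, ValueError):
        errors.append("amount is not numeric")

    # Enrichment: add the country name for a known country code.
    code = (record.get("country") or "").strip().upper()
    clean["country"] = code
    clean["country_name"] = COUNTRY_NAMES.get(code)
    if clean["country_name"] is None:
        errors.append(f"unknown country code: {code!r}")

    return clean, errors


def publish(records):
    """Split records into those fit for the DW and those needing review."""
    accepted, rejected = [], []
    for record in records:
        clean, errors = cleanse(record)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(clean)
    return accepted, rejected
```

    Keeping the rejected rows together with their error messages, rather than silently dropping them, gives the team something concrete to review when users question the data.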

  • Data Acquisition & Cleansing

    getting data loaded efficiently and correctly is critical to the success of DW

    implementation of data acquisition & cleansing processes represents 50 to 80% of the effort on typical DW projects

    inaccurate data content can undermine user acceptance

  • Best Practice #2

    Carefully design the data acquisition and cleansing processes for your DW

    Ensure the data is processed efficiently and accurately

    Consider acquiring ETL and Data Cleansing tools

    Use them well!

  • Data Model

    Already discussed the benefits of a dimensional model

    No matter whether dimensional modeling or any other design approach is used, the data model must be documented

  • Documenting the Data Model

    The best practice is to use some kind of data modeling tool

    CA ERwin

    Sybase PowerDesigner

    Oracle Designer

    IBM Rational Rose

    Etc.

    Different tools support different modeling notations, but they are more or less equivalent anyway

    Most tools allow sharing of their metadata with an ETL tool

  • Data Model Standards

    data model standards appropriate for the environment and tools chosen in your data warehouse should be adopted

    considerations should be given to data access tool(s) and integration with overall enterprise standards

    standards must be documented and enforced within the DW team

    someone must own the data model

    to ensure a quality data model, all changes should be reviewed through some formal process

  • Data Model Metadata

    Business definitions should be recorded for every field (unless they are technical fields only)

    Domain of data should be recorded

    Sample values should be included

    As more metadata is populated into the modeling tool it becomes increasingly important to be able to share this data across ETL and Data Access tools
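
    One possible way to hold this field-level metadata in a shareable form is sketched below. The FieldMetadata class, its attribute names, and the JSON export are assumptions for illustration; a real modeling tool would have its own repository format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldMetadata:
    """Metadata recorded for one field of the data model (hypothetical layout)."""
    table: str
    column: str
    business_definition: str
    domain: str
    sample_values: list = field(default_factory=list)
    technical_only: bool = False


def missing_definitions(fields):
    """List business fields that still lack a business definition."""
    return [
        f"{f.table}.{f.column}"
        for f in fields
        if not f.technical_only and not f.business_definition.strip()
    ]


def export_metadata(fields):
    """Serialize the metadata so ETL and data access tools could read it."""
    return json.dumps([asdict(f) for f in fields], indent=2)


catalog = [
    FieldMetadata("dim_customer", "customer_name",
                  "Legal name of the customer", "free text", ["Acme Ltd"]),
    FieldMetadata("dim_customer", "segment", "", "RETAIL | CORPORATE", ["RETAIL"]),
    FieldMetadata("dim_customer", "load_batch_id", "", "integer", [42],
                  technical_only=True),
]
gaps = missing_definitions(catalog)
```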

  • Metadata Architecture

    The strategy for sharing data model and other metadata should be formalized and documented

    Metadata management tools should be considered & the overall metadata architecture should be carefully planned

  • Best Practice #3

    Design a metadata architecture that allows sharing of metadata between components of your DW

    consider metadata standards such as OMG's Common Warehouse Metamodel (CWM)

    Source : www.service-architecture.com/web-services/articles/common_warehouse_meta-model_cwm.html

  • Alternative Architecture Approaches

    Bill Inmon: Corporate Information Factory

    Hub and Spoke philosophy

    JBOC: just a bunch of cubes

    Let it evolve naturally

  • What We Want (Architectural Principle)

    In most cases, business and IT agree that the data warehouse should provide a single version of the truth

    Any approach that can result in disparate data marts or cubes is undesirable

    This is known as data silos or

  • Enterprise DW Architecture

    how to design an enterprise data warehouse and ensure a single version of the truth?

    according to Kimball: start with an overall data architecture phase

    use Data Warehouse Bus design to integrate multiple data marts

    use incremental approach by building one data mart at a time

  • Data Warehouse Bus Architecture

    named for the bus in a computer

    standard interface that allows you to plug in a CD-ROM, disk drive, etc.

    these peripherals work together smoothly

    provides framework for data marts to fit together

    allows separate data marts to be implemented by different groups, even at different times

  • Data Mart Definition

    data mart is a complete subset of the overall data warehouse

    a single business process OR

    a group of related business processes

    think of a data mart as a collection of related fact tables sharing conformed dimensions, aka a fact constellation

  • Designing The DW Bus

    determine which dimensions will be shared across multiple data marts

    conform the shared dimensions

    produce a master suite of shared dimensions

    determine which facts will be shared across data marts

    conform the facts

    standardize the definitions of facts

  • Dimension Granularity

    conformed dimensions will usually be granular

    makes it easy to integrate with various base level fact tables

    easy to extend fact table by adding new facts

    no need to drop or reload fact tables, and no keys have to be changed

  • Conforming Dimensions

    by adhering to standards, the separate data marts can be plugged together

    e.g. customer, product, time

    they can even share data usefully, for example in a drill across report

    ensures reports or queries from different data marts share the same context
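
    A drill-across report can be sketched as follows: each fact table is queried separately, grouped by an attribute of the shared (conformed) dimension, and the result sets are then merged on that attribute. The tables fact_sales, fact_returns, and dim_product and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed dimension shared by two data marts (hypothetical names).
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales   (product_key INTEGER, amount REAL);
CREATE TABLE fact_returns (product_key INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 80.0);
INSERT INTO fact_returns VALUES (1, 30.0);
""")

KNOWN_FACT_TABLES = {"fact_sales", "fact_returns"}


def totals_by_product(connection, fact_table):
    """Aggregate one fact table by the conformed product dimension."""
    # The table name is interpolated into SQL, so only accept known names.
    if fact_table not in KNOWN_FACT_TABLES:
        raise ValueError(f"unknown fact table: {fact_table}")
    rows = connection.execute(
        f"SELECT p.product_name, SUM(f.amount) "
        f"FROM {fact_table} AS f "
        f"JOIN dim_product AS p ON p.product_key = f.product_key "
        f"GROUP BY p.product_name"
    ).fetchall()
    return dict(rows)


def drill_across(connection):
    """Merge per-mart results on the shared dimension attribute."""
    sales = totals_by_product(connection, "fact_sales")
    returns = totals_by_product(connection, "fact_returns")
    products = sorted(set(sales) | set(returns))
    return [(name, sales.get(name, 0.0), returns.get(name, 0.0))
            for name in products]


report = drill_across(conn)
```

    The merge only works because both marts use the same product keys and names; with non-conformed dimensions the row headers would not line up.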

  • Conforming Dimensions (cont'd)

    accomplish this by adding any dimension attribute(s) needed in any data mart(s) to the standard dimension definition

    attributes not needed everywhere can always be ignored

    typically harder to determine how to load conformed dimensions than to design them initially

    need a single integrated ETL process

    what is the source System of Record (SOR) for each attribute?

    how do we deal with attributes for which there is more than one possible SOR?
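
    One way to handle attributes with more than one possible SOR is a per-attribute priority list, as in the sketch below. The source system names (crm, billing), the attributes, and the "first non-empty value wins" rule are assumptions chosen for illustration; other survivorship rules are equally plausible.

```python
# Hypothetical mapping: for each conformed attribute, the candidate
# systems of record in priority order.
SYSTEM_OF_RECORD = {
    "customer_name": ["crm"],
    "email": ["crm", "billing"],
    "credit_limit": ["billing"],
}


def build_conformed_row(source_rows):
    """Build one conformed dimension row from per-system source rows.

    source_rows maps a source system name to that system's attribute
    values for the same customer. For each attribute, the first system
    in the priority list that has a non-empty value supplies it.
    """
    row = {}
    for attribute, systems in SYSTEM_OF_RECORD.items():
        row[attribute] = None
        for system in systems:
            value = source_rows.get(system, {}).get(attribute)
            if value not in (None, ""):
                row[attribute] = value
                break
    return row


conformed = build_conformed_row({
    "crm": {"customer_name": "Acme", "email": ""},
    "billing": {"customer_name": "ACME INC", "email": "ap@acme.example",
                "credit_limit": 5000},
})
```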

  • Data Consolidation

    a current trend in BI/DW is data consolidation

    from a software vendor perspective, it is tempting to simplify this:

    we can keep all the tables for all your disparate applications in one physical database

  • Data Integration

    To truly achieve a single version of the truth, we must do more than simply consolidate application databases

    Must integrate data models and establish common terms of reference

  • Best Practice #4

    Take an approach that consolidates data into a single version of the truth

    Data Warehouse Bus

    conformed dimensions

    OR?

  • Operational Data Store (ODS)

    a single point of integration for disparate operational systems

    contains integrated data at the most detailed level (transactional)

    may be loaded in near real time or periodically

    can be used for centralized operational reporting

  • Role of an ODS in DW Architecture

    In the case where an ODS is a necessary component of the overall DW, it should be carefully integrated into the overall architecture

    Can also be used for:

    Staging area

    Master/reference data management

    Etc

  • ODS Data Model

    Not clear if any design approach for an ODS data model has emerged as a best practice

    normalized

    dimensional

    denormalized/hybrid

    any suggestions?

  • Best Practice #5

    Consider implementing an ODS only when information retrieval requirements are near the bottom of the data abstraction pyramid and/or when there are multiple operational sources that need to be accessed

    Must ensure that the data model is integrated, not just consolidated

    May consider 3NF data model

  • Capacity Planning

    DW workloads are typically very demanding, especially for I/O capacity

    Successful implementations tend to grow very quickly, both in number of users and data volume

    Rules of thumb do exist for sizing the hardware platform to provide adequate initial performance

    typically based on estimated raw data size of proposed database

    e.g. 100-150 GB per modern CPU
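
    The rule of thumb translates into simple arithmetic. The helper below is a sketch that assumes the 100-150 figure refers to gigabytes of raw data per CPU; the function name and defaults are hypothetical, and the result is only a starting point for initial sizing.

```python
import math


def estimate_cpu_range(raw_data_gb, gb_per_cpu_low=100, gb_per_cpu_high=150):
    """Return (fewest, most) CPUs suggested by the per-CPU rule of thumb."""
    if raw_data_gb <= 0:
        raise ValueError("raw_data_gb must be positive")
    # Each CPU handling more data means fewer CPUs, and vice versa.
    fewest = math.ceil(raw_data_gb / gb_per_cpu_high)
    most = math.ceil(raw_data_gb / gb_per_cpu_low)
    return fewest, most


# Example: a proposed database with 1,200 GB of raw data.
cpu_range = estimate_cpu_range(1200)
```

    For 1,200 GB this gives a range of 8 to 12 CPUs (1200 / 150 = 8 and 1200 / 100 = 12).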

  • Best Practice #6

    Create a capacity plan for your BI application & monitor it carefully

    Consider future additional performance demands

    Establish standard performance benchmark queries and regularly run them

    Implement capacity monitoring tools

    Build scalability into your architecture
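
    Standard benchmark queries can be run on a schedule and compared with a stored baseline. The sketch below uses an in-memory SQLite table as a stand-in for the warehouse; the query names, the fact_sales table, and the 1.5x tolerance are assumptions for illustration.

```python
import sqlite3
import time

# Hypothetical benchmark suite: name -> SQL text.
BENCHMARK_QUERIES = {
    "total_sales": "SELECT SUM(amount) FROM fact_sales",
    "sales_by_region": "SELECT region, SUM(amount) FROM fact_sales GROUP BY region",
}


def run_benchmarks(connection, queries, repeats=3):
    """Run each query several times and keep the best elapsed time in seconds."""
    results = {}
    for name, sql in queries.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            connection.execute(sql).fetchall()
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)
    return results


def regressions(current, baseline, tolerance=1.5):
    """Names of queries now slower than tolerance times their baseline."""
    return sorted(
        name for name, seconds in current.items()
        if name in baseline and seconds > baseline[name] * tolerance
    )


# Stand-in warehouse table with a little sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [("north" if i % 2 else "south", float(i)) for i in range(1000)],
)
timings = run_benchmarks(conn, BENCHMARK_QUERIES)
```

    Storing each run's timings and flagging regressions against the baseline gives early warning as users and data volume grow.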

  • Open Source Affordability

    Another emerging trend in IT generally is to utilize Open Source software running on commodity hardware

    this is expected to offer lower total cost of ownership

    certainly, GNU/Linux and other Open Source initiatives do provide very good functionality and quality for minimal cost

    This trend also applies to BI & DW:

    most traditional RDBMSs are now supported on Linux

    however, open source RDBMSs lag behind on providing good performance for DW queries

  • DW Appliances

    DW appliances, consisting of packaged solutions providing all required software and hardware, are beginning to offer very promising price/performance

    production experience is limited so far, so this is not yet a best practice

  • Q & A

  • Thank You

    For suggestions and queries, email [email protected]