data warehouse

Data Warehouse Prof Tarvinder Singh

Upload: tarvinder-singh

Post on 13-Jun-2015


DESCRIPTION

A quick view on Data Warehouse methodologies and structure

TRANSCRIPT

Page 1: Data warehouse

Data Warehouse
Prof Tarvinder Singh

Page 2: Data warehouse

Data Warehouse

• The data warehouse is that portion of an overall Architected Data Environment that serves as the single integrated source of data for processing information.

• A data warehouse contains historical data as well as current data.

Page 3: Data warehouse

Relational Database

• It is known as an RDBMS (relational database management system).

• Data is stored as relations.

• Concept of primary and foreign keys

• ACID properties

Page 4: Data warehouse

ACID properties

• ACID properties are an important concept for databases. The acronym stands for Atomicity, Consistency, Isolation, and Durability.

• In the context of databases, a single logical operation on the data is called a transaction. An example of a transaction is a transfer of funds from one account to another, even though it might consist of multiple individual operations (such as debiting one account and crediting another). The ACID properties guarantee that such transactions are processed reliably.

Page 5: Data warehouse

1. ATOMICITY

•Atomicity refers to the ability of the DBMS to guarantee that either all of the tasks of a transaction are performed or none of them are.

•The transfer of funds can be completed or it can fail for a multitude of reasons, but atomicity guarantees that one account won't be debited if the other is not credited as well.
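
The funds transfer above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the accounts table, the transfer helper and the sample balances are hypothetical and do not come from the slides.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                # The credit did not happen, so the debit must not survive either.
                raise ValueError("destination account does not exist")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # both updates are committed together
failed = transfer(conn, 1, 99, 30)  # account 99 is missing, so the debit is rolled back
```

After the failed call, account 1 still holds the balance it had after the first transfer: the debit was undone because the credit could not be applied.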

Page 6: Data warehouse

Consistency

•Consistency refers to the database being in a legal state when the transaction begins and when it ends.

•This means that a transaction can't break the rules, or integrity constraints, of the database. If an integrity constraint states that all accounts must have a positive balance, then any transaction violating this rule will be aborted.
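
The positive-balance rule can be expressed as a CHECK constraint so that the database itself aborts a violating transaction. The sketch below uses Python's sqlite3 module; the table, the withdraw helper and the amounts are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The integrity constraint: every account must keep a positive balance.
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance > 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

def withdraw(conn, account_id, amount):
    """Return True if the withdrawal was committed, False if it was aborted."""
    try:
        with conn:  # rolls back automatically when the constraint fails
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = withdraw(conn, 1, 40)    # leaves a balance of 60, which is legal
rejected = withdraw(conn, 1, 500)   # would leave a negative balance
final_balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1"
).fetchone()[0]
```

The database starts and ends in a legal state: the rejected withdrawal leaves the balance exactly as the accepted one left it.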

Page 7: Data warehouse

ISOLATION

•Isolation refers to the ability of the DBMS to make operations in a transaction appear isolated from all other operations.

• This means no operation outside the transaction can ever see the data in an intermediate state.
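
A rough way to observe isolation is to open two connections to the same SQLite file: one writes without committing, the other reads. This sketch assumes SQLite's default rollback-journal mode and the default transaction handling of Python's sqlite3 module; the file name, table and values are hypothetical.

```python
import os
import sqlite3
import tempfile

def read_balance(conn):
    # fetchall() runs the query to completion, so no read lock is left open.
    return conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchall()[0][0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bank.db")
    writer = sqlite3.connect(path)
    reader = sqlite3.connect(path)

    writer.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    writer.execute("INSERT INTO accounts VALUES (1, 100)")
    writer.commit()

    # The UPDATE opens a transaction on the writer that is not yet committed.
    writer.execute("UPDATE accounts SET balance = 40 WHERE id = 1")
    seen_during = read_balance(reader)  # the reader still sees the old value

    writer.commit()
    seen_after = read_balance(reader)   # the committed value is now visible

    writer.close()
    reader.close()
```

Other database systems offer several isolation levels with different guarantees, so this shows only one possible behavior.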

Page 8: Data warehouse

Durability

•Durability refers to the guarantee that once the user has been notified of success, the transaction will persist and not be undone.
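
As a small illustration, the sketch below commits a row, closes the connection and reopens the database file. Closing and reopening only stands in for a restart; real durability also covers crashes and power loss, which this example does not simulate. The file and table names are hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ledger.db")

    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.execute("INSERT INTO payments (amount) VALUES (250)")
    conn.commit()   # from here on the payment is reported as successful
    conn.close()

    # A new connection stands in for the system coming back up later.
    reopened = sqlite3.connect(path)
    persisted = reopened.execute("SELECT id, amount FROM payments").fetchall()
    reopened.close()
```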

Page 9: Data warehouse

Data Warehouse Characteristics

1. Subject Oriented
2. Integrated
3. Non Volatile
4. Time variant
5. Accessible

Page 10: Data warehouse

1. Subject Oriented

•Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject.

Page 11: Data warehouse

2. Integrated

•A single source of information about multiple areas of interest.

• The data warehouse provides one-stop shopping and contains information about a variety of subjects.

•Thus the University data warehouse has information on students, faculty and staff, instructional workload, and student outcomes.

Page 12: Data warehouse

3. Non Volatile

•Stable information that doesn’t change each time an operational process is executed.

•Information is consistent regardless of when the warehouse is accessed.

Page 13: Data warehouse

4. Time variant

•Containing a history of the subject, as well as current information.

•Historical information is an important component of a data warehouse.
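
One common way to keep history next to current information is to store dated snapshots and never overwrite them. The sketch below assumes a hypothetical enrollment_history table with invented figures, and a helper that answers "what was the value as of a given date".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment_history ("
    " snapshot_date TEXT, program TEXT, students INTEGER)"
)
# Each load appends a new snapshot; older rows are kept unchanged.
conn.executemany(
    "INSERT INTO enrollment_history VALUES (?, ?, ?)",
    [
        ("2013-09-01", "MBA", 120),
        ("2014-09-01", "MBA", 135),
        ("2015-09-01", "MBA", 150),
    ],
)

def students_as_of(conn, program, date):
    """Return the most recent snapshot on or before `date`, or None."""
    row = conn.execute(
        "SELECT students FROM enrollment_history"
        " WHERE program = ? AND snapshot_date <= ?"
        " ORDER BY snapshot_date DESC LIMIT 1",
        (program, date),
    ).fetchone()
    return row[0] if row else None

current = students_as_of(conn, "MBA", "2015-12-31")
past = students_as_of(conn, "MBA", "2014-01-15")
before_history = students_as_of(conn, "MBA", "2012-01-01")
```

ISO-formatted date strings sort in chronological order, which is why a plain text comparison is enough here.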

Page 14: Data warehouse

5. Accessible

•The primary purpose of a data warehouse is to provide readily accessible information to end-users.

Page 15: Data warehouse

Some definitions

• Data Warehouse: A data structure that is optimized for distribution. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. It may also provide end-user access to support enterprise views of data.

• Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers.

• Staging Area: Any data store that is designed primarily to receive data into a warehousing environment.

Page 16: Data warehouse

•Operational Data Store: A collection of data that addresses operational needs of various operational units. It is not a component of a data warehousing architecture, but a solution to operational needs.

•OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs.

•Multidimensional Analysis: The ability to manipulate information by a variety of relevant categories or “dimensions” to facilitate analysis and understanding of the underlying data. It is also sometimes referred to as “drilling-down”, “drilling-across” and “slicing and dicing”.
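
Multidimensional analysis can be illustrated without any OLAP product. In the plain-Python sketch below, the sales facts, the dimension names and the helper functions are all hypothetical; it shows a roll-up by one dimension, a drill-down to two dimensions, and a slice on one dimension value.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, region, product, amount)
facts = [
    (2014, "North", "Laptop", 10),
    (2014, "South", "Laptop", 20),
    (2014, "North", "Phone", 5),
    (2015, "North", "Laptop", 15),
    (2015, "South", "Phone", 25),
]
DIMENSIONS = ("year", "region", "product")

def aggregate(rows, by):
    """Sum the amount, grouped by the named dimensions."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def slice_rows(rows, dimension, value):
    """Keep only the facts where `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return [r for r in rows if r[i] == value]

by_year = aggregate(facts, ["year"])                    # roll-up
by_year_region = aggregate(facts, ["year", "region"])   # drill-down
north_by_product = aggregate(
    slice_rows(facts, "region", "North"), ["product"]   # slice, then aggregate
)
```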

Page 17: Data warehouse

•Hypercube: A means of visually representing multidimensional data.

•OLAP Tools: A set of software products that attempt to facilitate multidimensional analysis. They can incorporate data acquisition, data access, data manipulation, or any combination thereof.

Page 18: Data warehouse

Star Schema

Page 19: Data warehouse

Star Schema

• In computing, the star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.

• The main benefit of the star schema is that queries are easy to write.
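
A minimal star schema might look like the sketch below, built with Python's sqlite3 module. The table names (fact_sales, dim_date, dim_product), the columns and the sample rows are assumptions for illustration; the point is that one fact table references each dimension directly, so a typical query needs only one join per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    date_id INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount INTEGER
);

INSERT INTO dim_date VALUES (1, 2014, 'Q4'), (2, 2015, 'Q1');
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'), (2, 'Phone', 'Hardware'), (3, 'Antivirus', 'Software');
INSERT INTO fact_sales VALUES
    (1, 1, 100), (1, 3, 20), (2, 1, 150), (2, 2, 80), (2, 3, 30);
""")

sales_by_year_category = conn.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
```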

Page 20: Data warehouse

Snow Flake Schema

Page 21: Data warehouse

Snow Flake Schema

• In computing, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.

• However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension represented by a single table.

• Star and snowflake schemas are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations.
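
The difference shows up in the product dimension. In the hypothetical snowflake sketch below, the category is moved out of dim_product into its own dim_category table, so the same kind of report needs one more join than it would against a single denormalized dimension table. All table names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized into two related tables.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount INTEGER
);

INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO dim_product VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Antivirus', 2);
INSERT INTO fact_sales VALUES (1, 100), (2, 80), (3, 20), (1, 150);
""")

sales_by_category = conn.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id  -- extra hop
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
```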

Page 22: Data warehouse

OLAP and OLTP

•OLTP vs. OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

Page 23: Data warehouse

OLAP and OLTP

Page 24: Data warehouse

• OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store it is the entity model (usually normalized).

• OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
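
The two workload styles can be contrasted on one small table. This is a sketch only: in practice the analytical query would usually run against a separate warehouse rather than the transactional database, and the orders table, helper names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " order_date TEXT, customer TEXT, amount INTEGER)"
)

def place_order(conn, order_date, customer, amount):
    """OLTP style: one short transaction that writes a single detailed row."""
    with conn:
        cur = conn.execute(
            "INSERT INTO orders (order_date, customer, amount) VALUES (?, ?, ?)",
            (order_date, customer, amount),
        )
    return cur.lastrowid

def revenue_by_month(conn):
    """OLAP style: a read-only query that aggregates over many rows."""
    return conn.execute(
        "SELECT substr(order_date, 1, 7) AS month, COUNT(*), SUM(amount)"
        " FROM orders GROUP BY month ORDER BY month"
    ).fetchall()

order_ids = [
    place_order(conn, "2015-05-03", "alice", 40),
    place_order(conn, "2015-05-17", "bob", 60),
    place_order(conn, "2015-06-01", "alice", 25),
]
monthly = revenue_by_month(conn)
```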

Page 25: Data warehouse

Data Warehouse Approaches

• 1. Bottom up
• 2. Top down
• 3. Hybrid
• 4. Federated

Page 26: Data warehouse

• Bottom-up design

• Built from a series of incremental, architected data marts.

• Data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data.

• These data marts can eventually be unioned together to create a data warehouse.

• An important benefit to the organization is that it allows the project team to develop the skills and techniques required for data warehousing in a much lower risk and exposure environment than a full-scale EDW project.
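
Combining data marts can be sketched as a view over their union. The example assumes, hypothetically, that the two marts were built with the same column layout; without that shared design the union would not line up. Table names and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two data marts built separately but with the same column layout.
CREATE TABLE mart_sales (period TEXT, department TEXT, metric TEXT, value INTEGER);
CREATE TABLE mart_finance (period TEXT, department TEXT, metric TEXT, value INTEGER);

INSERT INTO mart_sales VALUES
    ('2015-Q1', 'Sales', 'revenue', 500), ('2015-Q2', 'Sales', 'revenue', 650);
INSERT INTO mart_finance VALUES ('2015-Q1', 'Finance', 'costs', 300);

-- The warehouse is presented as the union of the marts.
CREATE VIEW warehouse AS
    SELECT 'sales' AS source_mart, period, department, metric, value FROM mart_sales
    UNION ALL
    SELECT 'finance', period, department, metric, value FROM mart_finance;
""")

enterprise_view = conn.execute(
    "SELECT period, metric, SUM(value) FROM warehouse"
    " GROUP BY period, metric ORDER BY period, metric"
).fetchall()
```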

Page 27: Data warehouse

• Bottom up

• Advantages
* Offers low risk and low exposure
* Yields incremental design
* Requires lower level, shorter term
* Can be delivered quickly
* Enables a focused approach
* Provides faster ROI

• Disadvantages
* Requires integrating incremental data marts
* Needs multiple team coordination

Page 28: Data warehouse

• Top-down design

• An EDW is composed of multiple subject areas -- finance, human resources, marketing, sales, and manufacturing. In a top down scenario, the entire EDW is architected, and then a small slice of a subject area is chosen for construction.

• The top down EDW is architected, designed and constructed in an iterative manner.

• The data warehouse serves as a centralized repository for the entire enterprise. It is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse.

• Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.
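
In the top-down direction, a data mart is derived from the atomic data held in the warehouse. The sketch below assumes a hypothetical dw_sales table of atomic rows and builds a monthly summary mart for one department from it; all names and numbers are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Atomic data: one row per sale, kept in the central warehouse.
CREATE TABLE dw_sales (
    sale_id INTEGER PRIMARY KEY,
    sale_date TEXT, department TEXT, amount INTEGER
);
INSERT INTO dw_sales (sale_date, department, amount) VALUES
    ('2015-01-10', 'Marketing', 40),
    ('2015-01-22', 'Marketing', 60),
    ('2015-02-05', 'Marketing', 30),
    ('2015-01-15', 'Sales', 200);

-- A departmental data mart created from the warehouse.
CREATE TABLE mart_marketing_monthly AS
    SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total
    FROM dw_sales
    WHERE department = 'Marketing'
    GROUP BY month;
""")

mart_rows = conn.execute(
    "SELECT month, total FROM mart_marketing_monthly ORDER BY month"
).fetchall()
```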

Page 29: Data warehouse

•Top-down design

•Advantages
* Coordinated environment
* Single point of control & development

•Disadvantages
* Time required to complete project
* Too much time spent analyzing problems
* Difficult to control the project
* Enterprise-wide nature of the project
* Too much risk

Page 30: Data warehouse

• Hybrid Approach

• The hybrid approach tries to blend the best of both “top-down” and “bottom-up” approaches.

• It attempts to capitalize on the speed and user-orientation of the “bottom-up” approach without sacrificing the integration enforced by a data warehouse in a “top down” approach.

• The hybrid approach relies on an ETL tool to store and manage the enterprise and local models in the data marts as well as to synchronize the differences between them.

Page 31: Data warehouse

• Hybrid Approach

• Advantages

• It combines rapid development techniques within an enterprise architecture framework.

• It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once it’s really needed (e.g. when executives start asking for reports that cross data mart boundaries).

• Disadvantages

• Backfilling a data warehouse can be a highly disruptive process.

• Few query tools can dynamically and intelligently query atomic data in one database (i.e. the data warehouse) and summary data in another database (i.e. the data marts). Users may be confused about when to query which database.

• It relies heavily on an ETL tool. Although ETL tools have matured considerably, they can never enforce adherence to architecture.

Page 32: Data warehouse

• Federated Approach

• Is not a methodology or architecture, but a concession to the natural forces that undermine the best laid plans for deploying a perfect system.

• A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions.

• In short, it’s a salve for the soul of the stressed out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of his business users.

Page 33: Data warehouse

• Federated Approach

• Advantage

• Organizations do not need to change their existing architecture to conform to an IT standard.

• Disadvantages

• It is not well documented.

• Without a specific architecture in mind, it may lead to the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end.

• Also, integrating metadata is a very big problem in a heterogeneous, ever-changing environment.

Page 34: Data warehouse

•Summary of DW Methodology

•Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket.