Government Polytechnic Lohaghat (Champawat)

(Branch- Information Technology VI Semester)

Subject : Data Warehouse & Mining

KIRAN CHANDRA, LECT ( IT)

UNIT 2 Introduction to Data warehouse

DATA WAREHOUSE

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. The data are stored to provide information from a historical perspective (such as from the past 5-10 years) and are typically summarized. For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region.
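The summarization described above can be sketched in a few lines of Python. This is only an illustration: the transaction rows, the store and region names, and the helper `summarize` are hypothetical and do not come from any particular warehouse product.

```python
from collections import defaultdict

# Hypothetical detailed sales transactions: (region, store, item_type, amount)
transactions = [
    ("North", "Store A", "Electronics", 1200.0),
    ("North", "Store A", "Electronics", 800.0),
    ("North", "Store A", "Grocery", 150.0),
    ("North", "Store B", "Grocery", 250.0),
    ("South", "Store C", "Electronics", 500.0),
]

def summarize(rows, key_fn):
    """Add up the amount of every row that shares the same key."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row[3]
    return dict(totals)

# Summary of the transactions per item type for each store.
per_store_item = summarize(transactions, lambda r: (r[1], r[2]))

# Summarized to a higher level: per item type for each sales region.
per_region_item = summarize(transactions, lambda r: (r[0], r[2]))

print(per_store_item)
print(per_region_item)
```

The warehouse would then keep the two summary tables instead of the five detailed rows.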

Fig. framework of a data warehouse

Why do we need a data warehouse?

Data warehouses are used extensively in the largest and most complex businesses around the world. In demanding situations, good decision making becomes critical. Significant and relevant data is required to make decisions. This is possible only with the help of a well-designed data warehouse.

Enhancing the turnaround time for analysis and reporting: A data warehouse allows business users to access critical data from a single source, enabling them to make quick decisions. They need not waste time retrieving data from multiple sources.

The business executives can query the data themselves with minimal or no support from IT which in turn saves money and time.

Improved Business Intelligence: A data warehouse helps managers and business executives achieve their vision. Outcomes that affect the strategy and procedures of an organization will be based on reliable facts and supported with evidence and organizational data.

Benefit of historical data: Transactional systems store data on a day-to-day basis or for a very short duration, without historical data. In comparison, a data warehouse stores large amounts of historical data, which enables the business to perform time-period analysis, trend analysis, and trend forecasts.

Standardization of data: The data from heterogeneous sources are available in a single format in a data warehouse. This simplifies the readability and accessibility of data.

For example, gender is denoted as Male/Female in Source 1 and m/f in Source 2, but in a data warehouse the gender is stored in a single common format, i.e. M/F.
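A minimal sketch of this standardization is given below. The mapping table `GENDER_MAP` and the function `standardize_gender` are hypothetical names chosen for illustration; a real warehouse would define its own mapping rules.

```python
# Hypothetical mapping from source-specific codes to the warehouse code.
GENDER_MAP = {"male": "M", "m": "M", "female": "F", "f": "F"}

def standardize_gender(value):
    """Return 'M' or 'F'; raise ValueError for codes the mapping does not know."""
    key = value.strip().lower()
    if key not in GENDER_MAP:
        raise ValueError(f"unrecognized gender code: {value!r}")
    return GENDER_MAP[key]

source_1 = ["Male", "Female"]   # format used by Source 1
source_2 = ["m", "f"]           # format used by Source 2

warehouse_values = [standardize_gender(v) for v in source_1 + source_2]
print(warehouse_values)
```

Raising an error for unknown codes, rather than guessing, keeps bad values out of the warehouse until someone decides how to map them.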

Immense ROI (Return On Investment): Return on investment refers to the additional revenue or reduced expenses a business will be able to realize from a project.
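The notes do not give a formula for ROI. A commonly used one, assumed here, expresses the net gain (gain minus cost) as a percentage of the cost. The figures below are hypothetical and only show the arithmetic.

```python
def roi_percent(total_gain, cost):
    """ROI as a percentage: net gain (gain minus cost) divided by cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (total_gain - cost) / cost * 100.0

# Hypothetical project: costs Rs 20 lakh and yields Rs 30 lakh
# in additional revenue and reduced expenses.
example = roi_percent(total_gain=3_000_000, cost=2_000_000)
print(f"ROI = {example:.1f}%")
```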

What are the components of a Data warehouse?

The components of a data warehouse are depicted in the figure below

Data Sources

• A flat file database stores data in plain text format. In contrast to a relational database, where the data is stored in tables, the data in a flat file has no folders or paths associated with it. No manipulations are performed on the data. Delimiters are used in flat files to separate the data columns.

• Excel spreadsheets are regularly used in data warehousing operations. They are impressive, low-priced, and flexible tools that many decision-makers find convenient to use. Excel also provides graphing features that allow the end-user to present the required data in chart and graph formats. These formats can be easily integrated into MS Word and PowerPoint presentations.

• Operational systems of a business contain its day-to-day transactions at a low level of detail. For example, sales data, HR data, and marketing data are used as input sources for a data warehouse.

• Legacy systems are the applications of yesteryear. They mirror the requirements of a business that might be twenty to twenty-five years old. They are still in use today because, over the years, these systems have captured business knowledge and rules that are exceptionally difficult to translate to a new platform or application.

Staging Area

• The first part of the staging area is the process of extraction. Depending on how accurately the data is extracted the subsequent operations succeed or fail. The data may be extracted not only once but also periodically when changes occur at the source side.

• The second stage is the transformation where the data is converted from one format to another. Since data often exists in different locations and formats across the enterprises, data conversion is mandatory to ensure that data from one application is comprehensible to other applications and databases.

• The third stage is the loading where the extracted and transformed data is loaded into a data mart or a data warehouse depending on the business.
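The three stages of the staging area can be sketched with Python's standard `csv` and `sqlite3` modules. The flat-file contents, the `|` delimiter, the `sales` table, and the function names are all assumptions made for this example; an in-memory SQLite database stands in for the data mart or data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source: columns separated by a '|' delimiter.
FLAT_FILE = "order_id|sale_date|amount\n1|05/01/2024|100.50\n2|06/01/2024|200.00\n"

def extract(text):
    """Extraction: read the delimited rows from the source."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))

def transform(rows):
    """Transformation: convert DD/MM/YYYY dates to ISO format and text to numbers."""
    result = []
    for row in rows:
        day, month, year = row["sale_date"].split("/")
        result.append((int(row["order_id"]), f"{year}-{month}-{day}", float(row["amount"])))
    return result

def load(conn, rows):
    """Loading: insert the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, sale_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
load(conn, transform(extract(FLAT_FILE)))
loaded = conn.execute(
    "SELECT order_id, sale_date, amount FROM sales ORDER BY order_id"
).fetchall()
print(loaded)
```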

Data Repository

The data is loaded into a data warehouse in the form of facts and dimensions

Users

The loaded data is accessed for reporting, analysis, and mining. Reporting tools such as Business Objects and Cognos are used by users to generate reports. The data is also used for predicting trends.

Characteristics and Functions of Data warehouse

A data warehouse is easier to manage when its users share a common way of describing the trends they study, organized by specific subject. The major characteristics of a data warehouse are listed below:





1. Subject-oriented – The warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales). A data warehouse is designed to support decision making rather than application-oriented data processing.

2. Integrated – Source data comes together from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data must be made consistent to present a unified view of the data to the users.

3. Time-variant – Data in the warehouse is only accurate and valid at some point in time or over some time interval.

4. Non-volatile – The data is not updated in real time but is refreshed on a regular basis from different data sources. New data is always added as a supplement to the database, rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.

Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities −

• Data Extraction − Involves gathering data from multiple heterogeneous sources.

• Data Cleaning − Involves finding and correcting the errors in data.






• Data Transformation − Involves converting the data from legacy format to warehouse format.

• Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

• Refreshing − Involves updating from data sources to warehouse.

Process Flow in Data Warehouse

There are four major processes that contribute to a data warehouse −

• Extract and load the data.

• Clean and transform the data.

• Back up and archive the data.

• Manage queries and direct them to the appropriate data sources.

Extract and Load Process

Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.

Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.





Controlling the Process

Controlling the process involves determining when to start data extraction and when to run the consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time.

When to Initiate Extract

Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.
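One simple way to guard against this situation, sketched below under the assumption that each source records the cut-off time of its extract, is to compare the cut-off times before merging. The source names and the function `is_consistent_snapshot` are hypothetical.

```python
from datetime import datetime

def is_consistent_snapshot(cutoffs):
    """True when every source was extracted up to the same cut-off time."""
    return len(set(cutoffs.values())) == 1

# Hypothetical cut-off times that mirror the example in the text.
mismatched = {
    "customers": datetime(2024, 1, 10, 20, 0),            # Wednesday, 8 pm
    "subscription_events": datetime(2024, 1, 9, 20, 0),   # Tuesday, 8 pm
}
aligned = {
    "customers": datetime(2024, 1, 10, 20, 0),
    "subscription_events": datetime(2024, 1, 10, 20, 0),
}

print(is_consistent_snapshot(mismatched))
print(is_consistent_snapshot(aligned))
```

A controlling process could refuse to start the merge until this check passes.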

Loading the Data

After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent.

Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process

Once the data is extracted and loaded into the temporary data store, it is time to perform Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming −

• Clean and transform the loaded data into a structure

• Partition the data

• Aggregation

Clean and Transform the Loaded Data into a Structure

Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent −

• within itself.

• with other data within the same data source.

• with the data in other source systems.

• with the existing data present in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data increases the query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs.

Partition the Data

Partitioning optimizes the hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.
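As a rough illustration, a fact table can be split by month so that a query for one month scans only that partition. The fact rows, the choice of month as the partitioning key, and the function `partition_by_month` are assumptions for this sketch; real warehouses usually rely on the partitioning features of the database itself.

```python
from collections import defaultdict

# Hypothetical fact rows: (sale_date in ISO format, store_id, amount)
fact_sales = [
    ("2024-01-15", 1, 100.0),
    ("2024-01-20", 2, 50.0),
    ("2024-02-03", 1, 75.0),
]

def partition_by_month(rows):
    """Group fact rows into separate partitions keyed by 'YYYY-MM'."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0][:7]].append(row)
    return dict(partitions)

partitions = partition_by_month(fact_sales)

# A query for February only needs to scan the February partition.
february_total = sum(amount for _, _, amount in partitions["2024-02"])
print(sorted(partitions), february_total)
```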

Aggregation

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.

Backup and Archive the Data

In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years with the latest 6 months of data being kept online. In such a scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.
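The retail example can be worked through in code. The sketch below assumes a hypothetical current month of June 2024 and interprets "month-on-month comparison" as comparing each online month with the same month of the previous year; the function `months_back` is an illustrative helper.

```python
def months_back(year, month, count):
    """Return the `count` most recent (year, month) pairs, ending at (year, month)."""
    result = []
    for _ in range(count):
        result.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return result

ONLINE_MONTHS = 6      # latest 6 months kept online (from the example)
RETAINED_MONTHS = 36   # data kept for 3 years (from the example)

current = (2024, 6)    # hypothetical current month
online = set(months_back(*current, ONLINE_MONTHS))
retained = set(months_back(*current, RETAINED_MONTHS))

# Each online month is compared with the same month of last year.
needed = online | {(y - 1, m) for (y, m) in online}

# Months that are needed, not online, but still held in the archive.
to_restore = sorted((needed - online) & retained)
print(to_restore)
```

Under these assumptions, the six months January to June 2023 have to be restored from the archive.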

Query Management Process

This process performs the following functions −

• manages the queries.

• helps speed up the execution time of queries.

• directs the queries to their most effective data sources.

• ensures that all the system sources are used in the most effective way.

• monitors actual query profiles.

The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into data warehouse.





Online Transaction Processing (OLTP) and Online Analytic Processing (OLAP)

Online Transaction Processing (OLTP): OLTP databases are meant to handle many small transactions, and usually serve as a “single source of storage”. An example of an OLTP system is an online movie ticket booking website. Suppose two people try to book the same seat for the same show at the same time; whoever completes the transaction first gets the ticket. The key point is that OLTP systems are designed for transactional priority rather than data analysis.
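The seat-booking behaviour can be sketched with Python's standard `sqlite3` module. The `bookings` table, its columns, the customer names, and the function `book_seat` are hypothetical; the idea is only that a uniqueness constraint lets the first completed transaction win and rejects the second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key guarantees that a seat for a given show can be booked only once.
conn.execute("""
    CREATE TABLE bookings (
        show_id  INTEGER,
        seat_no  TEXT,
        customer TEXT,
        PRIMARY KEY (show_id, seat_no)
    )
""")

def book_seat(conn, show_id, seat_no, customer):
    """Try to book a seat in one short transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO bookings VALUES (?, ?, ?)", (show_id, seat_no, customer)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # someone else already holds this seat

first = book_seat(conn, 1, "A10", "Asha")
second = book_seat(conn, 1, "A10", "Ravi")
print(first, second)
```

Each booking is a short INSERT, which is typical of the small transactions an OLTP system is built for.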

Figure – Pictorial Representation of OLTP

Benefits of using OLTP services:

• The main benefit of OLTP services is that they respond to user actions immediately, because they can process queries very quickly.

• OLTP services allow users to perform operations such as reading, writing, and deleting data quickly.

Drawbacks of OLTP services:

• The major problem with OLTP services is that they are not fail-safe. If there is a hardware failure, online transactions are affected.

• OLTP allows users to access and change the data at the same time, which can cause unexpected situations.

Online Analytic Processing (OLAP): OLAP databases, on the other hand, are better suited to analytics and data mining. They serve fewer queries, but the queries are usually bigger (they operate on more data). We can say that any data warehouse system is an OLAP system. Many companies compare their sales for the current month with the previous month to keep track of the business. Here the company compares the sales and keeps the result in another location, which is a separate database; this is where the company uses an OLAP database.
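The month-to-month sales comparison mentioned above reduces to a small calculation once the monthly totals are available. The totals and the function `month_over_month_change` below are hypothetical.

```python
# Hypothetical monthly sales totals, already summarized in the warehouse.
monthly_sales = {"2024-01": 120000.0, "2024-02": 150000.0}

def month_over_month_change(sales, previous, current):
    """Percentage change of the `current` month relative to the `previous` month."""
    return (sales[current] - sales[previous]) / sales[previous] * 100.0

change = month_over_month_change(monthly_sales, "2024-01", "2024-02")
print(f"{change:.1f}%")
```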

Figure – Pictorial Representation of OLAP

Benefits of using OLAP services:

• The main benefit of OLAP services is that they help keep track of consistency and calculation.

• OLAP builds one single platform where we can store planning, analysis and budgeting for business analytics.

• With OLAP as a service, we can easily apply security restrictions to protect data.

Drawbacks of OLAP services:

• The major problem with OLAP services is that they always need IT professionals to handle the data, because OLAP tools require a complicated modeling procedure.

• As mentioned in the benefits above, OLAP can serve as a single platform for planning, analysis and budgeting, but this requires the help of different departments at the same time; that is, OLAP tools need cooperation between people of various departments, which leads to a dependency problem.

The key differences between OLTP and OLAP databases:

| OLTP | OLAP / Data Warehouse |
| --- | --- |
| OLTP is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). | OLAP is characterized by a relatively low volume of transactions. |
| OLTP queries are simple and easy to understand. | OLAP queries are often very complex and involve aggregations. |
| OLTP is widely used for small transactions. | OLAP applications are widely used by data mining techniques. |
| OLTP is highly normalized. | OLAP is typically de-normalized. |
| OLTP data is backed up religiously. | OLAP data is backed up regularly. |
| The schema used to store transactional databases is usually the entity model (usually 3NF). | OLAP uses the star model to store the data. |
| Performance of OLTP is fast compared to OLAP. | Performance of OLAP is slow compared to OLTP. |

Architectural Framework of a Data Warehouse





Operational Source Systems

• Operational systems are used to process the everyday transactions of an organization

• The operational systems are designed in such a way that the transactions occur smoothly and the data integrity is maintained efficiently

• The operational systems have very fast insert/update since minimal data is affected each time a transaction occurs

• In order to improve performance the old data is purged systematically

Data Staging Area

ETL - Extraction, Transformation and Loading.

Extraction

• The extraction methods in a data warehouse depend on the performance of the source system and the demands of the business.

• Full extraction is applied when the data is required to be retrieved and loaded the first time. Hence, this extraction represents the current data available in the source system

• Incremental extraction is a process where the differences in the source data since the last extraction are captured. Only the changes will be loaded based on the last changed timestamp

• Online extraction is a process where the data is extracted from the source system directly

• Offline extraction is a process of extraction where the source system is emptied into a flat file outside of the source. This flat file is used to extract the data
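The difference between full and incremental extraction listed above can be sketched as follows. The source rows, the `last_changed` field, and both function names are assumptions for illustration; the only idea taken from the text is that incremental extraction keeps the rows changed since the last extraction, based on a last-changed timestamp.

```python
from datetime import datetime

# Hypothetical source rows, each carrying a last-changed timestamp.
source_rows = [
    {"id": 1, "name": "Asha", "last_changed": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "name": "Ravi", "last_changed": datetime(2024, 1, 3, 14, 30)},
    {"id": 3, "name": "Meena", "last_changed": datetime(2024, 1, 5, 8, 15)},
]

def full_extract(rows):
    """Full extraction: take everything currently in the source."""
    return list(rows)

def incremental_extract(rows, last_extracted_at):
    """Incremental extraction: only rows changed after the previous extraction."""
    return [r for r in rows if r["last_changed"] > last_extracted_at]

last_run = datetime(2024, 1, 2, 0, 0)  # hypothetical time of the previous extraction
changed_ids = [r["id"] for r in incremental_extract(source_rows, last_run)]
print(len(full_extract(source_rows)), changed_ids)
```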

Transformation

• The data is transformed based on the transformation rules provided by the business. The data is converted to a standard format and common semantics

• Data cleansing is the process of detecting and correcting discrepant data in a database or table. Data cleansing also involves the synchronization of data, for example the conversion of Male/Female to M/F

Loading

• Once the data is cleansed and transformed into a structure consistent with the data warehouse requirements, the data is qualified to be loaded into a data warehouse

• Populating the data into the tables present in a data warehouse and verifying if the data is ready for use is the first step of loading

• After loading the facts and dimensions a DBA should check for referential integrity i.e. each record from the fact table should be related to a dimension record
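The referential integrity check described in the last point can be expressed as a query that looks for fact records with no matching dimension record. The sketch below uses Python's standard `sqlite3` module; the table names `fact_sales` and `dim_product` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Notebook');
    INSERT INTO fact_sales VALUES (10, 1, 5.0), (11, 2, 40.0), (12, 3, 15.0);
""")

# Fact rows whose product_id has no matching dimension row ("orphans").
orphans = conn.execute("""
    SELECT f.sale_id, f.product_id
    FROM fact_sales AS f
    LEFT JOIN dim_product AS d ON d.product_id = f.product_id
    WHERE d.product_id IS NULL
""").fetchall()
print(orphans)
```

An empty result means every fact record is related to a dimension record; here sale 12 refers to a product that does not exist in the dimension table.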





Data Presentation Area

• The presentation area represents a collection of data marts. A data mart is a subset of a data warehouse

• Data marts are preferred for smaller data volumes and fewer data sources. They enable an easier data cleaning process

• Dependent data marts retrieve data from a central data warehouse whereas the independent data marts are standalone systems that extract data directly from the operational systems or external sources

Data Access Tools

• Business Intelligence tools are used for accessing the data for strategic, operational, and analytical purposes

• Senior executives and managers access the data warehouse for taking critical decisions. They devise strategies and observe the business performance

E.g. Balanced Scorecards

• Operational managers execute the details of the strategies against the targets.

E.g. Sales Forecasts

• Analytical operations are performed by analysts to evaluate the outcomes of a business process and understand the functioning of the business

E.g. Financial and Sales Analysis





Data Warehouse Three-tier Architecture

1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.





2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.

3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
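To illustrate what a ROLAP server in the middle tier does, the sketch below expresses two multidimensional operations, a roll-up and a slice, as ordinary relational queries. It uses Python's standard `sqlite3` module as a stand-in for the relational DBMS; the `sales` table and its data are hypothetical, and a real ROLAP server generates such queries internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'Pen', 'Q1', 100), ('North', 'Pen', 'Q2', 150),
        ('North', 'Notebook', 'Q1', 200), ('South', 'Pen', 'Q1', 50);
""")

# Roll-up: aggregate away the product and quarter dimensions, keeping region.
rollup = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Slice: fix the quarter dimension at 'Q1', then summarize by product.
q1_slice = conn.execute(
    "SELECT product, SUM(amount) FROM sales WHERE quarter = ? "
    "GROUP BY product ORDER BY product",
    ("Q1",),
).fetchall()

print(rollup)
print(q1_slice)
```

A MOLAP server would instead keep the data in a multidimensional array structure and answer the same questions without translating them to SQL.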

Benefits of Data Warehousing

The successful implementation of a data warehouse can bring major benefits to an organization, including:

• Potential high returns on investment

Implementation of data warehousing by an organization requires a huge investment, typically from Rs 10 lakh to Rs 50 lakh. However, a study by the International Data Corporation (IDC) in 1996 reported that average three-year returns on investment (ROI) in data warehousing reached 401%.

• Competitive advantage

The huge returns on investment for those companies that have successfully implemented a data warehouse is evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.

• Increased productivity of corporate decision-makers

Data warehousing improves the productivity of corporate decision-makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows business managers to perform more substantive, accurate, and consistent analysis.

• More cost-effective decision-making

Data warehousing helps to reduce the overall cost of the product by reducing the number of channels.

• Better enterprise intelligence

It helps to provide better enterprise intelligence.

• Enhanced customer service

It is used to enhance customer service.





The need for a data warehouse is illustrated in the figure.

Problems of Data Warehousing

The problems associated with developing and managing a data warehouse are as follows:

Underestimation of resources of data loading

Sometimes we underestimate the time required to extract, clean, and load the data into the warehouse. It may take a significant proportion of the total development time, although some tools are available to reduce the time and effort spent on this process.

Hidden problems with source systems

Sometimes hidden problems associated with the source systems feeding the data warehouse are identified only after years of being undetected. For example, when entering the details of a new property, certain fields may allow nulls, which may result in staff entering incomplete property data even when the data is available and applicable.

Required data not captured

In some cases, data that is very important for the data warehouse is not captured by the source systems. For example, the date of registration for a property may not be used in the source system, but it may be very important for analysis purposes.

Increased end-user demands

After satisfying some of the end-users' queries, requests for support from staff may increase rather than decrease. This is caused by an increasing awareness among users of the capabilities and value of the data warehouse. Another reason for increasing demands is that once a data warehouse is online, the number of users and queries often increases, together with requests for answers to more and more complex queries.

Data homogenization

The concept of a data warehouse relies on making data formats similar across different data sources. This homogenization can result in the loss of some important value of the data.

High demand for resources

A data warehouse holds large amounts of data and therefore requires substantial storage resources.

Data ownership

Data warehousing may change the attitude of end-users to the ownership of data. Sensitive data that is owned by one department has to be loaded into the data warehouse for decision-making purposes. Sometimes that department is reluctant, because it may hesitate to share the data with others.

High maintenance

Data warehouses are high-maintenance systems. Any reorganization of the business processes and the source systems may affect the data warehouse, which results in high maintenance costs.

Long-duration projects

Building a warehouse can take up to three years, which is why some organizations are reluctant to invest in a data warehouse. Sometimes only the historical data of a particular department is captured, resulting in data marts. Data marts support only the requirements of a particular department and limit the functionality to that department or area.

Complexity of integration

The most important area for the management of a data warehouse is the integration capabilities. An organization must spend a significant amount of time determining how well the various different data warehousing tools can be integrated into the overall solution that is needed. This can be a very difficult task, as there are a number of tools for every operation of the data warehouse.

What is Data Mart?

A data mart is focused on a single functional area of an organization and contains a subset of the data stored in a data warehouse. A data mart is a condensed version of a data warehouse and is designed for use by a specific department, unit or set of users in an organization, e.g., Marketing, Sales, HR or Finance. It is often controlled by a single department in an organization.

A data mart usually draws data from only a few sources compared to a data warehouse. Data marts are small in size and are more flexible compared to a data warehouse.

Why do we need Data Mart?

• A data mart helps to improve the user's response time because of the reduced volume of data.

• It provides easy access to frequently requested data.

• Data marts are simpler to implement than a corporate data warehouse. At the same time, the cost of implementing a data mart is certainly lower than that of implementing a full data warehouse.

• Compared to a data warehouse, a data mart is agile. If the model changes, a data mart can be rebuilt more quickly because of its smaller size.

• A data mart is defined by a single Subject Matter Expert (SME). In contrast, a data warehouse is defined by interdisciplinary SMEs from a variety of domains. Hence, a data mart is more open to change than a data warehouse.

• Data is partitioned, which allows very granular access control privileges.

• Data can be segmented and stored on different hardware/software platforms.

Type of Data Mart

There are three main types of data marts:

1. Dependent: Dependent data marts are created by drawing data from an existing central data warehouse.

2. Independent: Independent data marts are created without the use of a central data warehouse, by drawing data directly from operational sources, external sources, or both.

3. Hybrid: This type of data mart can take data from data warehouses or operational systems.





Dependent Data Mart

A dependent data mart allows an organization's data to be sourced from a single data warehouse. It offers the benefit of centralization. If you need to develop one or more physical data marts, then you need to configure them as dependent data marts.

Dependent data marts can be built in two different ways: either a user can access both the data mart and the data warehouse, depending on need, or access is limited to the data mart only. The second approach is not optimal, as it produces what is sometimes referred to as a data junkyard. In the data junkyard, all data begins with a common source, but it is scrapped and mostly junked.

Independent Data Mart

An independent data mart is created without the use of central Data warehouse. This kind of Data Mart is an ideal option for smaller groups within an organization.

An independent data mart has neither a relationship with the enterprise data warehouse nor with any other data mart. In Independent data mart, the data is input separately, and its analyses are also performed autonomously.

Implementation of independent data marts is antithetical to the motivation for building a data warehouse. First of all, you need a consistent, centralized store of enterprise data which can be analyzed by multiple users with different interests who want widely varying information.





Hybrid data Mart:

A hybrid data mart combines input from a data warehouse with input from other sources. This could be helpful when you want ad-hoc integration, for example after a new group or product is added to the organization.

It is best suited for multiple database environments and offers a fast implementation turnaround for any organization. It also requires the least data cleansing effort. A hybrid data mart also supports large storage structures, and it is flexible enough for smaller data-centric applications.

Steps in Implementing a Datamart

The significant steps in implementing a data mart are to design the schema, construct the physical storage, populate the data mart with data from source systems, access it to make informed decisions and manage it over time. So, the steps are:

Designing

The design step is the first in the data mart process. This phase covers all of the functions from initiating the request for a data mart through gathering data about the requirements and developing the logical and physical design of the data mart.

It involves the following tasks:

1. Gathering the business and technical requirements

2. Identifying data sources

3. Selecting the appropriate subset of data

4. Designing the logical and physical architecture of the data mart.

Constructing

This step involves creating the physical database and the logical structures associated with the data mart to provide fast and efficient access to the data. It involves the following tasks:

1. Creating the physical database and logical structures, such as tablespaces, associated with the data mart.

2. Creating the schema objects, such as tables and indexes, described in the design step.

3. Determining how best to set up the tables and access structures.

Populating

This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right format and level of detail, and moving it into the data mart. It involves the following tasks:

1. Mapping data sources to target data structures

2. Extracting data

3. Cleansing and transforming the information

4. Loading data into the data mart

5. Creating and storing metadata

Accessing

This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs, and publishing them. It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database operations and object names into business terms so that the end-clients can interact with the data mart using words that relate to the business functions.

2. Set up and manage database structures, such as summarized tables, that help queries submitted through the front-end tools execute rapidly and efficiently.





Managing

This step involves managing the data mart over its lifetime. In this step, the following management functions are performed:

1. Providing secure access to the data.

2. Managing the growth of the data.

3. Optimizing the system for better performance.

4. Ensuring the availability of data even with system failures.

Difference between Data Warehouse and Data Mart

| Data Warehouse | Data Mart |
| --- | --- |
| A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is only a subtype of a data warehouse. It is architected to meet the requirements of a specific user group. |
| It may hold multiple subject areas. | It holds only one subject area, for example Finance or Sales. |
| It holds very detailed information. | It may hold more summarized data. |
| It works to integrate all data sources. | It concentrates on integrating data from a given subject area or set of source systems. |
| In data warehousing, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used. |
| It is a centralized system. | It is a decentralized system. |
| Data warehousing is data-oriented. | Data marts are project-oriented. |
