
Building A Data Warehouse on the HP3000: A Practical Example

Cailean Sherman - Taurus Software
Cheryl Grandy - DISC
Jennie Hou - HP

Taurus Software
1032 Elwell Court, Suite 245
Palo Alto, CA 94303
650-961-1323
[email protected]

Dynamic Information Systems Corporation
Boulder, CO 80301
303-444-4000
[email protected]

Hewlett-Packard
19447 Pruneridge Avenue
Cupertino, CA 95014
408-447-5971
[email protected]
408-447-5649
[email protected]


Table of Contents

WHY A DATA WAREHOUSE OR DATA MART?

WHAT IS A DATA WAREHOUSE OR DATA MART?
  Technical goals of the data warehouse
  Technical structure of the data warehouse
    The data
    The data model: Re-architected database - star schema
    The environment

BUILDING A DATA WAREHOUSE STEP BY STEP
  Analyze the business needs and determine value
  Determine the scope of your project
  Build the target model
  Map the data
  Extract operational data and load the target
  Optimize queries
  End user interface and deployment
  Determine the value of the mart

CONCLUSION

Overview

This paper explains the steps involved in building a data warehouse on the HP3000 and how that data can be used for in-depth analysis of a variety of business parameters. It will discuss the nature of data warehouses and data marts, the technology used to build them, and their value to an organization. Because data warehouses can adapt to the needs of any organization, this paper will detail the process one company went through to decide what they wanted in their data mart and their step-by-step implementation.

Why a data warehouse or data mart?

The literature of data warehousing is studded with sound bites that suggest how companies use data warehouses and data marts for competitive advantage. No company in its right mind will give away real secrets, so the examples below are more illustrative than real.

A supermarket found a correlation in its data warehouse that linked sales of beer and diapers on Friday evenings between 5 and 7 PM. By sending observers into stores at that time, they found that dads with infants were doing the shopping. The retailer placed the highest-margin diapers alongside the beer. Sales and profit on diapers soared.

A web retailer of women's clothes sold a Weight Loss Clinic an expensive mailing list of all customers whose dress size had increased since the previous order.

A Managed Care Organization found that 31% of its medical costs came from 8% of its members, and that the MCO was losing money on that group. By focusing on this segment, the MCO put in place programs to return the group to profit.

HOW TO FIND THE VALUE IN A DATA MART


The goal of data warehousing is to increase profitability. Any investment proposal that increases net Earnings Before Interest and Taxes (EBIT) will attract off-budget and immediate funding if the return is large and the risk small. Driving up EBIT can be done by increasing revenues, cutting costs or improving margins, separately or in conjunction.

Cutting Costs

Most companies have cut their costs to the bone by reducing headcount and automating processes. There is usually very little fat left for a data warehouse or mart to trim, with the exception of Inventory Management, if that function is unsophisticated.

Increasing revenue and improving margins can be achieved in a number of ways:

1. Discount analysis: In a direct sales operation, you can carefully monitor which rep gives what discount percentage to which customers and when. Cutting the overall rate of discounting to win business by 1% increases sales in a $100 million company by $1 million. The same amount drops straight to EBIT. If your sales are larger, or your gain is more than 1%, the ROI for a Billings or Bookings data warehouse or mart can be staggering.

2. Customer profitability: You can find out who your best customers are, not only in sales volume, but also in EBIT, if you can factor in informal estimates of the costs of servicing the customer. Maybe you are consciously buying market share by selling to some customers at a loss or at minimal profit. If not, a Sales Analysis warehouse or mart will help identify your most profitable customers. Your sales force can then find more customers like them, and you can drop or re-price the less profitable customers.

Customer lifetime value can reveal that you should spend most of your effort on your base rather than looking for new accounts, with profound implications for your sales organization's productivity.

3. Product profitability: The same analysis can be applied to products.

Most companies sell a wide range of products to one customer with a wide range of individual gross margins. If you could satisfy the customer with a product mix of adequate function at an average 1% higher margin, and do it without reducing revenues, that 1% ($1 million or more) would drop straight to EBIT.

If your customer set and your product set both can be improved in this fashion, then you can adopt both cross-selling and up-selling. The product becomes a weapon for targeting your best customers with new products and higher-margin variants of the products they already buy.

ROI is of little relevance unless you can find a senior line executive who is directly affected by the ROI, and who is willing to go to bat for the data warehouse or data mart. "Directly affected" means that the ROI favorably affects their personal Key Performance Indicators (KPIs). VPs of Sales are keenly interested in anything that promises more sales. VPs of Manufacturing like a firm schedule that makes high margin products. CFOs drool over anything that offers a sniff at improving EBIT.

What is a data warehouse or data mart?

Let's go to the source for this definition. In the words of Ralph Kimball, one of the gurus of data warehousing, "A data warehouse is a powerful database model that significantly enhances managers' ability to quickly analyze large, multidimensional data sets." A data warehouse contains cleansed and organized data that allows managers to make business decisions based on facts, not on intuition.


The "data mart" is collected data in a single subject area, usually extracted from the data warehouse or pulled directly from an operational system.

Experience has been that companies build many marts, and all of these may need similar source data. A company may start with a Billings Mart containing information on sales derived from invoices. Later they may decide to have a Finance Mart containing information on profit derived from the same information.

Why should both marts pull this data directly from the source system? A data warehouse can feed both sets of marts and queries, thereby reducing redundant work.

Technical Goals of the data warehouse or data mart

To provide the information users need to increase profitability, you must meet both of the following goals of the data warehouse or data mart.

To provide easy access to corporate data:

1. The end user access tools must be easy to use. The end user access tools should live on the PC in an environment familiar to the end user. They must have a short learning curve and be highly intuitive. If you have high level managers asking questions of the data warehouse, they must have a simple way to interact with it.

2. The access should be graphic, so it is easy to spot trends. People have been using Microsoft Excel® for years. Spotting trends in a columnar report listing 500 customers' sales for the quarter is a very tedious process. However, if you take the same list and chart it graphically, it is easy for the human eye to spot the winners and losers.

3. The access should be manageable by end users. MIS is no longer the end user of the tools that access the data. End users are analysts, mid-level managers, and in some cases even high-level managers. They must be able to easily get answers to their questions, and ask new questions, all without getting MIS involved.


4. The process must be fast. Questions leapfrog, so you have to get answers fast. The very nature of data analysis is that not all requests are known beforehand, so there is a lot of unpredictable, ad hoc inquiry. In an analytical environment, the end user is directly interacting with the data, rather than looking at a printed report. The answers have to be delivered fast, before users lose their train of thought. If you lose the interest of a user, they won't come back!

To provide data for analysis that is clean and reliable

1. For consistent analysis, the environment must be stable. Now that users can create their own reports, they must have consistent data. One department doing an analysis must get the same results as any other. You can't have two managers showing up at a meeting with different sales figures for last month!

2. Source conflicts must be resolved. Transactional (production) systems often have data stored in several different applications. Customer addresses, for example, may be in the invoicing system, as well as the sales order system. Chase Manhattan Bank may be stored as "Chase" in one system and "Chase Manhattan Bank" in another. A straight load of data for this customer into the data warehouse would result in two customers being represented - and with invoices being listed with no supporting sales orders and vice versa. These conflicts in the source data need to be resolved and merged as the data moves into the analytical environment.

3. Historical analysis must be possible. There are two aspects to historical data that must be considered. The amount of historical data in the warehouse has to be sufficient to satisfy the need to spot trends over time. To see the growth in sales of a product over 3 years, perhaps to help predict sales of future products, you need three years of data. The second historical consideration is the ability to see the state of affairs at a historical point in time. Take, for example, a situation where you want to see what your inventory amounts have been over the last year to help determine if you can carry less inventory and still satisfy demand. You may need to be able to see what your inventory amount was for product X on the 3rd, 4th, and 5th of July. As data is added to the data warehouse, it needs to be added and date stamped - not updated like it would be in the transactional inventory system.

4. Performance of operational systems must not be impaired. It's a given that the Data Warehouse will run on a separate and dedicated system to avoid conflicts with the performance needs of the operational systems. Extracting production data to populate the data warehouse will have some impact on the operational system. Consideration must be given to whether full loads of data are practical, or whether incremental loads are needed, using only changes since the last load.

Structure of the data warehouse

To create a data warehouse, there are several technical requirements involved. Creating data to be analyzed requires that the data be subject-oriented, integrated, time referenced and non-volatile. Making sure that the data can be accessed quickly and can meet the ad hoc queries that users need requires that the data be organized in a new database design, the star or multidimensional data model. And there is a whole new set of systems and tools required to build and maintain this new data environment.

The Data

Subject-oriented

Data warehouses group data by subject rather than by activity. Transactional systems are organized around activities - claim processing, ordering product, shipping product. Data organized by transaction cannot answer questions asked by subject, such as "how many shoes did we sell in Europe this year?" This request would require heavy searching of sales orders and customer address records. If the transactional system does not have a key or index on the country in the address record, the query would most likely be very slow.


Data used for analysis needs to be organized around subjects: customers, sales, and products. Because of the design that will be used for the data warehouse or data mart, if you ask for the sales in Europe, you search through very few records to get the answer to your question.

Integrated

Integrated refers to de-duplicating data and merging it from many sources into one consistent location. If you are going to find your top 20 customers, you must know that HP and Hewlett Packard are the same company. There must be one customer number for any form of Hewlett-Packard in your data warehouse or mart. You can't have one from the sales system that is alphanumeric, and one from the shipping system that is a concatenation of their company name and phone number.

Much of the transformation and loading work that goes into the data warehouse or mart occurs here, integrating data and standardizing it. There are often problems found in the source data that need to be corrected as well, such as absence of data.
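To make this integration step concrete, here is a minimal Python sketch of the kind of name standardization and customer-number assignment described above; the alias table and the numbering scheme are invented purely for illustration.

# Illustrative only: the alias map and numbering are hypothetical.
CUSTOMER_ALIASES = {
    "HP": "HEWLETT-PACKARD",
    "HEWLETT PACKARD": "HEWLETT-PACKARD",
    "CHASE": "CHASE MANHATTAN BANK",
}

customer_numbers = {}   # standardized name -> one warehouse customer number
next_number = 1

def standardize(name):
    """Collapse spacing and case, then map known aliases to one spelling."""
    cleaned = " ".join(name.upper().split())
    return CUSTOMER_ALIASES.get(cleaned, cleaned)

def warehouse_customer_number(name):
    """Assign (or reuse) a single customer number for any form of the name."""
    global next_number
    std = standardize(name)
    if std not in customer_numbers:
        customer_numbers[std] = next_number
        next_number += 1
    return customer_numbers[std]

print(warehouse_customer_number("Hewlett Packard"))   # 1 (from the sales system)
print(warehouse_customer_number("HP"))                # 1 (same customer, shipping system)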

Time Referenced

The data warehouse user typically asks questions about prior states of the data. For example: "What was my backlog this month last year vs. this year?" or "How many refrigerator units were in my service department this time two years ago?" To answer these questions, you need to know what the order status was at a point in time last year, or what a service inventory count was at a historic point in time.

Data warehouses handle this technically by time stamping data as it is added to the warehouse and by continuing to append data to the warehouse without replacing or updating existing information. Many snapshots of data are moved into the warehouse, one on top of another.

In this way, you can accurately see the inventory count at a certain time, or the backorders outstanding on a specific day. This allows us to answer questions that compare past performance to the present, providing the ability to spot trends in the data and to see if these trends are consistent across different product lines. That makes it possible to better forecast future performance with facts rather than intuition.
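As a small illustration of this append-and-date-stamp approach, the Python sketch below loads daily inventory snapshots without ever updating earlier rows; the table and field names are assumptions.

from datetime import date

inventory_snapshots = []   # warehouse table: rows are appended, never updated

def load_snapshot(snapshot_date, counts):
    """Append one day's inventory counts, date stamped."""
    for product, qty in counts.items():
        inventory_snapshots.append(
            {"snapshot_date": snapshot_date, "product": product, "on_hand": qty})

load_snapshot(date(1999, 7, 3), {"X": 120, "Y": 40})
load_snapshot(date(1999, 7, 4), {"X": 95, "Y": 40})
load_snapshot(date(1999, 7, 5), {"X": 80, "Y": 55})

# "What was product X's inventory on the 4th of July?"
print([r for r in inventory_snapshots
       if r["product"] == "X" and r["snapshot_date"] == date(1999, 7, 4)])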

Nonvolatile - read only

The data warehouse contents are typically not modified or updated. There are several reasons for this:

First, there really is no reason to update a data warehouse. As mentioned, it is a snapshot in time. You would not want to go back and change history. If a data element is bad in the warehouse, you don't as a rule want to make changes to your company's data in the warehouse environment. You typically want to change it in the source (transactional) system or in the load process. In that way, when the next batch of data gets loaded, the change continues to be implemented, and the history of the change is recorded.

Second, the data warehouse model is designed for queries and bulk loads; updates can be very expensive. Data warehouse models often replicate data in many places (de-normalize the data). An example of this would be a Health Insurance Claim Data Warehouse. Each insured member belongs to an insurance plan and each plan belongs to a group. You may set up a table listing each Plan and the Group it belongs to. A Group is now repeated several times in the database, once for each Plan that belongs to it. This design is used to make it easy to query via plan or group. If, however, you need to change a group name, you have to update many locations, and that can be very costly.


Third, data warehouse environments are large by nature. They contain historical data, summarized data and very detailed data. Typically these environments take up gigabytes of space. Updating gigabytes of data can be very costly.

Fourth, the data warehouse environment is heavily indexed. To allow for the slicing and dicing of data that an analyst requires, many indexes are placed on the data. To update these environments is very CPU intensive.

In short, data should simply be loaded, then accessed, never updated.

The Data Model: Re-architected database - Star schema.

Transactional systems are designed to push one transaction at a time through the system at high speed. The database organization is designed for processing efficiency, with minimum redundancy. This "normalized" database organization is complex and very inefficient for data analysis purposes.

The data warehouse adopts a data model called a star or snowflake schema that maximizes the performance of queries. The star consists of one large central fact table (data set) in the center of the star. It is the only table (data set) in the schema with multiple joins (paths), each connecting it to other satellite tables (data sets) called "dimension" tables.

The central fact table of a star schema is where the "numerical measurements of the business are stored. Each of these measurements is taken at the intersection of all the dimensions." (Ralph Kimball)

The fact table is linked to several dimension tables that contain data describing more detail about the fact they are linked to. These dimensions are used to query quickly into the fact table via keys - the fields in these dimensions are your query data points.

The data in this organization is deliberately designed for redundancy so that users will find it easy to create queries, and so those queries can run very fast, with minimal linking between tables.

If a database is going to be used for analysis, it should be designed for analysis - hence the star schema. See the following graphic for an example of a star schema. The fact table is INVOICE_LINES. The rest of the tables are dimensions.


You can see from the diagram above that it is easy to find sales using various combinations of criteria such as a particular product for a calendar month. You can select all of the invoices for that calendar month then easily find the parts you are inquiring about. What could have been a large serial pass to read through millions of records in a transactional system turns into a finely tuned pass through a fraction of the records using the star schema data model.
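As a rough illustration of why the star cuts the work down, here is a minimal Python sketch of an INVOICE_LINES fact table with Products and Periods dimensions; the column names and rows are assumed for the example, not taken from the client's actual schema.

# Tiny in-memory star schema (illustrative; field names are assumed).
products = {1: {"part_no": "A-100", "family": "Widgets"},
            2: {"part_no": "B-200", "family": "Gadgets"}}

periods = {19990501: {"month": "1999-05", "quarter": "1999-Q2"},
           19990602: {"month": "1999-06", "quarter": "1999-Q2"}}

invoice_lines = [  # the fact table: foreign keys plus numeric facts
    {"product_id": 1, "period_id": 19990501, "qty": 10, "billed_amt": 500.0},
    {"product_id": 2, "period_id": 19990501, "qty": 3, "billed_amt": 240.0},
    {"product_id": 1, "period_id": 19990602, "qty": 7, "billed_amt": 350.0},
]

# "Sales of part A-100 in May 1999": resolve each dimension to its keys,
# then touch only the qualifying fact rows instead of scanning every invoice.
product_keys = {k for k, p in products.items() if p["part_no"] == "A-100"}
period_keys = {k for k, d in periods.items() if d["month"] == "1999-05"}

total = sum(line["billed_amt"] for line in invoice_lines
            if line["product_id"] in product_keys
            and line["period_id"] in period_keys)
print(total)   # 500.0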

Relational databases may be used to house these star schemas, and until recently, these databases were the only environments accessible by the popular end-user tools that did the analysis needed for decision support. With DISC's introduction of specialized OMNIDEX indexes for data warehousing, IMAGE/SQL is now an advantageous option as a data warehouse repository. In fact, the capabilities of an IMAGE database enhanced with OMNIDEX go far beyond standard relational database technology.

The Environment

To satisfy the goals of the data warehouse and to accommodate the new data structures, several types of tools are used today to populate the data warehouse, house the data warehouse, index the data for query optimization, and allow fast user access.


The diagram above shows the different technologies used. Reading from right to left, there are extraction processes to move data out of the transactional systems and into the warehouse. There is the database management system that houses the warehouse. There are indexes added to the database to optimize the queries. And there are the user access tools.

Extraction - ETL

There are two ways to populate a data warehouse. You can use custom programs written in a language like Cobol. Or you can use tools built specifically to move data.

Custom programs have some advantages: You may have a staff of programmers who are familiar with a language. Or you may already have many processes written that you can leverage to extract data from your transactional systems.

The disadvantage of custom programs is the maintenance cost. Data warehouses are constantly changing, and if you are using a third generation language to populate the warehouse, it will need to be re-compiled each time you make a change. Each change made to the mapping and transformation similarly requires more time to implement in a third generation language than in a tool written specifically to transform data.

ETL tools used to populate a data warehouse must be able to perform the following functions:

1. Accept data from many sources
2. Move data to many targets
3. Handle any needed data transformations
4. Allow inserts, deletions or updates of any data
5. Move data in batch
6. Perform fast loads
7. Handle any network in the mix
8. Keep track of mapping - metadata
9. You may also want to be able to detect changes and incrementally update the warehouse.

Sources and targets must not be limited to databases; they must also include external input that may arrive via files. The sources and targets must all be accessible for inserts, updates and deletes if an ETL tool is going to handle all of the data manipulation you will require.

Moving data can be done all at once or incrementally. Take, for example, a data warehouse that is refreshed each night with new information. You can drop the existing data and reload all of your history (drop and load). The advantage of this method is that the logic is simple. The disadvantage is that if you have a lot of data, it can be too time consuming. It can take more time than the batch window available to populate the target.

Change-oriented extraction grabs only the data elements that have been added, updated or deleted since the last refresh of the data warehouse. The advantage of this method is that the volume of data is significantly less than a full refresh. The disadvantage is that you have to find a way to make sure you track all changes. Using a date stamp on a record only traps the inserts and updates to that table or dataset. Deletes are lost. You must either use "triggers" in a relational database or find a way to detect change in a non-relational database to ensure that you retrieve all changes to the data, including any deletes.
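The Python sketch below shows one way a change-oriented extract can be organized, assuming a hypothetical change log that records inserts, updates and deletes; a timestamp column alone could not reveal the delete.

from datetime import datetime

# Hypothetical change log, e.g. populated by database triggers or a
# change-detect tool; a plain "last updated" column would miss the delete.
change_log = [
    {"op": "insert", "key": 101, "row": {"key": 101, "qty": 5}, "ts": datetime(1999, 6, 1, 23, 5)},
    {"op": "update", "key": 42, "row": {"key": 42, "qty": 9}, "ts": datetime(1999, 6, 1, 23, 7)},
    {"op": "delete", "key": 7, "row": None, "ts": datetime(1999, 6, 1, 23, 9)},
]

def incremental_refresh(target, last_load_time):
    """Apply only the changes made since the previous load."""
    for change in change_log:
        if change["ts"] <= last_load_time:
            continue
        if change["op"] == "delete":
            target.pop(change["key"], None)
        else:                       # insert or update
            target[change["key"]] = change["row"]

mart = {42: {"key": 42, "qty": 3}, 7: {"key": 7, "qty": 1}}
incremental_refresh(mart, datetime(1999, 6, 1, 0, 0))
print(sorted(mart))   # [42, 101] - row 7 was deleted, 42 updated, 101 added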

Data transformations required in populating a data warehouse need to be as robust and encompassing as third generation languages. Data needs to be cleansed, integrated, standardized, subjected to business rules, manipulated logically and physically, and flagged when incomplete or in error. Problems in the data are typically flagged and corrected early on in the process. The data transformation process is not only a means to reorganize the data, it is a QA process.

The ETL tool must be able to take advantage of specific DBMS features to facilitate fast loads of large amounts of data. It should be able to move data across the network without requiring an additional software layer that can slow performance.

Database management system and indexing

There are two types of databases typically used to house data warehouses: relational databases and multidimensional databases. Relational databases are familiar to most of us: Oracle, DB2, and SQL Server. When using a star schema model, these databases can provide the accessibility to the data warehouse required by the users. Over the last few years, these environments have been enhanced by their vendors to improve multi-table queries.

Multidimensional databases were developed to accommodate more ad-hoc queries. They handle unpredictable access and analysis of the data they contain, allowing the data to be examined in many ways with optimal performance. The disadvantage of multidimensional databases is that they cannot handle more than 5 or 6 dimensions before the build and refresh times become unmanageable. Multidimensional databases do not support updates, take a long time to load, do not handle as much data as standard relational databases (typically only 50 GB), and do not allow data summaries at levels that are not pre-defined.

Many users choose to use both relational and multidimensional databases in their data warehouse architectures. Often the foundation or data warehouse will use a relational database, while the data mart or single subject area will be populated in a multidimensional (Cube) model.


There are also tools available which can enhance databases (including IMAGE/SQL) with speed and capabilities that go beyond relational and multidimensional databases, while eliminating their limitations. As an example, OMNIDEX for Data Warehousing provides unlimited multidimensional analysis across any number of dimensions.

OMNIDEX for Data Warehousing also eliminates the need for summary tables in relational databases, replacing them with high-speed dynamic data summaries. Plus it provides fast access to the underlying detail data.

End user environment

There are many kinds of questions you can ask of your corporate data warehouse. These seem to fall into three categories of access methods and tools:

Reporting tools provide pre-defined formats and content of data, either to the desktop or as printed reports. Some of these report writers come from a production reporting background; others are newer, web based offerings. Typically, queries are pre-written and executed by users with parameters or templates. Products that fall into this category are Cognos, Actuate, SQR and Microsoft Access.

Online Analytical Processing (OLAP) tools allow you to perform multidimensional analysis of the data warehouse. Typically, OLAP lets you answer not only the "what happened" questions that reporting tools provide, but also "why?". To find out why your sales for May were down 50% from last year, you need to learn more about what comprises that total sales amount you see on your screen. If you can drill down into the products sold -- where they were sold, what customers bought them, what sales reps sold them -- you may be able to find out why your sales are down and hopefully change this. These features (drill down, drill up, pivoting) are found in OLAP products.

Arbor, Business Objects, Brio, Microstrategy, Information Advantage, Microsoft and Cognos all offer OLAP tools.

Data mining tools use artificial intelligence to predict what may happen in the future. They use complex statistical analysis to look at the data in your data warehouse, spot trends and make an educated guess at what will happen next. Examples of data mining tool companies are Datamind, NHC Software, and IBM.

Building a data mart/data warehouse step by step.

Briefly stated, the steps involved in building a data warehouse or data mart are these:

1. Analyze the business need and determine value.
2. Pick the scope of initial deployment.
3. Build the target model.
4. Map data from production to the mart.
5. Extract operational and external data and load the mart.
6. Optimize queries into the mart.
7. Deploy end user interface.
8. Evaluate value of the mart to the end user.

Our Case Study

This section will use the case study of a company that distributes a range of imported electronic components throughout North America. The sales channel has both direct reps and distributors.


We initially started working with this company because they thought that they wanted to create a consolidated reporting environment which merged information from several systems into one central database. They initially were interested in allowing users access to their sales data for ad hoc queries.

Much of their in-house expertise is on the HP3000, so they elected to put a data mart on an HP3000. DecisionVault, a data warehousing technology bundle specifically designed for HP3000 customers, was selected to build and populate the data mart.

DecisionVault comprises BridgeWare from Taurus and Quest, OMNIDEX from DISC, BrioQuery from Brio Technology, and the HP3000 with IMAGE/SQL. It allowed the customer to create a high-performance, very cost competitive data warehouse environment on an HP3000.

Our Case Study - Step 1 Analyze the business needs and determine value.

We first met with our client's CIO and MIS staff. They initially were interested in a consolidated reporting environment to satisfy many of the reporting requests coming from users. They estimated that they had two people working full time writing reports and were hoping to free up these resources.

The goal of the needs analysis for a data warehouse or mart is to get executive sponsorship, and a clear definition of measurable business issues which can be addressed.

It is critically important for MIS to work with their users and find the pains they are facing. This accomplishes three things:

1. MIS knows what the users want.
2. MIS has user buy-in on the project.
3. MIS will be able to get user funding without having to use the MIS budget.

Finding Pain: A fast way to failure in a data warehousing project is to ask the users for a list of the data items they want to see on reports. Business issues or Pains that VPs face are what is needed. If we can't find a business issue, let's spend the time and money on something else. While IT or users may point to areas of Pain, the only way to specify and value Pain is to uncover it personally with the top executive team.

Find an executive Sponsor. Without Pain you won't find an Executive Sponsor. Even with Pain, you need a firm commitment from one of the three or four top executives in the company (CEO, CFO, VP Sales, VP Operations, VP Manufacturing).

The main reason for failure in Data Warehousing is lack of one or both of these key requirements!

Conducting interviews with users to uncover pains: Only after you have formal support from an executive should you go to the users. Concentrate on the direct reports of your Executive Sponsor and their staffs. Ask these interviewees whom they interact with in other functions about the Pain, and go interview those people with a view to uncovering new Pain that you can raise with another VP.

The Pain and Interview process is a skilled task. Do not attempt to do this without training or at least without having read deeply into Kimball's material on the subject. Where you can, use an experienced business consultant to help you.

In our example company, the VP of Sales was directly paid under a management by objective (MBO) scheme. His targets included not only the traditional Sales measures, but also Gross Margin on Product.


He couldn't measure his progress accurately from any data on hand, so he couldn't be sure he was claiming all of his due compensation.

Personal pain like this can be a strong motivator despite the VP's reluctance to articulate the dollar impact. Underlying his personal issue was the knowledge that he was largely in the dark about who was selling what, particularly sales to large customers who could buy either from his salespeople or from his distributors, with widely varying margin contributions.

In addition, the CFO had suspected that gross product margins varied widely. From a prototype built on billing data, he saw spreads on functionally similar products of up to 20%. He agreed immediately that the average margin could be improved by 5% with appropriate knowledge and training. The 5% translated to $5 million in EBIT, which would give an ROI of twenty times the cost of the initial data mart.

Our Case Study - Step 2 Pick the scope of initial deployment (data mart vs. warehouse).

No matter how much ROI you uncover, be sure to start with one data mart, with one functional area; say Billings, or Bookings. Start with the simplest, get the mart operational and prove your ROI before you even start the next mart. Small bites of a big meal prevent indigestion.

Start with the end in mind, so lay down your first data mart on a foundation that will later support other data marts. Don't extract directly from production to the mart, but by way of a staging area or central data warehouse that other data marts can later draw on.

Begin by listing the potential marts. The process of uncovering an ROI will also uncover the functional areas that merit a data mart, as well as the priority. In the example company, a number of areas promised some ROI: Billings, Bookings, Backlog, Inventory and Finances. All of these are candidate marts that will make up a company wide data warehouse in time.

For each mart, you need to discover what elements are most important for analysis. This is done by interviewing the users. You will probably find that the CFO is interested in sales and gross margins by product and by all the descriptors that can apply to a product, including product family -- line, group, subcategory, color, size, supplier and so on. Most products have upwards of 100 descriptors that can reveal areas for margin improvement.

In dimensional modeling, you call the areas of analysis, such as customer and product, the DIMENSIONs. The descriptors are called attributes, and they become fields in the dimension tables (data sets). Even though you select one data mart to start, you define the dimensions across all the marts at the outset because you want CONFORMED dimensions that all the marts can share. For example, you will need the same product dimension for a billings mart as you will later want for a bookings mart and for a backlog mart.

Listing the marts will also uncover the facts or numeric measures that underpin the ROI you are seeking. Almost invariably, facts are numeric and not textual. Text indicators almost certainly belong in a dimension as an attribute.

In the case of the CFO, gross margin down to the part number will be required. You need to discover how to compute that margin. Is it list price minus cost, or invoiced price less cost? Is the cost acquisition cost, or that cost plus carrying costs, or are you adding customer service costs by product, or should you include freight and insurance?

For our project, numeric indicators were chosen for Billings; these are the items that we now call FACTS and represent in a FACT table. Cost of Goods was also defined as acquisition cost.

Billings was the chosen area because it offered the most ROI in a relatively simple area of the business. You can also be sure that the data is reasonably clean since customers generally complain if their invoices are inaccurate.


Our Case Study - Step 3 Build the target model.

Building a target model can be broken into four steps which we will cover in more detail.

Selecting the grain for the fact table
Selecting the dimensions
Selecting the individual facts for the fact table
Completing the design and implementing it

Select the grain for the Fact Table

The "grain" or "granularity" is the lowest level of detail at which facts are held.

Our VP of Sales was very interested in sales by customer by fiscal quarter since he gets paid that way. But, if your grain were higher, such as by customer by quarter, you would never be able to see analyses by sales person or by product or by week or by invoice number. You generally want to hold facts at the deepest level of detail that is possible within the limitations of the source systems providing the data. From that base level, you can easily provide aggregates, perhaps by week, month, quarter and year.

In our project's Billings Mart, the details found on the individual lines of each invoice, supplemented from the invoice header, were chosen to provide the finest granularity and therefore the greatest flexibility in analysis.
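The short Python sketch below illustrates why holding facts at the invoice-line grain keeps every roll-up available; the rows and field names are invented for the example.

from collections import defaultdict

# Facts held at the invoice-line grain (hypothetical rows).
invoice_lines = [
    {"invoice": "I-1001", "line": 1, "customer": "ACME", "rep": "Lee",
     "month": "1999-05", "qty": 10, "billed_amt": 500.0},
    {"invoice": "I-1001", "line": 2, "customer": "ACME", "rep": "Lee",
     "month": "1999-05", "qty": 2, "billed_amt": 80.0},
    {"invoice": "I-1002", "line": 1, "customer": "BETA", "rep": "Kim",
     "month": "1999-06", "qty": 4, "billed_amt": 200.0},
]

def roll_up(lines, dimension):
    """Aggregate billed amounts up from the base grain to any dimension."""
    totals = defaultdict(float)
    for row in lines:
        totals[row[dimension]] += row["billed_amt"]
    return dict(totals)

print(roll_up(invoice_lines, "customer"))  # by customer
print(roll_up(invoice_lines, "rep"))       # by sales rep
print(roll_up(invoice_lines, "month"))     # by month
# Had the facts been stored by customer and quarter only, the rep and
# invoice-level views above could never be produced.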

Select the Dimensions

When you determine the scope of the project in the section above, you list all of the attributes by which you think you will want to perform analysis, such as by customer, geographic regions, time, product descriptions. These are the dimensions or points on your star schema.

Now you select the subset of dimensions that is meaningful in describing and amplifying the meaning of your selected Fact Table.

For our Invoice Lines Fact Table, our client required dimension tables representing Customers, Products, Sales People, Addresses and Time.

You saw above some examples of the attributes for the Product dimension. Time in the form of a Periods table allows every individual day that the Billings Mart may cover to be described, so you can analyze billings by those attributes, down to the day. Each day clearly falls into a given calendar year, quarter, month, and week. If fiscal year does not match calendar year, add fiscal year, quarter, month, and week. You can also have attributes for a day that describe the season, say Back to School and Easter. You could have an attribute for an annual shutdown, to explain why nothing happened in a given week.
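Because the Periods table is purely derivable, it can be generated once, up front. The Python sketch below builds one row per day with calendar attributes; the fiscal-year offset and season rule shown are assumptions, not the client's actual calendar.

from datetime import date, timedelta

def build_periods(start, end):
    """One row per day with calendar attributes. The fiscal rules below
    assume a fiscal year starting in October, purely for illustration."""
    rows = []
    day = start
    while day <= end:
        quarter = (day.month - 1) // 3 + 1
        fiscal_month = (day.month + 2) % 12 + 1            # assumed Oct start
        fiscal_year = day.year + (1 if day.month >= 10 else 0)
        rows.append({
            "period_id": int(day.strftime("%Y%m%d")),      # surrogate-style key
            "day": day,
            "calendar_year": day.year,
            "calendar_quarter": quarter,
            "calendar_month": day.month,
            "calendar_week": int(day.strftime("%W")),
            "fiscal_year": fiscal_year,
            "fiscal_month": fiscal_month,
            "season": "Back to School" if day.month in (8, 9) else None,
        })
        day += timedelta(days=1)
    return rows

periods = build_periods(date(1999, 1, 1), date(1999, 12, 31))
print(len(periods), periods[0]["calendar_quarter"])   # 365 1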

Select the individual facts for the Fact Table

The Fact Table will contain pointers to the Dimensions that were selected. In this case, an Invoice Header tells you what customer was billed, at what billing address, and who took the order. These pointers (foreign keys in database talk) point to the rows (records) in the Customer, Address and Sales People Dimension Tables. These foreign keys were put in the detailed row (record) in the Fact Table. A detailed fact row (record) for each invoice number and invoice line number was the chosen grain for this data mart.


You now need to add the numeric indicators, the individual facts for each detailed fact row. You know that quantity of product billed and the individual monetary amounts on an invoice line will always sum up correctly to the invoice level or the month level, so those facts are fully additive, and you can insert them in your row design. Our billings mart allowed for the quantity, the list price, and the billed price.

Note that some numeric indicators, such as levels or balances, are not fully additive across all dimensions. For example, adding a percentage discount as a fact in our invoice line would be meaningless when we summarize. Averages and percentages are not additive at all and shouldn't be facts. You can re-compute ratios whenever you need them, so we didn't include any of these measures.

You can take numeric data as you find it in the invoicing system, such as quantity and billed amount, and you can add computed facts if you want. In this case, you might find it useful to compute the difference between list price and billed price as the Discount Given amount, and the difference between the billed price and Cost of Goods Sold as the Gross Profit amount. Storing these amounts in the Fact record saves the analysis programs effort in computing them on the fly.
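A minimal Python sketch of deriving those computed facts at load time follows; the field names are illustrative, not the client's actual layout.

def build_fact_row(invoice_line):
    """Compute Discount Given and Gross Profit once, at load time, so the
    analysis tools never have to derive them on the fly."""
    qty = invoice_line["qty"]
    list_amt = invoice_line["list_price"] * qty
    billed_amt = invoice_line["billed_price"] * qty
    cogs_amt = invoice_line["unit_cost"] * qty        # acquisition cost only
    return {
        "qty": qty,
        "list_amt": list_amt,
        "billed_amt": billed_amt,
        "discount_given": list_amt - billed_amt,
        "gross_profit": billed_amt - cogs_amt,
    }

print(build_fact_row({"qty": 10, "list_price": 55.0,
                      "billed_price": 50.0, "unit_cost": 30.0}))
# {'qty': 10, 'list_amt': 550.0, 'billed_amt': 500.0,
#  'discount_given': 50.0, 'gross_profit': 200.0}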

Complete the design and implement it.

Our design now looked like this. ERwin can be used to help visualize the model.

From here, we implemented surrogate keys instead of the actual customer and product numbers to link the tables. As you can imagine, there are many reasons not to use the "real" numbers that identify customers and products or any of the foreign keys that point to the dimension tables. You may re-use those key values in the transactional system while the number still points to old data in the data mart.

You may add a division to the data mart that has a customer under a different "real" number. Or, say you have acquired a company for their customers, and you want to count the sales by both companies under one customer's name.

You may want to have two rows of dimension data that describe different attributes for different periods. Say your rep sells a new customer. In the data mart, you assign that sale to that rep, and his team, group and division. Two years later, you re-assign the customer to another rep and division. If you just change the one existing row, that customer will disappear from the original rep's sales numbers and appear in the new rep's numbers, even though they might not have worked for the company at that time.

For disk space and speed of join processing, it's more efficient to use small integers for your keys rather than the long text strings that the "real" numbers may have.

To neatly handle these situations, use surrogate keys that are artificial and never seen by the user. This adds some complexity each time you want to add a new dimension row, say to handle the sales rep problem above. It is well worth the effort. You will want to implement surrogate keys for all the dimension tables, including the Time dimension.

CUST_ID, and CUST_ADDR_ID listed above are examples of surrogate keys.
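The Python sketch below shows the kind of surrogate-key lookup a load process maintains; the table and field names are assumptions for illustration.

customer_key = {}    # (source_system, real_customer_number) -> surrogate CUST_ID
next_cust_id = 1

def surrogate_for(source_system, real_number):
    """Return the existing surrogate key, or assign a new small integer.
    The 'real' customer number is never used as the join key in the mart."""
    global next_cust_id
    key = (source_system, real_number)
    if key not in customer_key:
        customer_key[key] = next_cust_id
        next_cust_id += 1
    return customer_key[key]

print(surrogate_for("SRCAPPL", "C-000123"))   # 1  (new customer)
print(surrogate_for("SRCAPPL", "C-000123"))   # 1  (same customer, same key)
print(surrogate_for("ACQUIRED", "C-000123"))  # 2  (re-used "real" number,
                                              #     different surrogate)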

The last step is to physically build the empty files for your Fact and Dimension Tables, so that you can load and test them with data. In contrast to designing a data warehouse using a relational database, we did not need to design and build summary tables to aggregate the data at this point, since our client used OMNIDEX as part of DecisionVault. OMNIDEX will provide the data summaries at high speed using its own summary indexes.

Our Case Study - Step 4 Map the data from production to Mart

Now that we had built the target mart schema it was time to populate it.

The first step of every data movement project is to develop a mapping document. The mapping document details where the data is coming from; transformations, editing rules, lookups needed; and what needs to happen if an error occurs. This document becomes the specification for the extraction routines.

The mapping document is done after the data model for the data mart or warehouse is complete. For the purposes of illustration, let's look at a simplified version of our data model. In this version, we have only a single dimension table and the fact table.


Loading of the data mart begins with the dimensions, so we started there.

Create a Customer Dimension Mapping Document

Working with the data analyst most familiar with the source data, we found a source for each of the target fields. To develop the transformation rules, it is sometimes helpful to print the data for this dimension, e.g. a list of all customers. Normally seeing sample data will help identify any problem areas, e.g. customers without addresses, etc.

Together with the data analyst, we developed a minimum criterion for the acceptance of a dimension record. Our client wanted to accept all customers. Don't worry; moving all the data over to the data mart is a self-correcting problem. As executive management begins reviewing the data and finds junk, they will either impose "their" rules or demand that the source data be corrected. Either way, the data will get "cleaned" up.

Our project's mapping document was developed in Excel. You could use any tool that works for you. Excel is simple to use as it is easy to "paste" sample data onto a worksheet, and it is easy to email the documents between the parties involved in developing the data warehouse. Below is our mapping document for the customer dimension.


You can see in this document that it took four source tables to provide enough data to populate the dimension table. For those fields that required data from other tables, we made sure to include information on how to make those linkages from our primary source table (cust) to the other source tables (parent, cust_seg, and cust_commcn).

If there is decoding involved, either create a lookup table or provide another sheet that details the translation, e.g. this source code gets this description. The mapping document should be complete enough that an outside programmer could write code from the document.

You should write the mapping documents for all of the dimensions first and then move to the Fact Table(s). The Fact Table mapping will assume that all of the dimensions exist in the data model before the fact is added. In other words, you shouldn't have to check if the customer exists, because there shouldn't be any invoices added that do not have a customer. If your source system doesn't check this referential integrity, either as part of the application or as part of the database structure, you will need to include the rules needed to ensure that the dimensional data is in place prior to writing a fact row. The mapping document is a "working" document that will change as you discover different aspects of the data, both before and during the load process.
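If the source system does not enforce that rule, the load process can. Here is a Python sketch of the kind of check the fact mapping would include; the names and rows are illustrative only.

fact_table = []   # invoice-line facts already accepted into the mart
rejects = []      # rows parked for review instead of orphaned in the mart
customers = {1: {"cust_id": 1, "name": "ACME"}}   # customer dimension

def load_fact_row(fact_row):
    """Write an invoice-line fact only if its customer dimension row exists."""
    if fact_row["cust_id"] not in customers:
        rejects.append(fact_row)          # flag for correction at the source
        return False
    fact_table.append(fact_row)
    return True

load_fact_row({"cust_id": 1, "invoice": "I-1001", "billed_amt": 500.0})
load_fact_row({"cust_id": 99, "invoice": "I-1002", "billed_amt": 75.0})
print(len(fact_table), len(rejects))   # 1 1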

Our Case Study - Step 5 Extract operational and external data and load the mart.

Once the mapping document for a particular dimension is complete, it is time to begin the development of the extraction routines. All data marts begin with an initial load that loads the entire applicable contents of the transactional source application into the data mart. This is a bulk load process that starts with the dimensions and then loads the fact tables.


As with most initial loads for data marts, we started with the dimensions first. The customer data is key to the model and is the cornerstone of all of the information within the mart. When we talked to the users about their customer data, they indicated that the data was clean for a number of reasons: product had been shipped to the address in the database; the application has an automatic de-duping feature to ensure that duplicate catalogs are not sent to the same address; and address fields are automatically updated by the application.

Deciding on a refresh frequency:

Once the initial load is complete, deciding on a methodology to keep the data mart up-to-date is essential. The choices are to reload the data every month, week or night, or to develop an incremental load scheme. The advantage of the reload process is that the same procedures that did the initial load can be used without any changes. The disadvantage is that the time window available to reload the data may not be long enough. In addition to this problem, the individual incremental changes, e.g. inventory changing from 10 to 20 with a net change of 10, are lost. You will just be left with the end-of-day inventory levels. Depending on what kinds of questions your data is trying to answer, this may not be acceptable.

Our client started with a full load process which was executed weekly. As the users started really using the system, they asked for a daily refresh. They wanted to query with yesterday's results in the billings mart, since they were using it for reporting as well as analysis. In their case they did not have enough time overnight to add another process which moved millions of records across the network and into the mart. They had to have an incremental update process.

This was one reason they decided to use DecisionVault; the technology bundle allows for both types of updating strategies. Taurus' BridgeWare product captures the transactions that have modified the selected databases or files. The transactions can then be applied to the data mart. This update can be done either in real-time or at some interval that you set up.

Initial data load:

Let's look at the first dimension, CUSTOMERS. The source system for this application was an Oracle database. This meant that the data was already edited for type, meaning you could depend on numeric data having numeric values, dates having valid date values, and character data being already upshifted and stripped of trailing spaces. The data cleansing issues were then reduced to finding duplicate customers. During your review of the customer data, you might find duplicated customers. Our client elected to load all duplicate customers and force data correction on the source side.

BridgeWare data mapping and movement could have been done either using the Forklift GUI development tool or developing a script in the editor. When we developed scripts, we tried to break them into small pieces (modules) and then put all the modules together at the end. The scripting broke into three major components: accessing files, selecting data, and the data movement actions. Let's examine each of these components separately.

The scripting language has all the power of a third generation programming language. The scripting language is broken into three major components: accessing files, selecting data, and the action to be taken on the data. To access a file, you either create it or open it. Open is used for those files that already exist. In this situation, you are accessing an Oracle database (your source) and an Image database (your target). The syntax of the open statement is:

open TAG [remote MACHINE LOGININFO] databasetype DATABASEPARMS.


In our case the client was accessing their source locally and writing to the target remotely:

open PROD ORACLE PROD/pass &
  home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image dw &
  pass=secret mode=1

Once the files are open for access, we tell BridgeWare what data we want from those files. In this situation, we wanted all of the customer information. There was no selection or filtering done. Selecting or reading is accomplished with the read statement. The syntax of the read statement is:

read TAG = DBTAG.DATASET [for SELECTION].

In this case, we read the CUST table from the source database called PROD in the open statement. The read statement was:

read SRC_CUST = PROD.CUST

As you remember from the mapping document, we needed to read four source records to populate the single data mart table, CUSTOMERS. BridgeWare allows the creation of local record variables to hold the various pieces until you have a completed record. To create a local record variable, use the define statement. The define statement can be told to use the record layout of a file that is already open using the using clause. To save us from having to define each item, we used this clause:

define CUST using DW.CUSTOMERS

As the various data become available, we set the various fields to their appropriate values using the setvar statement. For example:

setvar CUST.PARENT_ID = null_id
setvar CUST.PARENT_NBR = null_nbr
setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME

When we had the various pieces, we could copy the record from the local record variable to the target dataset. This is done using the copy statement. The syntax for the copy statement is: copy tag to dbtag.dataset. So for our example it was:

copy cust to dw.customers

An incomplete script to read data from our source and copy it to our target looks like this:

open PROD ORACLE PROD/pass &
  home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image dw &
  pass=secret mode=1

define CUST using DW.CUSTOMERS

read SRC_CUST = PROD.CUST
  setvar CUST.PARENT_ID = null_id
  setvar CUST.PARENT_NBR = null_nbr
  setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME
  copy cust to dw.customers
endread


Let's expand the script to match our project's real world.

open PROD ORACLE PROD/pass &
  home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image &
  dw pass=secret mode=1

define CUST_KEY using DW.CUSTOMER_KEY
define CUST using DW.CUSTOMERS
define PARENT_KEY using DW.PARENT_KEY
define PARENTS using DW.PARENTS

define add_entry : boolean
define error_flag : boolean
define no_id_flag : boolean

define null_nbr : oracle varchar2(12) value 'NONE'
define null_id : oracle number(11)

***Add any new Customer Numbers to the
*  Customer key table
*  (Generate the Surrogate Customer Id)
**

read SRC_CUST = PROD.CUST
  setvar add_entry = $true

  read C_KEY = DW.CUSTOMER_KEY &
      for C_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO &
      and C_KEY.SOURCE_SYSTEM = 'SRCAPPL'
    setvar add_entry = $false
  endread

  if add_entry = $true then
    setvar CUST_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO
    setvar CUST_KEY.SOURCE_SYSTEM = 'SRCAPPL'
    setvar CUST_KEY.LAST_UPDATED = $now
    copy CUST_KEY to DW.CUSTOMER_KEY
  endif

  setvar CUST.CUST_ID = 0
  read C_KEY = DW.CUSTOMER_KEY &
      for C_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO
    setvar CUST.CUST_ID = C_KEY.CUST_ID
  endread

  setvar no_id_flag = $true
  if CUST.CUST_ID > 0 then
    setvar no_id_flag = $false
    setvar CUST.CUST_NBR = SRC_CUST.USER_CUST_NO

    read SRC_CUST_SEG = PROD.CUST_SEG &
        for SRC_CUST_SEG.CUST_NO = SRC_CUST.CUST_NO &
        and SRC_CUST_SEG.SEG_NO = 317
      setvar CUST.ACTIVE = SRC_CUST_SEG.ACTIVE_FLAG
    endread

    read SRC_CUST_COMMCN = PROD.CUST_COMMCN &
        for SRC_CUST_COMMCN.CUST_NO = SRC_CUST.CUST_NO
      setvar CUST.AREA_NBR = SRC_CUST_COMMCN.AREA_CD
      setvar CUST.PHONE_NBR = SRC_CUST_COMMCN.PHONE_NO
    endread

    setvar CUST.PARENT_ID = null_id
    setvar CUST.PARENT_NBR = null_nbr
    setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME

    read SRC_PARENT = PROD.CUST &
        for SRC_PARENT.CUST_NO = SRC_CUST.WITHIN_CUST_NO
      setvar add_entry = $true
      read P_KEY = DW.PARENT_KEY &
          for P_KEY.CUST_NBR = SRC_PARENT.USER_CUST_NO &
          and P_KEY.SOURCE_SYSTEM = 'SRCAPPL'
        setvar add_entry = $false
      endread

      if add_entry = $true then
        setvar PARENT_KEY.CUST_NBR = &
          SRC_PARENT.USER_CUST_NO
        setvar PARENT_KEY.SOURCE_SYSTEM = 'SRCAPPL'
        setvar PARENT_KEY.LAST_UPDATED = $now
        copy PARENT_KEY to DW.PARENT_KEY
      endif

      setvar CUST.PARENT_ID = 0
      read P_KEY = DW.PARENT_KEY &
          for P_KEY.CUST_NBR = SRC_PARENT.USER_CUST_NO &
          and P_KEY.SOURCE_SYSTEM = 'SRCAPPL'
        setvar CUST.PARENT_ID = P_KEY.PARENT_ID
      endread

      setvar add_entry = $true
      read PRNTS = DW.PARENTS &
          for PRNTS.PARENT_ID = CUST.PARENT_ID
        setvar add_entry = $false
      endread
      if add_entry = $true then
        setvar PARENTS.PARENT_ID = CUST.PARENT_ID
        setvar PARENTS.PARENT_NBR = &
          SRC_PARENT.USER_CUST_NO
        setvar PARENTS.PARENT_NAME = &
          SRC_PARENT.CUST_NAME
        setvar PARENTS.LAST_UPDATED = $now
        copy PARENTS to DW.PARENTS
      endif

      setvar CUST.PARENT_NBR = SRC_PARENT.USER_CUST_NO
      setvar CUST.PARENT_NAME = SRC_PARENT.CUST_NAME
    endread

    setvar CUST.CUST_NAME = SRC_CUST.CUST_NAME
    setvar CUST.CUST_TYPE = SRC_CUST.CUST_TYPE_CD
    setvar CUST.DUNS_NBR = SRC_CUST.DUNS_NO
    setvar CUST.DUNS_RATING = SRC_CUST.DUNS_RTNG
    setvar CUST.TAX_ID = SRC_CUST.FED_TAX_ID
    setvar CUST.SIC_CODE = SRC_CUST.SIC_CD
    setvar CUST.LAST_UPDATED = $now
    copy CUST to DW.CUSTOMERS
  endif

  if no_id_flag = $true then
    print 'Problem during load for customer', &
      src_cust.cust_no
  endif
endread

This script from our example first opens the source and target databases. It creates local record variables for those tables for which we are going to add records. There are four tables: cust_key (the customer's surrogate key), customers (contains the data as described in the mapping document in the previous section), parents (the customer's parent, e.g. HP is the parent for all the various division locations of HP), and parent_key (the parent's surrogate key). Notice that surrogate keys have been used. Surrogate keys are used when data can come from more than one system using the same "customer number" but meaning two different customers. The surrogate creates a unique identifier for the customer within the data mart.

Next, we defined local variables that are used as switches and intermediate storage.

The script then reads our source. It checks to see if this customer already exists in the data mart by checking the surrogate key table for customers, customer_key. If the customer doesn't exist, it creates the surrogate key and writes it out. Next it gets the status information from the cust_seg table in the source database and puts it in the record variable for the customer. It does the same thing for the customer's phone number, which comes from the cust_commcn table.

The next logic checks for a parent company. If there is a parent, it uses the parent's key. If there is no parent, it creates the parent record and puts an entry in the parent surrogate key dataset.

Finally, it assigns the remaining fields and copies the record to the customers dataset in the data mart.

Incremental load:

Once the data mart was loaded and began being used, it was clear that the customer wanted a daily refresh of data. They did not want to perform a full refresh each night because they did not have a batch window large enough for a full refresh.

DecisionVault's BridgeWare combines DataBridge with a real-time change-detection component for the HP3000. On relational databases it is easy to capture changes as they are posted, using triggers. IMAGE on the HP3000 has no such feature, and although IMAGE has logging, the log file does not capture enough information to apply changes to the data mart.

BridgeWare's technology is best explained using a diagram.


BridgeWare, a joint development with Quest Software, combines Quest's change-detection technology with Taurus' ETL. The Quest component, called SharePlex, detects changes in data files on the HP3000 and writes them to a message file. SharePlex works at the operating system level, detecting changes before they reach the data file. DataBridge is then used to define how the captured changes should be transformed and moved.

The message file holds the change records. Each record contains a header with the name of the file that changed, the type of transaction, a sequence number, and the date and time the transaction occurred. Each transaction type (insert, update, delete) writes a header plus the appropriate record images: an insert carries the image of the record inserted, a delete carries the image of the record before it was deleted, and an update carries both the before and after images.

As our client moved to an incremental load process, they had to consider how to handle each of the transaction types: adds, changes, and deletes. The rest of the load process remained the same.
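To make the three transaction types concrete, here is a minimal Python sketch of the apply step, assuming a change record shaped like the message-file description above; the ChangeRecord class, the single-letter transaction codes, and the target object's insert/update/delete methods are hypothetical, and in practice this work is done by the DataBridge script.

# Hedged sketch of applying message-file change records to the mart.
# The header fields mirror the description above (file name, transaction
# type, sequence number, timestamp); the before/after images carry the
# record contents. 'target' is any object exposing insert/update/delete.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeRecord:
    file_name: str                  # source data set that changed
    tx_type: str                    # assumed codes: 'I', 'U', 'D'
    sequence: int
    timestamp: str
    before: Optional[dict] = None   # image before the change (update, delete)
    after: Optional[dict] = None    # image after the change (insert, update)

def apply_change(rec, target):
    if rec.tx_type == "I":          # add: write the new record
        target.insert(rec.file_name, rec.after)
    elif rec.tx_type == "U":        # change: locate by before image, apply after image
        target.update(rec.file_name, key=rec.before, new_values=rec.after)
    elif rec.tx_type == "D":        # delete: locate by before image and remove
        target.delete(rec.file_name, key=rec.before)
    else:
        raise ValueError("unknown transaction type: %r" % rec.tx_type)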

Our Case Study - Step 6 Optimize queries into the mart.

Once data was loaded into the IMAGE/SQL data warehouse using BridgeWare, it was almost ready to be queried, analyzed, and retrieved by the end-users using BrioQuery. At this point, performance became an issue because users can be quite demanding about query retrieval times (think of user demands for response time on the web).

Query retrieval times are very important because the #1 complaint of data warehousing managers, according to Alan Paller of TDWI (The Data Warehousing Institute), is:

"I should have paid more attention to performance."

We evaluated our options to improve query performance. According to data warehousing literature, standard options to satisfy user query requests are to:

Read serially through large "fact" tables
Build summary tables (typically with relational databases)
Load the data into an MDD (multidimensional database) and query against it


Serial reads through large tables or data sets are slow. They are not viable for any significant amount of data. Even with extremely large expenditures on CPUs or parallel processors, they are typically not fast enough for online reporting with multiple concurrent users competing for resources.

At first glance, using a relational database and building summary tables is a possibility. Summary tables store pre-calculated and sorted values that can answer specific queries. But we were concerned that the number of tables needed to cover the queries that came out of our user interviews would be prohibitive, because a different summary table has to be built for almost every query.

As mentioned above, multidimensional databases are limited in the number of dimensions they can handle, and they cannot be updated. Typically only 5 or 6 dimensions can be used before the build and refresh times become unmanageable and another data mart must be built. This was not going to meet our users' requirements, since they wanted to expand the data mart into a data warehouse over time, and the refresh times would outgrow their nightly batch window.

Add advanced indexes

This is where DISC's OMNIDEX indexing software came in. OMNIDEX is included in DecisionVault, and our client had used it before as a high speed keyword search engine for the HP3000. In the past few years, OMNIDEX had been specifically enhanced for fast and flexible access into data warehouses and data marts. It provides additional access that is not possible with IMAGE/SQL (or Oracle or SQL Server databases) alone.

OMNIDEX accomplishes high-speed access for ad hoc queries by supplementing the existing database with multidimensional and aggregation indexes that are specifically designed for decision support queries. OMNIDEX provides:

High-speed data retrieval for query optimization
ODBC/Web middleware residing between the database and popular end-user tools such as Brio
Multidimensional analysis, with an unlimited number of dimensions
Dynamic data aggregations, without the need for summary tables
Instant drill-down to detail
Optimized for star or snowflake schema
Fast index build rate and compressed disk space

To show how OMNIDEX was used to deliver the query performance we needed on our data mart, let's look at the database's star schema design and the retrieval requirements.

The current model has a basic sales fact table and five (5) dimensions (refer again to the star schema diagram). The dimensions that can be sliced and diced in any combination are:

1. Customers
2. Addresses
3. Products
4. Periods
5. Salespeople

For our client, the primary purpose of the Billing/Sales Analysis data mart was to analyze customer characteristics and buying patterns, such as:

1. Customer demographics
2. Product performance
3. Sales rep performance
4. Regional preferences

Let's see how OMNIDEX handled multidimensional queries, data aggregations, and drill-down to detail for this client.

Multidimensional Queries

As with most data warehouse projects, we determined that most of the queries to be made were complex and multidimensional in nature, using multiple criteria against multiple fields or across multiple dimensions. For example, our end-users rarely wanted to access data by only one field or dimension, such as finding the number of customers in the state of California. They wanted to ask complex questions, such as how many customers in California purchased product B or D in the last year.

Our client's sales manager wanted to see the sum of sales for his sales reps for their new products during a certain time period such as last month. The inventory managers might want to see the quantity sold of a product for the last month and compare it to last year's. A marketing manager might want to see the sum of sales for some particular states and customer types.
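As an illustration only, one of these questions might be expressed as the following SQL issued over ODBC; the DSN, the fact table name INVOICE_LINES, its join columns, and the year are assumptions, while the dimension columns match the index definitions shown later in this section.

# Hedged sketch: "customers in CA who bought product line B or D in a given
# year", issued as SQL over ODBC against the star schema.
import pyodbc

conn = pyodbc.connect("DSN=SALES_MART")          # hypothetical data source name
sql = """
    SELECT COUNT(DISTINCT c.CUST_NBR)
    FROM   CUSTOMERS c, ADDRESSES a, INVOICE_LINES f, PRODUCTS p, PERIODS d
    WHERE  a.CUST_NBR            = c.CUST_NBR
      AND  f.CUST_NBR            = c.CUST_NBR
      AND  f.PRODUCT_ID          = p.PRODUCT_ID
      AND  f.ACTIVITY_DATE       = d.ACTIVITY_DATE
      AND  a.STATE_PROVINCE_CODE = 'CA'
      AND  p.PROD_LINE IN ('B', 'D')
      AND  d.CAL_YEAR            = 1998
"""
row = conn.cursor().execute(sql).fetchone()
print("Qualifying customers:", row[0])           # count resolved from the indexes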

We knew that the performance of these multidimensional queries in a data warehouse environment using a star or snowflake database design would be greatly enhanced through the use of OMNIDEX multidimensional indexes. OMNIDEX multidimensional indexes are specially designed for unlimited multidimensional access, allowing complex queries using any number of criteria or dimensions, which is exactly what we needed to satisfy our requirements.

We placed OMNIDEX Multidimensional indexes on each column in the dimension tables so the end-users could query by any combination. When an end-user issues a query, a qualifying count based on index access only is returned without touching the actual data. This speeds the query because it is much more efficient to service a query request by simply looking in an index or indexes instead of going to the primary source of data (the database itself). This is true for both pre-defined and ad hoc queries.

Because the end-users quickly obtain the results each time they perform an ad hoc search, they can decide whether to add criteria to further qualify the selection, aggregate it, drill down into the detail data, or start over. With the multidimensional indexes applied, our client's end-users were able to query repeatedly in seconds or minutes instead of hours or days, even with large amounts of data. This allowed them to interact with and explore the data in unpredictable ways, very quickly. Only when the end-users wanted to view the detail data was I/O performed against the database.

Data Aggregation

In addition to multidimensional queries, our client's end-users wanted to see data summaries (e.g., COUNT, SUM, AVERAGE, MINIMUM, MAXIMUM) such as total sales dollars, number of customers, average quantity, etc., and group them by a field such as sum of sales by region, or average quantity of product line B sold by sales rep.

OMNIDEX performs data summaries by using its Aggregation indexes only, summarizing quantities or amounts dynamically at the time of the query. Up to 1 million values per second can be queried.

We placed an OMNIDEX aggregation index on the fields to be summarized and were able to query at these very high speeds while still satisfying our client's requirement for heavy ad hoc inquiries.
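To illustrate, a dynamic aggregation of this kind might be expressed as follows over ODBC; the DSN, the fact table name INVOICE_LINES, and the BILLED_PRICE column are assumed names based on the metric list later in this section, while REGION_CODE and SALESREP_ID come from the SALESREPS dimension.

# Hedged sketch of a dynamic aggregation: total billed sales by sales region,
# computed at query time rather than read from a pre-built summary table.
import pyodbc

conn = pyodbc.connect("DSN=SALES_MART")   # hypothetical DSN, as in the earlier sketch
sql = """
    SELECT   s.REGION_CODE, SUM(f.BILLED_PRICE) AS TOTAL_SALES
    FROM     SALESREPS s, INVOICE_LINES f
    WHERE    f.SALESREP_ID = s.SALESREP_ID
    GROUP BY s.REGION_CODE
"""
for region, total in conn.cursor().execute(sql):
    print(region, total)                  # one summary row per region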

Drill-down to detail


The multidimensional and aggregation indexes that we set up with the client contain pointers back to the actual data in the fact table, so the underlying detail data can be instantly accessed and displayed to the user at any time. This means data can be kept at any level of granularity, quickly aggregated, and then the detail displayed - a capability that is lost when using multidimensional databases or summary tables with relational databases.
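For completeness, here is a hedged sketch of what such a drill-down request might look like over ODBC; the 'WEST' region code, the fact table name, and its join columns are illustrative assumptions.

# Hedged drill-down sketch: once a region and product line have been
# qualified through the indexes, fetch only the matching detail rows
# from the fact table.
import pyodbc

conn = pyodbc.connect("DSN=SALES_MART")   # hypothetical data source name
sql = """
    SELECT f.*
    FROM   INVOICE_LINES f, SALESREPS s, PRODUCTS p
    WHERE  f.SALESREP_ID = s.SALESREP_ID
      AND  f.PRODUCT_ID  = p.PRODUCT_ID
      AND  s.REGION_CODE = 'WEST'
      AND  p.PROD_LINE   = 'B'
"""
for detail_row in conn.cursor().execute(sql):
    print(detail_row)   # each detail record is available for reports or charts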

OMNIDEX Installation

Once the star schema database is designed, the general query needs are determined, and data is loaded, OMNIDEX is ready to be installed. To enhance this client's star schema with OMNIDEX indexes, an "environment catalog" was set up that defines the database layout (like a data dictionary).

Once the environment catalog was created and compiled, an OMNIDEX consultant worked with the client's staff to further analyze the query needs. They first determined what snowflake tables would optimize the query performance for "Group By" fields from large data sets to be used in data aggregations.

Based on the cardinality of the data (the number of possible values per field) and the large size of the dimension tables, snowflake tables were created to benefit this Sales Analysis data mart. (OMNIDEX has a program called SNOWGEN that generates the snowflake tables automatically.)

After the snowflakes were created and added to the environment catalog, the OMNIDEX installation job for the data mart was defined based on the overall query needs. First, all of the fields in the dimension tables were defined as multidimensional keys (MDKs):

CUSTOMERS:
  CUST_NBR, CUST_NAME, PARENT_ID, PARENT_NBR, PARENT_NAME, CUST_TYPE, ACTIVE,
  DUNS_NUMBER, DUNS_RATING, TAX_ID, SIC_CODE, AREA_NBR, PHONE_NBR

ADDRESSES:
  CUST_NBR, CUST_ADDR_NBR, ADDRESS_LINE_1, ADDRESS_LINE_2, ADDRESS_LINE_3,
  CITY_NAME, STATE_PROVINCE_CODE, POSTAL_CODE, COUNTRY_NAME

SALESREPS:
  SALESREP_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, FULL_NAME, OFFICE_CODE,
  DISTRICT_CODE, REGION_CODE

PRODUCTS:
  PRODUCT_ID, SKU, PART_DESC, STATUS, PROD_LINE, PRODUCT_TYPE, PROD_GROUP,
  VENDOR, DIST_PART, QUALITY, INTRINSICS, PACKAGING, ORDER_QTY, MIN_ORD_QTY,
  MULTIPEL_QUT, NEC_BUDGET_1, NEC_BUDGET_2, NEC_BUDGET_CODE, NEC_PART_NBR,
  NEC_PLANNER, NEC_PKG_TYPE, NEC_PROD_LINE, NEC_STATUS, NEC_STD_BUS_UNIT,
  FORECAST_CLASS, DS_ABC_CODE

PERIODS:
  ACTIVITY_DATE, CAL_YEAR, CAL_QUARTER, CAL_MONTH_NAME, CAL_MONTH_NBR,
  CAL_MONTH_ABBREV, CAL_WEEK_NBR, DAY_NBR, DAY_NAME, FISCAL_YEAR,
  FISCAL_QUARTER, FISCAL_WEEK, FISCAL_PERIOD, PUBLIC_HOLIDAY, NON_WORKING_DAY,
  SPECIAL_DAY1, SPECIAL_DAY2

Then the metric fields to be aggregated were specified from the INVOICE LINES fact table:

  Sales qty, List price, COGS, Billed price, Discount given, Gross profit

Seamless Integration

Once the indexes were created and loaded, it was time to install the OMNIDEX Client Server software on each user's PC, create some data source definitions, and invoke a "listener" process on the HP3000 server. High-performance queries then became available to the users through ODBC-compliant analysis and reporting tools.

When a pre-defined or ad hoc query or report issues a request for information, ODBC passes the query through the OMNIDEX search engine and SQL optimizer, which accesses the indexes and passes the qualifying count or data aggregation back to the PC query or reporting tool. The OMNIDEX multidimensional and aggregation indexes allowed a broad range of queries to be optimized with one set of indexes that took much less build time and disk space than the other "standard" data warehousing options.

If the user requests the actual detail data, the pointers for the selected data records are used to quickly retrieve only those records of interest from the sales fact data set. The OMNIDEX SQL optimizer ensures that all queries against a star or snowflake schema are completed in the fastest manner, and never cause slow, I/O-intensive serial reads of the data.

Our Case Study - Step 7 The end user interface and deployment

Once we had clean, certified data in our data mart with indexes installed on it, the Billings data mart was ready to hand over to the VPs and their staff. We needed to choose a business intelligence tool that the users would like. We knew that any ODBC-compliant tool could access an IMAGE database through the OMNIDEX interface. Our client chose BrioQuery because of its short learning curve and ease of use.

Brio Technology's range of products allows users to access the data mart from their desktops or from their web browsers. BrioQuery automatically generates SQL requests, which are sent to OMNIDEX for fast retrieval of the data from the data mart. The data comes back to the user initially in row and column format, ready to be graphed and analyzed.

This sample shows the data requested from the Data Mart on the "Request" line, and below that the 6,999 rows of data that were returned.


Our client chose to develop standard analyses using charts, pivot spreadsheets, and serial reports that cover most of the detail the VPs want to focus on to meet their ROI goals. Fortunately, Brio's products are easy enough to use that an Excel user can create new analyses very quickly.

Here is an example bar chart: the analysis shows how the salespeople have been discounting each quarter. We know that the VP of Sales and the CFO believe they can reduce the discounting rate from an average of 14% to 12% or better if they can see where the problems are. That 2% is $2 million in sales and margin at the end of the year, with no extra effort. We can "drill into" this chart to find out which reps are the worst offenders and whether any customer or product is particularly responsible. Managers can then act to correct the worst offenders.
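As a quick sanity check on that figure, here is a back-of-the-envelope calculation; the annual billings base is not stated in the paper, so the round number used below is an assumption chosen to be consistent with the $2 million claim.

# Hedged back-of-the-envelope check of the discounting figure above.
annual_billings = 100_000_000            # assumed billings base
current_rate, target_rate = 0.14, 0.12   # discount rates from the example
savings = annual_billings * (current_rate - target_rate)
print(f"Annual sales/margin recovered: ${savings:,.0f}")   # -> $2,000,000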

Here's an example of "Green Bar Paper" reporting brought up to date. Note that the output contains interactive graphic analyses that change as the control breaks on the listing change.

Here's an example of a Pivot table for product margin analysis.

The display is color coded to reflect the user's thresholds for "good" and "bad" performance (blue and red). The purple horizontal and vertical axes show the attributes of the dimensions that this analysis uses. The gray tab at the end of each purple axis allows any horizontal dimension to be pivoted to the vertical and vice versa. The OLAP structure that underlies this analysis is then re-computed.

Our Case Study - Step 8 Value of the mart to the end user

It is important as an MIS organization to investigate the perceived and realized benefits provided by the data warehouse or mart. Once the users begin using the warehouse or mart they will ALWAYS ask for changes, and it is important to document the value of the new environment to help prioritize and justify new enhancements and new marts.

Determining the value of the mart can be done through interviews with end users or even through written surveys. This list of sample questions was pulled from a Data Warehousing Institute survey currently being conducted by the Survey Research Center at the University of Georgia.

Strategic Benefits:
How did the mart increase our competitive position?
How did it improve relationships with our customers?
How did it improve relationships with suppliers?
How did it increase our ability to conduct one-to-one marketing?
How did it improve our business processes?
How did it improve cost efficiency?
How did it decrease the cost of supplying information?
How did it decrease the effort of supplying information?
How did it decrease the time to build new applications?

Information Benefits:
How did the mart improve our consistency?
How did the mart improve integration?
How did the mart improve comprehensiveness?
How did the mart improve availability?
How did the mart improve timeliness?
How did the mart improve accuracy?

How did these improvements affect our business? And what value did this have?


Conclusion

You have seen how pain is vital to the success of a data mart project, and how we built our client's data mart.

You have seen that, for the first time, there is a viable, comprehensive solution for creating data warehouses and data marts on an HP3000 from data that was previously locked up in transactional databases on the HP3000 or HP9000. Better yet, the data for this solution can be easily assembled by lower-level MIS personnel and used by - good grief - marketing and executive types to increase the productivity and profitability of your organization.

We recommend that anyone entering the data warehousing arena start by reading the works of Ralph Kimball at a minimum, and attend his workshops. Ralph has codified the principles of data warehousing in a simple and down-to-earth fashion that addresses the business needs rather than the technology. No serious data warehousing project should start before the designers have fully absorbed his principles, particularly in data modeling. Ralph's two main works are:

The Data Warehouse Toolkit
The Data Warehouse Lifecycle Toolkit (ISBN 0-471-25547-5)