
Information Systems Vol. 17, No. 4, pp. 329-342, 1992 0306-4379/92 $5.00 + 0.00

Printed in Great Britain. All rights reserved. Copyright © 1992 Pergamon Press Ltd

A TRANSACTION-ORIENTED APPROACH TO ATTRIBUTE PARTITIONING

PAI-CHENG CHU

Academic Faculty of Accounting and MIS, College of Business, The Ohio State University, 1775 College Road, Columbus, OH 43210, U.S.A.

(Received 19 February 1991; in revised form 16 April 1992)

Abstract-This study introduces the concepts of sufficient and support to design two methods for solving the attribute partitioning problem: MAX and FORWARD SELECTION. These concepts express the simple idea that in order to minimize costs, a segment should contain all the attributes required by some transaction(s) and exclude other attributes. Contrary to all previous studies which without exception treat the attribute as the decision variable and may become computationally infeasible when the number of attributes is large (a condition favoring attribute partitioning), these two methods treat the transaction as the decision variable and their run time is not affected by the number of attributes. Both methods generate excellent results without using complicated mathematics. They also take into account the complexity caused by the interaction between attribute partitioning and access path selection, an issue most previous studies fail to address. Besides, both methods provide detailed information normally not available from other methods. This information provides new insight into the process of attribute partitioning. MAX is for cases involving a modest number of transactions (≤ 15), FORWARD SELECTION for cases involving a large number of transactions.

Key words: Attribute partitioning, physical database design, relational database, access paths, query optimization

1. INTRODUCTION

Attribute partitioning, also known as vertical partitioning or record partitioning, is the process of assigning the attributes of a logical relation to more than one physical segment for storage. The objective of attribute partitioning is to improve system performance by transferring smaller segments in lieu of larger unpartitioned relations between the primary and the secondary storage.

When each attribute is accessed by only one transaction, attribute partitioning is a trivial process: simply partition the attributes by transaction. However, this situation seldom occurs. It is more likely that the attributes accessed by different transactions overlap, resulting in a conflict of interest among transactions. The best partitioning for a particular transaction is not necessarily the best for another transaction. Thus, attribute partitioning often amounts to a trade-off process. The goal is to maximize performance for all transactions taken as a group.

In general, the condition favoring attribute partitioning is one where the number of attributes in a file is large and the number of transactions accessing the file is small. The larger the number of attributes, the greater the need for attribute partitioning. The smaller the number of transactions, the smaller the chance for the conflict of interest among transactions to occur, and the greater the expected payoff. Conversely, when the number of attributes is small and the number of transactions accessing the file is large, there is no great need for attribute partitioning and the expected payoff will be lower due to intensifying conflict of interest among transactions.

Attribute partitioning is a complex problem [1]. Besides being computationally intricate [2, 3], the problem is further complicated by the fact that attribute partitioning and access path selection are not independent processes [1]. This is due to the fact that a partition scheme determines the size of a segment, which, in turn, affects the selection of access paths. In general, file scan tends to be more cost effective than index scan when a segment is smaller (with fewer data pages) and vice versa. This relationship is illustrated by a simple example. Suppose that the total number of logical records in a file, n, is 10,000 and that two partition schemes create two alternative segments differing in size (1000 pages vs 200 pages). Assume that each of these segments contains all the attributes referenced by a query that accesses 8% of the total records. Using a computational formula such


as one proposed by Cardenas [4], we compute the expected numbers of data page accesses for the two cases as 550 (55%) and 197 (99%) respectively. To lower cost, the data should be accessed via index scan rather than file scan in the first case. In the second case, file scan should be used. This indicates that the access path for a transaction should not be chosen independently of a partition scheme, but should vary with it.
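The comparison above can be reproduced with Cardenas' formula, which estimates the expected number of distinct pages touched when k records are selected at random from an m-page file. A minimal sketch (the function name is ours, not from the paper):

```python
def cardenas(m: int, k: int) -> float:
    """Cardenas' formula: expected number of distinct pages touched
    when k records are drawn at random from a file of m pages."""
    return m * (1.0 - (1.0 - 1.0 / m) ** k)

# The example in the text: a query selecting 8% of n = 10,000 records
# (k = 800) against segments of 1000 pages and 200 pages.
large = cardenas(1000, 800)   # roughly 550 pages, about 55% of the segment
small = cardenas(200, 800)    # roughly 197 pages, about 99% of the segment
```

Since an index scan of the 1000-page segment touches only about half of it, the index wins there; the 200-page segment is read almost in its entirety either way, so a file scan is cheaper.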

The need to vary the access path with the partition scheme means that the cost function for a transaction cannot be pre-specified, because the cost functions for different access paths vary (see Section 3 for additional discussion on this point). This complexity makes it impossible to apply optimization techniques to solving the problem, since optimization techniques require cost functions to be defined a priori. To model the problem realistically requires the development of heuristic procedures [3, 5, 6].

This paper introduces the concepts of sufficient and support. Intuitively, attribute partitioning can be viewed as the process of creating new storage space to which some of the attributes are to be moved from the previously unpartitioned area. The concept of sufficient means that the new storage area should contain all the attributes needed by some transaction(s). The objective is to prevent accessing two or more segments in executing these transactions. The concept of support means that a storage space should minimize extraneous attributes not needed by the transactions sufficient in it so that these transactions can be executed efficiently. These concepts shift the focus from attributes to transactions. After all, files exist to support transactions.

Based on these concepts, two attribute partitioning procedures are developed: the MAX procedure and the FORWARD SELECTION procedure. These procedures are inspired by methods used in regression analysis. Both procedures are easy to understand and implement. They achieve excellent results without making use of complicated mathematics. In addition, they have the following advantages:

1. Both procedures produce excellent results. A validation test, which is described in Section 4, indicates that the MAX procedure captures 99.9% and the FORWARD SELECTION procedure captures 93% of the maximum possible improvement from partitioning.

2. Both procedures address the problem caused by the lack of independence between access path selection and attribute partitioning. Neither procedure requires pre-specifying an access path for a transaction. Instead, both procedures dynamically identify the optimal access path for a transaction under a particular partition scheme.

3. Both procedures can be efficiently executed even when the number of attributes in a file is large (a condition that favors attribute partitioning). When the number of attributes is large, all previous procedures may become computationally infeasible, because they all treat the attribute as the decision variable. The two proposed procedures treat the transaction as the decision variable.

4. Both procedures provide detailed information normally not available from other studies. This information provides logical explanations of the process of attribute partitioning. Attribute partitioning ceases to be a "black-box" process incapable of generating intuitive explanations, as it was previously held to be [7, p. 635].

The remainder of this paper is organized as follows. Section 2 reviews prior work in attribute partitioning. In Section 3, the two procedures are presented. Section 4 evaluates their performance. Section 5 concludes the paper.

2. REVIEW OF PRIOR STUDIES

In this section, a brief review of prior studies on attribute partitioning is provided. The reader is referred to March [l] for an excellent discussion of the attribute partitioning problem in general and a detailed review of studies before 1983. Navathe et al. [6] and De et al. [8] include a review of more recent studies.

We classify previous studies into three categories according to their treatment of the interaction between access path selection and attribute partitioning. The studies in the first category do not address the complexity caused by the interaction between attribute partitioning and access path selection. In this category are [3, 6, 7, 9, 10, 16]. Each of these studies is briefly reviewed below.


In [9], Hoffer developed the first nonlinear zero-one integer programming formulation of the attribute partitioning problem. The formulation was comprehensive in terms of the costs accounted for. However, in a later study Hoffer and Severance [3] pointed out that such a formulation is computationally intractable.

In [7], Eisner and Severance studied a special case where a logical relation was divided into two segments, one of which (the primary subfile) was accessed by all transactions and the other (the secondary subfile) was accessed by a transaction only when it needed an attribute or attributes not contained in the primary subfile. The model assumed that access to the primary subfile was either all random with both subfiles unblocked or all sequential with a blocked primary subfile and an unblocked secondary subfile. In a follow-up study, March and Severance [10] extended this model by incorporating blocking factors for both subfiles. In both studies, the Ford-Fulkerson algorithm was used to solve the problem. The need for determining access paths dynamically was not addressed. This approach is also limited by the requirement that a transaction must first access the primary subfile even when that subfile contains no attribute needed by the transaction.

Conceding that solving the attribute partitioning problem by integer programming techniques was computationally intractable when the number of attributes is large, Hoffer and Severance [3] proposed a cluster analysis technique. The bond energy algorithm and subjective judgment were used to group the attributes into a smaller number of clusters according to the affinity among attributes. This smaller number of clusters was then used as the decision variables so as to reduce computation. Navathe et al. [6] extended the work of Hoffer and Severance by proposing heuristic algorithms to reduce the subjective element in defining the clusters. Navathe and Ra [16] employed a graphical technique that further reduces the arbitrariness of the method. However, they computed costs on the basis of the number of records rather than the number of data pages accessed. Thus, the issue of access paths was not addressed. Furthermore, the overall clustering approach is subject to the criticism that it focuses on the affinity between a pair of attributes, which may not capture the affinity among larger groups of attributes [5].

The studies in the second category recognize the need to address the complexity caused by the interaction between attribute partitioning and access path selection and try to solve the problem by pre-fixing an access path for each transaction prior to partitioning. In this category are [2] and [8]. Both studies aimed at developing optimal algorithms that relaxed the requirement of [7] and [10] that one segment must be accessed by all transactions. However, as pointed out earlier, the optimal access path for a transaction cannot be determined prior to partitioning but rather varies with the partition scheme. Pre-selecting an access path for a transaction is not entirely feasible.

The studies in the third category are capable of selecting the optimal access path that varies with a partition scheme. In this category are [5] and [11]. In [5], Hammer and Niamir developed "hill climbing" heuristics to group and re-group attributes in a bottom-up fashion in search of good partition schemes. March and Scudder [11] extended [5] to include the consideration of backup and recovery. However, [6] questioned the viability of conducting a bottom-up search, because the optimal solution is closer to the group composed of all attributes than to groups that are single-attribute partitions. More importantly, it was pointed out in [8] that the quality of the solutions generated by these heuristics had not been established.

It is noteworthy that all the above studies employ the attribute as the decision variable. When the number of attributes is large, a condition favoring attribute partitioning, these methods may become computationally intractable. In fact, the clustering approach [3, 6] represents an effort to tackle this difficulty.

Other related studies include Chang and Cheng [12] that discussed both horizontal and vertical partitioning, Schkolnick [13] that considered the problem of attribute partitioning with an IMS-type hierarchical structure, and Batory [14] that discussed minimizing the cost of queries for partitioned files.

3. THE TRANSACTION ORIENTED APPROACH

This section presents two procedures for solving the attribute partitioning problem: MAX and FORWARD SELECTION. The development of these two procedures is inspired by two


methods that statisticians use for regression analysis, with which attribute partitioning shares some common basic features and computational complexity. MAX corresponds to the RSQUARE procedure [15, pp. 711-724], FORWARD SELECTION to the FORWARD STEPWISE procedure [15, pp. 763-774] in regression analysis. Of the two procedures, MAX is more comprehensive in its search and generates better solutions. FORWARD SELECTION applies a greedy heuristic to reduce the search effort and is more efficient. Both procedures treat the transaction as the decision variable.

Like most previous methods [2, 6, 7, 8, 10], these procedures generate a binary partitioning initially. They can be applied recursively to further partition a segment according to the strategy suggested by [6, pp. 698-700] and [2]. As in all previous studies, it is assumed that a set of transactions against a logical relation is defined a priori, that an attribute is accessed by at least one transaction, and that the following information about a transaction is available: (1) the frequency of occurrence of the transaction per unit time period; (2) the subset of the attributes accessed by the transaction; and (3) the number of logical records selected by one occurrence of the transaction (record selectivity).

Segments are stored as an ordered set of contiguous tuples. Each segment has a tuple for each record occurrence of the original relation. Tuples are of fixed length, where all values of an attribute require the same amount of storage. A segment resides on a direct-access device such as a disk, which is divided into fixed-size blocks called pages. A tuple is identified by a tuple identifier (TID) which has two components: page number and offset. A tuple in a segment is stored in the same relative position as its counterpart in the other segment, so that once a tuple in a segment is located, its counterpart in the other segment can be accessed directly based on the TID.
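Because tuples occupy the same relative position in every segment, the page holding a record's counterpart can be computed directly from its index. A minimal sketch with hypothetical sizes (the helper name and the figures are ours):

```python
def page_of(tuple_index: int, tuple_length: int, page_size: int) -> int:
    """Page number of the i-th fixed-length tuple in a segment whose
    tuples are stored contiguously (no tuple spans a page boundary)."""
    tuples_per_page = page_size // tuple_length
    return tuple_index // tuples_per_page

# Hypothetical layout: 4000-byte pages; one 28-byte-tuple segment and
# one 166-byte-tuple segment of the same relation.  Logical record 5000
# sits at the same relative position in both, so either page can be
# computed without consulting the other segment.
narrow_page = page_of(5000, 28, 4000)
wide_page = page_of(5000, 166, 4000)
```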

A page is the unit of transfer between secondary and primary memory. All known access paths are supported. These include but are not limited to file scan (full sequential search), clustered-index scan (primary-index scan), and non-clustered-index scan (secondary-index scan).

3.1. The MAX procedure

Two key terms are used to describe the procedures: sufficient and support. A transaction is said to be sufficient in a segment if that segment contains all the attributes required by that transaction. A segment is said to support a transaction or a group of transactions if (1) the transaction(s) is(are) sufficient in the segment and (2) the segment does not contain any attribute that is not required by any of the transactions sufficient in the segment. Let a be an attribute, S be the set of attributes in a segment and A be the union of the attributes required by the transaction(s) sufficient in the segment. The concept of support is defined as: a ∈ S if and only if a ∈ A; that is, S = A.

Without partitioning, all attributes reside in one unpartitioned space. Binary attribute partitioning can be viewed as the process of creating a new segment or a new storage space to which some of the attributes are to be moved from the previously unpartitioned area.

The two heuristic procedures proposed here are based on the ideas (1) that a good partition scheme will result from putting the new segment to an efficient use and (2) that the new segment is put to an efficient use when it singles out a transaction or a group of transactions expressly for support. The fact that a transaction or a group of transactions are sufficient in the segment leads to efficient execution of this(these) transaction(s), since there is no need to access the other segment to complete the transaction. The absence of unneeded attributes in the segment minimizes the size of the segment and further enhances the efficiency in executing the supported transaction(s). There are two situations where the new segment fails to support a transaction or a group of transactions. The first case is that of under-support. It means that the attributes moved into the segment do not make any transaction sufficient in that segment. The second case is that of over-support, where the segment contains attribute(s) not needed by the transaction(s) that are made sufficient in that segment. Under-support and over-support for the new segment are expected to be suboptimal in most cases.
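These definitions reduce to simple set tests. The sketch below (function names ours) classifies a candidate new segment as supporting, under-supporting or over-supporting a given transaction set:

```python
def is_sufficient(txn_attrs: set, segment: set) -> bool:
    """A transaction is sufficient in a segment iff the segment
    contains every attribute the transaction requires."""
    return txn_attrs <= segment

def classify(segment: set, transactions: dict) -> str:
    """'support' if some transaction is sufficient and the segment has
    no extraneous attributes; else 'under-support' or 'over-support'."""
    sufficient = [t for t, attrs in transactions.items()
                  if is_sufficient(attrs, segment)]
    if not sufficient:
        return "under-support"    # no transaction made sufficient
    needed = set().union(*(transactions[t] for t in sufficient))
    if segment - needed:
        return "over-support"     # extraneous attribute(s) present
    return "support"

# Transactions 1 and 2 of Case 1 (attribute sets given in Section 3.1.2).
txns = {"Tx1": {1, 2, 3, 4}, "Tx2": {3, 5, 6, 7}}
```

Here `classify({3, 5, 6, 7}, txns)` yields "support" (Tx2 is sufficient and nothing is extraneous); dropping attribute 3 yields "under-support", while adding attribute 8 yields "over-support".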

The problem to be solved is which transaction(s) should be supported by the new segment. The MAX procedure solves the problem by maximizing the following value:

v = α + β - φ

Table 1. Major solution steps delineated in Pascal

Procedure A: initialize;
Procedure B: compute_parameters;
Procedure C: determine_transaction_type;
Procedure D: compute_cost_and_determine_access_path;
Procedure E: output_the_best_of_t_transaction_model;

begin (* main *)
  A: initialize;
  find the best one-transaction model*;
  find the best two-transaction model*;
  ...
  find the best t-transaction model*
end. (* main *)

*Each of these involves iterations through Procedures B, C and D, and one pass through Procedure E.

where α is the reduction in costs over the unpartitioned case for the transaction(s) supported by the new segment, β is the reduction in costs over the unpartitioned case for the transaction(s) not supported by the new segment, and φ is the increase in costs over the unpartitioned case for the transaction(s) not supported by the new segment. β results from reducing the size of the other segment and thus lowering the costs of executing those transactions that are sufficient in that segment. φ is the result of having to access both segments for transactions that are not sufficient in either segment.

The MAX procedure systematically searches for a solution that maximizes v, just as the RSQUARE procedure [15, pp. 711-724] in multiple regression analysis maximizes R squared. To understand the MAX procedure, it is helpful to visualize two segments, one of which, called the target segment, is vacant prior to the partitioning process. The other segment is the holding place for the attributes not in the target segment. Prior to partitioning, all attributes reside in this segment. Between these two segments, attributes move as a group. The procedure begins with evaluating the one-transaction model. In a one-transaction model, the target segment supports only one transaction. This is done by moving to the target segment all the attributes accessed by the transaction. The attributes not required by the transaction remain in the other segment. The resultant partition scheme is evaluated by computing the costs of all transactions against it. In computing the cost of a transaction, the most favorable access path for that transaction is identified and used. Thus, instead of pre-specifying the access path, the procedure identifies the best access path, which varies with partition schemes. After all transactions have taken turns in "entering" the target segment, the best one-transaction model is identified. The same process is used to identify the best two-transaction model, and so on. If there are t transactions, at the end, t best cases will be identified, from which the overall best case is derived.
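The search just described can be sketched as follows. This is a simplified stand-in, not the paper's program: the toy cost model below merely charges a full scan of every segment a transaction touches, whereas the actual procedure uses the cost formulas of Section 3.1.1 and picks the cheapest access path per transaction.

```python
from itertools import combinations
from math import ceil

def max_procedure(transactions, lengths, n, page_size):
    """Sketch of the MAX search: every subset of transactions takes a
    turn as the group supported by the target segment, the resulting
    partition is costed, and the cheapest scheme is kept."""
    def pages(seg):                       # segment size in data pages
        return ceil(n * sum(lengths[a] for a in seg) / page_size)

    def txn_cost(attrs, seg1, seg2):      # crude stand-in cost model:
        return sum(pages(s) for s in (seg1, seg2) if attrs & s)

    all_attrs = set().union(*transactions.values())
    best = None
    names = sorted(transactions)
    for r in range(1, len(names) + 1):    # one-transaction models, then two, ...
        for group in combinations(names, r):
            seg1 = set().union(*(transactions[t] for t in group))
            seg2 = all_attrs - seg1
            cost = sum(txn_cost(transactions[t], seg1, seg2) for t in names)
            if best is None or cost < best[0]:
                best = (cost, group)
    return best

# Toy relation (ours): three transactions, attribute byte-lengths,
# 1000 records, 4000-byte pages.
toy = max_procedure({"Tx1": {1, 2}, "Tx2": {2, 3}, "Tx3": {4}},
                    {1: 8, 2: 8, 3: 4, 4: 20}, n=1000, page_size=4000)
```

Under this toy cost model the best scheme isolates Tx3's wide attribute 4 in its own segment, so Tx1 and Tx2 scan a smaller remainder.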

When two segments are stored in the same storage device, there is no need to differentiate the two segments, since one is the mirror image of the other. In the following discussion, the two segments are referred to as “the first segment” and “the second segment” mainly to facilitate discussion.

3.1.1. The algorithm. The solution algorithm was incorporated in a Pascal program. The major modules of the program are presented in Table 1.

Procedure A inputs the required parameters. Procedure B computes values for two parameters, m1 and m2, the sizes of the two segments in terms of data pages. An Attribute-status array is used to facilitate computing m1 and m2. A value of 1 in a cell of the array indicates that the attribute is in the first segment; a value of 0 indicates that the attribute is in the second segment. Thus, in a one-transaction model, if Transaction X is to be supported in the first segment and this transaction needs attributes 3, 5, 6 and 7, then the Attribute-status array has the following values:

attribute: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

value: 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
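The array above can be produced mechanically; a minimal sketch (function name ours):

```python
def attribute_status(first_segment_attrs, num_attributes):
    """Attribute-status array: 1 if the attribute is assigned to the
    first (target) segment, 0 if it remains in the second segment."""
    return [1 if a in first_segment_attrs else 0
            for a in range(1, num_attributes + 1)]

# Transaction X needs attributes 3, 5, 6 and 7, as in the example above.
status = attribute_status({3, 5, 6, 7}, 20)
```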


Table 2. Transaction profile for Case 1 (n = 10,000)

       f     k/n    SA*
Tx1   50    0.01     1
Tx2   10    0.05     3
Tx3    5    0.01     4
Tx4    5    0.01    11
Tx5   30    0.01    12
Tx6   15    0.01    16

*SA = scan attribute. The attribute columns of the original table mark with a "1" each of the 20 attributes a transaction accesses (e.g. Tx1 accesses attributes 1-4; Tx2 accesses attributes 3, 5, 6 and 7).

Similarly, in a two-transaction model, if Transaction X and Transaction Y are to be supported in the first segment and Transaction Y needs attributes 11, 16 and 20, then the Attribute-status array has the following values:

attribute: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

value: 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0 1

Procedure C determines the type of a transaction under a particular partition scheme. A transaction falls into one of three categories: it either is sufficient in the first segment (f), is sufficient in the second segment (s), or has to access both segments to complete its operation (b). The checking is done by comparing the list of attributes required by a transaction against the Attribute-status array.
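Procedure C's check can be sketched directly from the Attribute-status array (names ours):

```python
def transaction_type(txn_attrs, status):
    """'f' if sufficient in the first segment, 's' if sufficient in the
    second, 'b' if both segments must be accessed.  status[a-1] is 1
    when attribute a sits in the first segment."""
    in_first = {a for a in txn_attrs if status[a - 1] == 1}
    in_second = txn_attrs - in_first
    if not in_second:
        return "f"
    if not in_first:
        return "s"
    return "b"

# One-transaction model of Case 1: the first segment holds {3, 5, 6, 7}.
status = [1 if a in {3, 5, 6, 7} else 0 for a in range(1, 21)]
```

With this scheme, Transaction 2 (attributes 3, 5, 6, 7) is type f, while Transaction 1 (attributes 1-4) needs attribute 3 from the first segment and the rest from the second, so it is type b.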

Procedure D computes the costs of different access paths for a transaction under a particular partition scheme. The path giving the least cost is selected. The sum of the least costs for all transactions under a particular partition scheme is retained in memory for determining the best t-transaction case. The costs are expressed in terms of the number of disk accesses. Other costs can be easily incorporated into the cost model, if necessary. Three types of access paths are evaluated: clustered-index, non-clustered-index and sequential scan. For clustered index, the number of disk accesses is expressed as follows:

Number of accesses = (k × length of tuple)/block size, (1)

where k stands for the number of records qualifying in a query. For non-clustered index,

Number of accesses = Number of data page accesses + Number of index page accesses, (2)

where the number of data accesses is computed by using Cardenas’ formula [4]. For sequential scan,

Number of accesses = (n × length of tuple)/(block size × prefetch blocking factor), (3)

where n is the total number of records in the file (a sequential scan reads every page of the segment) and the prefetch blocking factor is the number of pages read in each disk access. Throughout this paper, the prefetch blocking factor is assumed to be 1.

3.1.2. Examples. This solution procedure is applied to an example referred to as Case 1. The following data describe the relation considered in this case.

Number of attributes: 20
Lengths of attributes: 8 8 8 8 4 8 8 12 20 22 4 8 6 5 3 30 12 8 6 6
Cardinality: 10,000
Size of data page: 4,000 bytes
Number of transactions accessing the relation: 6

The other information about this case is given in Table 2, where f denotes the frequency of accesses to the relation in a given period of time and n is the total number of records in the logical relation. Thus, k/n indicates record selectivity. SA stands for scan attribute, i.e. the attribute appearing in the selection statement of a query. A "1" in the table indicates that a particular attribute is needed by a transaction. Thus, Transaction 1 requires attributes 1, 2, 3, 4; Transaction 2 requires attributes 3, 5, 6, 7; and so on. It is assumed that the clustered index is built on Attribute 1 and that all other scan attributes are supported by non-clustered indices.

The solution for Case 1 is presented in Table 3. As shown in the table, the best solution is to support one transaction in the first segment: Transaction 2. The total number of disk accesses under this solution is 5950, representing a 30% improvement over the unpartitioned case.


Table 3. Solution for Case 1

*Besf solution for 1 transaction in first segment *

Transaction 2 in first segment m,=70; m,=415

Attributes in first segment: 3 5 6 7 Attributes in second segment: 1 2 4 8 9 10 11 12 13 14 15 16 17 18 19 20 Transaction 1 is type b; path = primary index; cost = 300 Transaction 2 is type f, path = sequential scan; cost = 700 Transaction 3 is type s; path = secondary index; cost = 450 Transaction 4 is type s; path = secondary index; cost = 450 Transaction 5 is type S; path = secondary index; cost = 2700 Transaction 6 is type S; path = secondary index; cost = 1350 Total cost = 5950

*Best solution for 2 transactions in first segment’

Transaction I, 2 in first segment m,=l30; m,=355

Attributes in first segment: I 2 3 4 5 6 7 Attributes in second segment: 8 9 IO 11 12 13 14 15 16 17 18 19 20 Transaction I is type f, path = primary index; cost = 100 Transaction 2 is type f; path = secondary index; cost = 1300 Transaction 3 is type b; path = secondary index; cost = 800 Transaction 4 is type s; path = secondary index; cost = 445 Transaction 5 is type S; path = secondary index; cost = 2670 Transaction 6 is type s; path = secondary index; cost = 1335 Total cost = 6650

*Best solution for 3 transactions in &xl segment l

Transaction 3.5.6 in first segment m, = 355; mz = 130

Attributes in first segment: 4 8 9 10 I1 12 13 16 17 18 19 20 Attributes in second segment: I 2 3 5 6 7 14 15 Transaction 1 is type b; path = primary index; cost = 300 Transaction 2 is type S; path = secondary index; cost = 1300 Transaction 3 is type f; path = secondary index; cost = 445 Transaction 4 is type b; path = secondary index; cost = 800 Transaction 5 is type f; path = secondary index; cost = 2670 Transaction 6 is type S; path = secondary index; cost = 1335 Total cost = 6850

*Best solution .for 4 rransncrions in first segmem *

Transaction 3,4,5,6 in first segment m, =375; m2= 110

Attributes in first segment: 4 8 9 IO II 12 13 14 15 16 17 18 19 20 Attributes in second 1 2 segment: 3 5 6 7 Transaction 1 is type b; path = primary index; cost = 300 Transaction 2 is type s; path = sequential scan; cost = 1100 Transaction 3 is type f; path = secondary index; cost = 445 Transaction 4 is type f; path = secondary index; cost = 445 Transaction 5 is type f; path = secondary index; cost = 2670 Transaction 6 is type f; path = secondary index; cost = 1335 Total cost = 6295

*Best solution for 5 transactions in first segment*

Transactions 1, 2, 4, 5, 6 in first segment; m1 = 350; m2 = 135

Attributes in first segment: 1 2 3 4 5 6 7 11 12 13 14 15 16 17 18 19 20
Attributes in second segment: 8 9 10
Transaction 1 is type f; path = primary index; cost = 200
Transaction 2 is type f; path = secondary index; cost = 2690
Transaction 3 is type b; path = secondary index; cost = 805
Transaction 4 is type f; path = secondary index; cost = 445
Transaction 5 is type f; path = secondary index; cost = 2670
Transaction 6 is type f; path = secondary index; cost = 1335
Total cost = 8145

*The unpartitioned case*

m1 = 485; m2 = 0

Attributes in first segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Attributes in second segment:
Transaction 1 is type f; path = primary index; cost = 250
Transaction 2 is type f; path = secondary index; cost = 3150
Transaction 3 is type f; path = secondary index; cost = 460
Transaction 4 is type f; path = secondary index; cost = 460
Transaction 5 is type f; path = secondary index; cost = 2760
Transaction 6 is type f; path = secondary index; cost = 1380
Total cost = 8460

336 PAI-CHENG CHU

Table 4. Transaction profile for Case 2 (n = 10,000)

       f    k/n   SA*   Attributes accessed
Tx1   50   1.00    1    1 1 1 1
Tx2   10   0.05    3    1 1 1 1
Tx3    5   0.01    4    1 1 1 1
Tx4    5   0.01   11    1 1 1 1 1 1 1
Tx5    5   0.40   12    1 1 1 1 1 1
Tx6   15   0.01    8    1 1 1 1

*SA = scan attribute.

The information generated by the procedure provides insight into the partitioning process. The rationale for the solution can be discerned from the following cost comparisons:

                 The solution   Unpartitioned case
Transaction 1         300              250
Transaction 2         700             3150
Transaction 3         450              460
Transaction 4         450              460
Transaction 5        2700             2760
Transaction 6        1350             1380

All transactions except Transaction 1 benefit from the partition. This is due to the fact that the partition scheme makes all transactions except Transaction 1 sufficient in one of the two segments. However, Transaction 1 has a low record selectivity. Also, its scan attribute is the clustered index. This means that all the required records are packed together in a few pages. Thus, the cost of executing Transaction 1 does not increase significantly even though completing the transaction requires accessing both segments. On the other hand, Transaction 2 has the largest record selectivity and its scan attribute is not the clustered index. These conditions make it the most expensive transaction to execute in the unpartitioned case. It is fitting that in the best solution the first segment is made to support Transaction 2, thereby drastically reducing its cost and the overall cost.

This case also provides further evidence that it is infeasible to pre-fix an access path for a transaction. Note that the access path selected for Transaction 2 in the best one-transaction model is sequential scan. In contrast, the access path selected for the same transaction in the best two-transaction model is secondary index. The difference stems from the value of m1: in the first case, m1 = 70; in the second case, m1 = 130. As explained in Section 1, a larger m value favors the use of an index. Thus, in the first case sequential scan is more advantageous than access by non-clustered index, whereas in the second case the opposite is true.
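The trade-off described above can be sketched in a few lines. The following toy cost model is an illustrative assumption, not the paper's equations (1)-(3): it charges a sequential scan one access per page of the segment and a non-clustered index lookup roughly one access per selected record, so a larger segment length m tips the choice toward the index.

```python
def choose_path(m, n, k, page_size=1000):
    """Toy illustration of access path selection (hypothetical cost model).

    m: segment record length in bytes, n: number of records,
    k: number of records selected, page_size: page size in bytes.
    All formulas and parameter values here are illustrative assumptions.
    """
    scan_cost = n * m / page_size  # a sequential scan reads every page of the segment
    index_cost = k                 # ~one page access per selected record
    return "sequential scan" if scan_cost < index_cost else "secondary index"
```

Under these assumed parameters (n = 10,000 records, k = 1000 selected, 1000-byte pages), the choice flips between a segment length of 70 and one of 130, analogous to the flip observed for Transaction 2.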

Two additional examples are given below. First, consider Case 2 whose transaction profile is presented in Table 4. Attribute 8 is the clustered index. All other scan attributes are supported by non-clustered indices. All other parameters are the same as those in Case 1. These include the number of attributes, lengths of attributes, cardinality, block size and the number of different transactions. Note that Transaction 1 has the highest frequency of execution, that it also has the highest record selectivity, and that its scan attribute is not the clustered index. These conditions make Transaction 1 the dominating transaction.

Logically, a good strategy should focus on reducing the cost of the dominating transaction, because focusing on reducing the cost of any other transaction will not generate as great a payoff. The solution recommended by the MAX procedure (see Table 5) follows exactly this strategy. The best solution is to support Transaction 1 in the first segment. The following is a cost comparison between the recommended solution and the unpartitioned case:

                 The solution   Unpartitioned case
Transaction 1        4000            24250
Transaction 2        1490              920
Transaction 3         745              460
Transaction 4         450              460
Transaction 5        2425             2425
Transaction 6          75               75


Table 5. Solution for Case 2


*Best solution for 1 transaction in first segment*
Transaction 1 in first segment; m1 = 80; m2 = 405

Attributes in first segment: 1 2 3 4
Attributes in second segment: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Transaction 1 is type f; path = sequential scan; cost = 4000
Transaction 2 is type b; path = secondary index; cost = 1490
Transaction 3 is type b; path = secondary index; cost = 745
Transaction 4 is type s; path = secondary index; cost = 450
Transaction 5 is type b; path = sequential scan; cost = 2425
Transaction 6 is type s; path = primary index; cost = 75
Total cost = 9185

*Best solution for 2 transactions in first segment*
Transactions 1, 2 in first segment; m1 = 130; m2 = 355

Attributes in first segment: 1 2 3 4 5 6 7
Attributes in second segment: 8 9 10 11 12 13 14 15 16 17 18 19 20
Transaction 1 is type f; path = sequential scan; cost = 6500
Transaction 2 is type f; path = secondary index; cost = 710
Transaction 3 is type b; path = secondary index; cost = 800
Transaction 4 is type b; path = secondary index; cost = 800
Transaction 5 is type b; path = secondary index; cost = 2425
Transaction 6 is type s; path = primary index; cost = 60
Total cost = 11295

*Best solution for 3 transactions in first segment*
Transactions 1, 2, 4 in first segment; m1 = 195; m2 = 290

Attributes in first segment: 1 2 3 4 5 6 7 11 12 13 14 15
Attributes in second segment: 8 9 10 16 17 18 19 20
Transaction 1 is type f; path = sequential scan; cost = 9750
Transaction 2 is type f; path = secondary index; cost = 800
Transaction 3 is type b; path = secondary index; cost = 830
Transaction 4 is type f; path = secondary index; cost = 400
Transaction 5 is type b; path = sequential scan; cost = 2425
Transaction 6 is type b; path = primary index; cost = 75
Total cost = 14280

*Best solution for 4 transactions in first segment*
Transactions 1, 2, 4, 5 in first segment; m1 = 260; m2 = 225

Attributes in first segment: 1 2 3 4 5 6 7 11 12 13 14 15 17 18 19
Attributes in second segment: 8 9 10 16 20
Transaction 1 is type f; path = sequential scan; cost = 13000
Transaction 2 is type f; path = secondary index; cost = 850
Transaction 3 is type b; path = secondary index; cost = 835
Transaction 4 is type f; path = secondary index; cost = 425
Transaction 5 is type f; path = sequential scan; cost = 1300
Transaction 6 is type b; path = primary index; cost = 90
Total cost = 16500

*Best solution for 5 transactions in first segment*
Transactions 1, 2, 4, 5, 6 in first segment; m1 = 380; m2 = 105

Attributes in first segment: 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 19 20
Attributes in second segment: 9 10
Transaction 1 is type f; path = sequential scan; cost = 19000
Transaction 2 is type f; path = secondary index; cost = 900
Transaction 3 is type b; path = secondary index; cost = 780
Transaction 4 is type f; path = secondary index; cost = 450
Transaction 5 is type f; path = sequential scan; cost = 1900
Transaction 6 is type f; path = primary index; cost = 60
Total cost = 23090

*The unpartitioned case*
m1 = 485; m2 = 0

Attributes in first segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Attributes in second segment:
Transaction 1 is type f; path = sequential scan; cost = 24250
Transaction 2 is type f; path = secondary index; cost = 920
Transaction 3 is type f; path = secondary index; cost = 460
Transaction 4 is type f; path = secondary index; cost = 460
Transaction 5 is type f; path = sequential scan; cost = 2425
Transaction 6 is type f; path = primary index; cost = 75
Total cost = 28590


As compared with the unpartitioned case, a saving of 19405 disk accesses is realized, representing a 68% improvement. The saving stems mainly from reducing the cost of Transaction 1.

Case 3 is derived from Case 2 with one modification: the frequency of execution for Transaction 5 is 30 instead of 5. This change makes Transaction 5 a dominating transaction too. The solution for this case is presented in Table 6. The best solution is the two-transaction model, with Transaction 1 and Transaction 5, the two dominating transactions, supported by the first segment. The underlying logic of this solution can be appreciated by comparing the two-transaction model with the one-transaction and three-transaction models (see Table 6). In the one-transaction model, the best solution is to support Transaction 1 in the first segment. The reason is that Transaction 1 is the most frequently executed transaction with a large record selectivity. Thus, it is most beneficial to support Transaction 1 when only one transaction is to be supported. However, this arrangement is not good for Transaction 5, the other dominating transaction, which has to access both segments. In the best two-transaction model, Transaction 5 along with Transaction 1 is supported by the first segment. The cost of executing Transaction 5 is drastically reduced because there is no longer a need to access both segments. This reduction more than offsets the increase in the cost of executing Transaction 1 brought about by the increase of the m1 value from 80 to 180. Moving onward to the three-transaction model and beyond does not pay, because further increases of the m1 value increase the costs of executing the two dominating transactions. These increases cannot be adequately compensated by the cost reductions from the less frequently executed transactions.

3.2. The FORWARD SELECTION procedure

The MAX procedure employs equations (1)-(3) to compute transaction costs. It is important to note that the number of attributes is not a factor in these formulas. This is because, unlike all previous studies, which without exception treat the attribute as the decision variable, the MAX procedure treats the transaction as the decision variable. As a result, the run time of the MAX procedure does not depend on the number of attributes, as that of all previously proposed methods does. This represents a significant benefit considering that attribute partitioning is needed precisely when there are a large number of attributes. Our experience indicates that solving a problem involving 100 attributes and 10 transactions takes about 15 sec of CPU time on a Prime 9955 minicomputer.

While the run time of the MAX procedure does not depend on the number of attributes, it is a function of t, the number of transactions accessing the relation. A total of 2^t - 1 models are fitted. When t is large, the procedure may not be feasible due to its long running time. There is thus a need for a procedure to handle problems involving a large t.
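The enumeration at the heart of the MAX procedure can be sketched as follows. The cost model is abstracted into a caller-supplied `partition_cost` function, a hypothetical stand-in for equations (1)-(3) together with access path selection; the procedure itself simply evaluates all 2^t - 1 non-empty subsets of transactions as candidates for support by the first segment.

```python
from itertools import combinations

def max_procedure(transactions, partition_cost):
    """Exhaustive sketch of the MAX procedure: evaluate every non-empty
    subset of transactions as the set supported by the first segment
    (2**t - 1 models for t transactions) and keep the cheapest.

    partition_cost(subset) is a hypothetical stand-in for the paper's
    cost model (equations (1)-(3) plus access path selection).
    """
    best_subset, best_cost = None, float("inf")
    for k in range(1, len(transactions) + 1):       # k-transaction models
        for subset in combinations(transactions, k):
            cost = partition_cost(subset)
            if cost < best_cost:
                best_subset, best_cost = subset, cost
    return best_subset, best_cost
```

The exponential number of subsets is exactly why the procedure is recommended only for a modest t.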

The FORWARD SELECTION procedure is developed to handle the cases of large t. In this procedure, the best one-transaction model is found as in the MAX procedure. However, the transaction that "enters" the target segment (i.e. comes to be supported by that segment) in the best one-transaction model stays in the model. The two-transaction model is derived by adding another transaction from the pool of t - 1 transactions that are not yet in the model. The transaction that produces the largest v is added and retained in the model. Transactions are thus added to the model one by one until the target segment supports all transactions. At the end, there are t models, from which the best model is selected. The solution for Case 2 presented in Table 5 exemplifies a solution that would have been derived from the FORWARD SELECTION procedure. The 6 candidate solutions are:

                            Transactions supported by first segment
One-transaction model       1
Two-transaction model       1, 2
Three-transaction model     1, 2, 4
Four-transaction model      1, 2, 4, 5
Five-transaction model      1, 2, 4, 5, 6
Six-transaction model       1, 2, 4, 5, 6, 3

Note that each succeeding model incorporates the previous model. At each step, a new transaction is added.
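The greedy search above can be sketched as follows, again with the cost model abstracted into a hypothetical `partition_cost` function. Once a transaction enters the model it stays, so only t + (t - 1) + ... + 1 = t(t + 1)/2 models are fitted.

```python
def forward_selection(transactions, partition_cost):
    """Greedy sketch of the FORWARD SELECTION procedure: grow the set of
    transactions supported by the first segment one transaction at a
    time, always adding the transaction yielding the cheapest model,
    then return the best of the t nested models.

    partition_cost(subset) is a hypothetical stand-in for the paper's
    cost model.
    """
    in_model, remaining = [], list(transactions)
    best_subset, best_cost = None, float("inf")
    while remaining:
        # try each candidate transaction not yet in the model
        step_best, step_cost = None, float("inf")
        for tx in remaining:
            cost = partition_cost(in_model + [tx])
            if cost < step_cost:
                step_best, step_cost = tx, cost
        in_model.append(step_best)    # an entered transaction stays in
        remaining.remove(step_best)
        if step_cost < best_cost:
            best_subset, best_cost = list(in_model), step_cost
    return best_subset, best_cost
```

Because earlier choices are never revisited, the result may differ from the exhaustive optimum; the validation in Section 4 quantifies how often this matters.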


Table 6. Solution for Case 3


*Best solution for 1 transaction in first segment*

Transaction 1 in first segment; m1 = 80; m2 = 405

Attributes in first segment: 1 2 3 4

Attributes in second segment: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Transaction 1 is type f; path = sequential scan; cost = 4000
Transaction 2 is type b; path = secondary index; cost = 1490
Transaction 3 is type b; path = secondary index; cost = 745
Transaction 4 is type s; path = secondary index; cost = 450
Transaction 5 is type b; path = sequential scan; cost = 14550
Transaction 6 is type s; path = primary index; cost = 75
Total cost = 21310

*Best solution for 2 transactions in first segment*

Transactions 1, 5 in first segment; m1 = 180; m2 = 305

Attributes in first segment: 1 2 3 4 12 13 17 18 19
Attributes in second segment: 5 6 7 8 9 10 11 14 15 16 20
Transaction 1 is type f; path = sequential scan; cost = 9000
Transaction 2 is type b; path = secondary index; cost = 1650
Transaction 3 is type b; path = secondary index; cost = 825
Transaction 4 is type b; path = secondary index; cost = 825
Transaction 5 is type f; path = sequential scan; cost = 5400
Transaction 6 is type s; path = primary index; cost = 60
Total cost = 17760

*Best solution for 3 transactions in first segment*

Transactions 1, 2, 5 in first segment; m1 = 230; m2 = 255

Attributes in first segment: 1 2 3 4 5 6 7 12 13 17 18 19
Attributes in second segment: 8 9 10 11 14 15 16 20
Transaction 1 is type f; path = sequential scan; cost = 11500
Transaction 2 is type f; path = secondary index; cost = 830
Transaction 3 is type b; path = secondary index; cost = 835
Transaction 4 is type b; path = secondary index; cost = 835
Transaction 5 is type f; path = sequential scan; cost = 6900
Transaction 6 is type s; path = primary index; cost = 45
Total cost = 20945

*Best solution for 4 transactions in first segment*

Transactions 1, 2, 4, 5 in first segment; m1 = 260; m2 = 225

Attributes in first segment: 1 2 3 4 5 6 7 11 12 13 14 15 17 18 19
Attributes in second segment: 8 9 10 16 20
Transaction 1 is type f; path = sequential scan; cost = 13000
Transaction 2 is type f; path = secondary index; cost = 850
Transaction 3 is type b; path = secondary index; cost = 835
Transaction 4 is type f; path = secondary index; cost = 425
Transaction 5 is type f; path = sequential scan; cost = 7800
Transaction 6 is type b; path = primary index; cost = 90
Total cost = 23000

*Best solution for 5 transactions in first segment*

Transactions 1, 2, 4, 5, 6 in first segment; m1 = 380; m2 = 105

Attributes in first segment: 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 19 20
Attributes in second segment: 9 10
Transaction 1 is type f; path = sequential scan; cost = 19000
Transaction 2 is type f; path = secondary index; cost = 900
Transaction 3 is type b; path = secondary index; cost = 780
Transaction 4 is type f; path = secondary index; cost = 450
Transaction 5 is type f; path = sequential scan; cost = 11400
Transaction 6 is type f; path = primary index; cost = 60
Total cost = 32590

*The unpartitioned case*

m1 = 485; m2 = 0

Attributes in first segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Attributes in second segment:
Transaction 1 is type f; path = sequential scan; cost = 24250
Transaction 2 is type f; path = secondary index; cost = 920
Transaction 3 is type f; path = secondary index; cost = 460
Transaction 4 is type f; path = secondary index; cost = 460
Transaction 5 is type f; path = sequential scan; cost = 14550
Transaction 6 is type f; path = primary index; cost = 75
Total cost = 40715


The rationale for the FORWARD SELECTION procedure is that a solution that takes care of the dominating transaction(s) is a good solution, even if it may not be the optimal solution. It is similar to the FORWARD STEPWISE procedure in multiple regression analysis, where an independent variable that accounts for the largest variance in the dependent variable is the first to enter the model and stays in the model thereafter [15, pp. 763-774].

The FORWARD SELECTION procedure is very efficient. Since it uses the same formulas as the MAX procedure to compute transaction costs, its run time is not dependent on the number of attributes either. To find the best one-transaction model, t models are fitted; to find the best two-transaction model, t - 1 models are fitted; and so on. The total number of models fitted is t(t + 1)/2.

4. PERFORMANCE EVALUATION

Few previous studies have conducted extensive validation of their proposed methods. Validation of partitioning methods is difficult because it is computationally expensive. However, it is extremely important to validate a method, whether the method is heuristic in nature or based on an optimizing technique. For a heuristic method, we need to know how close its solutions are to the optimal solutions. For a method based on an optimizing technique, which invariably involves certain restrictive assumptions, we need to assess the effect of those assumptions on the quality of its solutions.

To evaluate the quality of the solutions generated by the two procedures, we produced a total of 500 test cases. Each case consists of a randomly generated transaction profile with the number of records fixed at 10,000, the number of attributes at 20, and the number of transactions at 10. The impact of load concentration on the effectiveness of the two procedures is assessed. Load concentration is measured by computing the percentage of processing accounted for by the top 20% of transactions in the unpartitioned case. For example, a load concentration of 80% means that the top 20% of the transactions account for 80% of the total amount of processing. High load concentrations indicate the presence of dominating transactions. Low load concentrations indicate that the amount of processing is spread rather evenly among transactions and that there is no dominating transaction.
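The load concentration measure described above can be stated compactly. The exact rounding rule for the "top 20%" of transactions is an assumption in this sketch.

```python
def load_concentration(unpart_costs):
    """Fraction of total unpartitioned processing accounted for by the
    top 20% of transactions (a sketch of the measure described above).

    unpart_costs: per-transaction processing costs in the unpartitioned
    case. The rounding rule for "top 20%" is an assumption.
    """
    top = max(1, round(0.2 * len(unpart_costs)))   # at least one transaction
    ranked = sorted(unpart_costs, reverse=True)
    return sum(ranked[:top]) / sum(unpart_costs)
```

For example, if two of ten transactions account for 800 out of 1000 total disk accesses, the load concentration is 0.8, signalling the presence of dominating transactions.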

The quality of a solution is gauged by the following measure:

(unpart - method)/(unpart - optimum),

where unpart stands for the cost of processing all transactions in the unpartitioned case, method refers to the cost of processing all transactions for a solution derived from either the MAX or the FORWARD SELECTION procedure, and optimum is the cost of processing all transactions for the best solution derived from complete enumeration of all solutions. Thus, (unpart - optimum) represents the maximum possible improvement, whereas (unpart - method) represents the improvement generated by a particular solution.
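The measure is straightforward to compute. For instance, a method whose solution costs 60 against an unpartitioned cost of 100 and an optimum of 50 captures (100 - 60)/(100 - 50) = 0.8 of the maximum possible improvement.

```python
def solution_quality(unpart, method, optimum):
    """The quality measure defined above: the fraction of the maximum
    possible improvement over the unpartitioned case captured by a
    method's solution."""
    return (unpart - method) / (unpart - optimum)
```

Using the Case 2 figures, the MAX solution matches the optimum (9185 against an unpartitioned cost of 28590), so its measure is 1.0.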

We also compute the percentage of times the solutions generated by the two procedures agree with the optimal solutions.

To handle the large amount of computation, we ran the validation test on a Cray Y-MP8/864 supercomputer. The results show that the MAX procedure captures 99.9% of the maximum possible improvement. Its solutions agree with the optimal solutions 93% of the time. The performance of the MAX procedure matches that of the optimal solution at all load concentration levels, indicating that the MAX procedure is extremely effective.

The FORWARD SELECTION procedure captures 93% of the maximum possible improvement. Its solutions agree with the optimal solutions 56% of the time. This suggests that while the FORWARD SELECTION procedure does not always hit on the optimal solution, its solutions are not far from the optimal solutions. As indicated in Fig. 1, the performance of the FORWARD SELECTION procedure varies with load concentration. Its performance approaches the optimal as load concentration approaches 1. This makes sense because the FORWARD SELECTION procedure takes care of the dominating transaction(s) and is expected to perform well when dominating transactions exist.


[Fig. 1 is a scatter plot of the fraction of the maximum possible improvement captured by the FORWARD SELECTION procedure, plotted against load concentration from 0.1 to 1.0.]

Fig. 1. Evaluation of forward selection procedure.

5. CONCLUSION

Attribute partitioning is an important process for economizing computing resources. This paper presents and evaluates two procedures to solve the attribute partitioning problem. The development of these two procedures is inspired by statistical methods used in regression analysis.

Both procedures account for the interaction between attribute partitioning and access path selection. Instead of ignoring or pre-fixing the access path, these procedures dynamically choose an access path for each transaction to fit a partition scheme and then evaluate the partition scheme on the basis of the chosen access path. This not only ensures that partition schemes are evaluated soundly but also makes it possible to couple the process of attribute partitioning with that of query optimization.

In a departure from all previous studies, these two procedures treat the transaction instead of the attribute as the decision variable. As a result, their run time, unlike that of all previous methods, does not depend on the number of attributes. When the number of attributes is large, a condition favoring partitioning, these two procedures can be executed efficiently, whereas all previous attribute-based methods may become computationally infeasible.

Both procedures have been subjected to extensive validation. The MAX procedure generates solutions nearly as good as those from the complete enumeration of solutions. It is the procedure of choice for cases involving a modest number of transactions (<15). The FORWARD SELECTION procedure also generates good solutions. It is efficient for cases involving a large number of transactions.

The focus on transactions allows both procedures to generate detailed information not available from any previous method. This information provides new insight concerning the trade-off process in attribute partitioning. As a result, attribute partitioning ceases to be a "black-box" process incapable of generating logical explanations, as it was previously held to be. Furthermore, both procedures are easy to understand and implement. They achieve excellent results without making use of complicated mathematics.

Acknowledgements-This research was supported by a grant from the Ohio Supercomputer Center. The author would like to thank an anonymous reviewer for his/her comments.

REFERENCES

[1] S. T. March. Techniques for structuring database records. ACM Comput. Surv. 15(1), 45-79 (1983).
[2] D. W. Cornell and P. S. Yu. A vertical partitioning algorithm for relational databases. In Proc. 3rd Int. Conf. on Data Engineering, Los Angeles, pp. 30-35 (1987).
[3] J. A. Hoffer and D. G. Severance. The use of cluster analysis in physical data base design. In Proc. Int. Conf. on Very Large Data Bases, Framingham, MA, pp. 69-86. ACM, New York (1975).
[4] A. F. Cardenas. Analysis and performance of inverted data base structures. CACM 18(5), 253-263 (1975).
[5] M. Hammer and B. Niamir. A heuristic approach to attribute partitioning. In Proc. ACM SIGMOD Conf., Boston, MA, pp. 93-101. ACM, New York (1979).
[6] S. Navathe, S. Ceri, G. Wiederhold and J. Dou. Vertical partitioning algorithms for database design. ACM Trans. Database Syst. 9(4), 680-710 (1984).
[7] M. J. Eisner and D. G. Severance. Mathematical techniques for efficient record segmentation in large shared databases. J. ACM 23(4), 619-635 (1976).
[8] P. De, J. S. Park and H. Pirkul. An integrated model of record segmentation and access path selection for databases. Information Systems 13(1), 13-30 (1988).
[9] J. A. Hoffer. An integer programming formulation of computer data base design problems. Information Sciences 11, 29-48 (1976).
[10] S. T. March and D. G. Severance. The determination of efficient record segmentations and blocking factors. ACM Trans. Database Syst. 2(3), 279-296 (1977).
[11] S. T. March and G. C. Scudder. On the selection of efficient record segmentations and backup strategies for large shared databases. ACM Trans. Database Syst. 9(3), 409-438 (1984).
[12] S. K. Chang and W. H. Cheng. A methodology for structured database decomposition. IEEE Trans. Software Engng SE-6(2), 205-218 (1980).
[13] M. Schkolnick. A clustering algorithm for hierarchical structures. ACM Trans. Database Syst. 2(1), 27-44 (1977).
[14] D. S. Batory. On searching transposed files. ACM Trans. Database Syst. 4(4), 531-544 (1979).
[15] SAS Institute Inc. SAS User's Guide: Statistics. Cary, NC (1985).
[16] S. B. Navathe and M. Ra. Vertical partitioning for database design: a graphical algorithm. In Proc. ACM SIGMOD Int. Conf. on the Management of Data, Portland, OR, pp. 440-450 (1989).