
Efficient Use of SAS® Data Set Indexes in SAS® Applications

Sally Painter, SAS Institute Inc., Cary, NC

ABSTRACT

By indexing your SAS data sets, you can run certain types of applications more efficiently. The ability to index SAS data sets is available in Release 6.06 of the SAS System.

This paper discusses the costs associated with indexing a SAS data set and the types of applications where the benefits of having a SAS data set indexed outweigh the costs of implementing and maintaining the index. The target audience is users who will be designing applications for Release 6.06 and later releases of the SAS System.

INTRODUCTION

The ability to index SAS data sets is an enhancement in Version 6 of the SAS System. Having a SAS data set indexed can, in some cases, provide faster access to a subset of the data. A secondary benefit is that all values returned via an index are returned in sorted order. However, you should not create an index just to avoid sorting the data set. In most cases, the cost of maintaining the index outweighs the CPU time saved by eliminating the SORT procedure step. This paper attempts to shed some light on the types of applications that are most likely to benefit from the use of an index.

WHAT AN INDEX IS

An index is an auxiliary data structure that does not appear as a separate file to the SAS System. On all operating systems except MVS, it does appear as a separate file to the operating system. You should never manipulate the index (move, delete, and so on) except with the SAS System.

Logically, the index is an inverted tree structure that contains data values and location identifiers for the values of a key variable or variables. The SAS index structure is implemented as a self-balanced tree, which means that every leaf is the same distance from the root. This is important because it provides a uniform cost to access the leaves. In addition, any change to the SAS data set that affects the data values of the indexed variables will cause the index to be modified.

HOW AN INDEX IS BUILT

By diagramming the logic of the index structure for a particular case, you will be able to see why values returned via the index are returned in sorted order. Also, the diagram illustrates the concept of a balanced structure.

Note: The internal structure of the index would look different from this logical diagram.

Suppose that you have a variable named REGION that you want to be the key variable for your index. The values of REGION are N, E, S, and W. The number of levels in the index structure is based on the number of unique values of the key variable, which in this case is REGION. Since REGION has only four values, our logical diagram of the index contains three levels: the root (level 2), the nodes (level 1), and the leaves (level 0). Level 2 is the root of the index. There is always only one root. Level 1 contains the parent nodes. Each parent contains the highest value of REGION that lives under it. The parent node also contains a NID (child node identifier), which is the location of each child. Under each parent are two children. Each child, or leaf, contains a unique value of the key variable and the RID (record identifier). The complete set of level 0 nodes contains all values of the key variable. Also, if you read the leaves from left to right, you will notice that the values are in sorted order.

[Figure: logical diagram of the index on REGION, showing Level 2 (Root) and Level 1 (Nodes), with parent entries N,NID and W,NID.]
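As a rough sketch of the sorted-order behavior described above, the following steps read an unsorted data set through a regular index on REGION. The data set WORK.SALES, the variable AMOUNT, and the data values are hypothetical and are used only for illustration.

```sas
/* Hypothetical data set: REGION values arrive in no particular order */
data work.sales;
   input region $ amount;
   datalines;
W 10
N 20
S 30
E 40
;
run;

/* Regular index on REGION; the index name is the key variable name */
proc datasets library=work nolist;
   modify sales;
   index create region;
quit;

/* BY processing goes through the index, so no PROC SORT step is   */
/* needed and observations are returned in REGION order            */
data work.by_region;
   set work.sales;
   by region;
run;
```

Without the index, the BY statement in the last step would fail because WORK.SALES is not sorted by REGION.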

TYPES OF INDEXES

The SAS System under Version 6 supports two types of indexes: regular and composite.

A regular index is based on the value of one key variable. The name of a regular index is the same as the name of the key variable on which the index is based.

A composite index is based on the values of two or more key variables, which are concatenated to form one string. When choosing a name for a composite index, you must select a name that is different from the name of any variable in the data set. The first variable in the concatenation of a composite index can also be used by the SAS System just as if it had a regular index. For example, suppose you have created a composite index on the key variables LNAME and FNAME (in this order) and are using LNAME for BY processing in your SAS application. The SAS System could use this composite index to retrieve observations BY LNAME, since LNAME is first in the concatenation.

In Release 6.06, all key variables of a composite index are used in BY processing. For example, if your composite index is composed of three key variables and you set your data set by the same three variables (using the SET and BY statements), the SAS System retrieves the observations via the index. For WHERE clause optimization, only the first key variable of the composite index is used, regardless of how many of the key variables are specified in the WHERE statement. An example of this is given later in this paper.

You can create multiple regular or composite indexes on any Version 6 disk format SAS data set. You can also have any combination of regular and composite indexes. However, be aware that having a large number of indexes on any data set can be costly in terms of upkeep.
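The LNAME and FNAME case can be sketched as follows. The data set WORK.STAFF, the index name FULLNAME, and the data values are assumptions made for this illustration; only the variable names LNAME and FNAME come from the text.

```sas
/* Hypothetical data set, deliberately not sorted by name */
data work.staff;
   input lname $ fname $;
   datalines;
Smith Ann
Jones Bob
Smith Al
Adams Cy
;
run;

/* Composite index: its name must differ from every variable name */
proc datasets library=work nolist;
   modify staff;
   index create fullname=(lname fname);
quit;

/* BY processing on all key variables can be satisfied by the index */
data work.by_both;
   set work.staff;
   by lname fname;
run;

/* LNAME is first in the concatenation, so the same index can also  */
/* serve BY processing on LNAME alone                               */
data work.by_last;
   set work.staff;
   by lname;
run;
```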


CREATING AN INDEX

When creating an index, you can also specify attributes. Valid attributes are:

UNIQUE    used when the key variable will not contain duplicate values in the data set. Once assigned, this attribute makes it impossible for duplicate values of the key variable to be added to the data set. This is handy when working with data that should not, for any reason, contain duplicates.

NOMISS    used when missing values will not be included in the index, but may exist in the data set. If the NOMISS attribute is specified, you cannot use a WHERE statement that expects to return missing values. Also, your index structure will be physically smaller because it will not have nodes to identify the missing values.
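A minimal sketch of how these attributes might be specified with PROC DATASETS is shown here. The data set WORK.EMPLOYEES and the variables EMPID and DEPT are hypothetical.

```sas
/* Hypothetical data set: EMPID has no duplicates, DEPT has one    */
/* missing value                                                   */
data work.employees;
   input empid dept $;
   datalines;
101 SALES
102 ADMIN
103 .
;
run;

proc datasets library=work nolist;
   modify employees;
   /* UNIQUE: later attempts to add a duplicate EMPID are rejected */
   index create empid / unique;
   /* NOMISS: observations with a missing DEPT stay in the data    */
   /* set but are left out of the index                            */
   index create dept / nomiss;
quit;
```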

There are several ways to create an index on a SAS data set. You can:

• use the DATASETS procedure

• use the SQL procedure

• use the IML procedure

• create interactively through the ACCESS window.

Note: Not all types and attributes are available with each method.

The syntax for creating or deleting an index using PROC DATASETS is

PROC DATASETS IN=libref;
   MODIFY SAS_data_set;
   INDEX CREATE index_name / attributes;                    /* regular index */
   INDEX CREATE index_name=(variable_list) / attributes;    /* composite index */
   INDEX DELETE index_name;

The syntax for creating or deleting an index using PROC SQL is

PROC SQL;
   CREATE <UNIQUE> INDEX index_name ON SAS_data_set(key_variable(s));
   DROP INDEX index_name FROM SAS_data_set;

Using the ACCESS window, you should follow these steps:

• type ACCESS on any SAS Display Manager System command line

• enter a C beside the data set that will contain the index

• type INDEX CREATE on the command line of the CONTENTS window

• enter the index name, attributes, and key variable(s) and issue RUN on the command line

• issue the END command to go back to display manager.

CONFIRMATION OF THE INDEX

Once an index has been created, the next logical step is to confirm its presence in the data set. One approach is to look for the file in the operating system file structure (valid for all systems except MVS). The choices using SAS software are to use the CONTENTS procedure and to look interactively using the ACCESS, DIR, or VAR windows.

The CONTENTS procedure and the ACCESS window will give detailed information about the index. The DIR window will report the presence of the index, and the VAR window will report the variables used as index keys.

Output from PROC CONTENTS: Confirmation of the Index

CONTENTS PROCEDURE

Data Set Name: LIB.ONE
Engine: V606
Created: 8:55 Monday, December 10, 1990
Observations: 3011
Variables: 9
Observation Length: 83
Deleted Observations: 0

-----Engine/Host Dependent Information-----

Number of Data Set Pages: 35
Number of Index File Pages: 36
Release Created: 6.06
Release Last Modified: 6.06
Created by: SASSJP

-----Alphabetic List of Indexes and Attributes-----

Index

DSORG
LRECL
RECFM
VOLSER

[Other fields of the listing, including the page sizes, the physical name, and the alphabetic list of variables, are illegible in the source.]

Notice the Index File Page Size and Number of Index File Pages. This information shows the amount of DASD that the index requires. The page size is chosen by the SAS System and cannot be modified by the SAS user.

A SAS System option, MSGLEVEL=I, can be set so that an informational note is written to the SAS log when the index is used.
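The two confirmation steps could be sketched as follows: PROC CONTENTS lists the indexes, and MSGLEVEL=I requests a log note when an index is used. The data set WORK.SALES and its contents are hypothetical.

```sas
/* Hypothetical indexed data set */
data work.sales;
   input region $ amount;
   datalines;
W 10
N 20
S 30
E 40
;
run;

proc datasets library=work nolist;
   modify sales;
   index create region;
quit;

/* Confirm the index: the listing includes the index names */
proc contents data=work.sales;
run;

/* Request informational notes, including index usage, in the log */
options msglevel=i;

data work.north;
   set work.sales;
   where region = 'N';
run;
```

With a data set this small, the SAS System may well choose a sequential pass, in which case no index note appears in the log.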

WHEN AN INDEX IS USED

Once an index has been created, there is no way that you can force its use for WHERE optimization. That decision is left to the SAS System. However, there are some cases when an index is never used. They are

• FIND and SEARCH commands available with PROC FSEDIT

• a subsetting IF statement in the DATA step

• a BY statement that conflicts with a WHERE statement, that is, when no single index satisfies both the BY and WHERE statements.

An index will be used for BY processing if one is available.
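The difference between a WHERE statement and a subsetting IF statement can be sketched as follows. The data set WORK.ORDERS, the variable CUSTID, and the value 1042 are hypothetical.

```sas
/* Hypothetical data set with a regular index on CUSTID */
data work.orders;
   input custid amount;
   datalines;
1042 15
2001 20
1042 35
3107 40
;
run;

proc datasets library=work nolist;
   modify orders;
   index create custid;
quit;

/* WHERE statement: eligible for index optimization, although the  */
/* SAS System still decides whether to use the index               */
data work.subset_where;
   set work.orders;
   where custid = 1042;
run;

/* Subsetting IF: every observation is read and then tested, so    */
/* the index is never used                                         */
data work.subset_if;
   set work.orders;
   if custid = 1042;
run;
```

Both steps produce the same observations; only the first gives the SAS System the chance to retrieve them through the index.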

The SAS System can choose to use the index when using a WHERE statement. The costing algorithm compares the number of I/Os it would take to retrieve the observations via a sequential pass of the data with the number it would take to retrieve them via the index. The most cost-efficient method is selected.

Because of the design of the costing algorithm, some guidelines are suggested to help you determine whether or not to index your SAS data sets.

• Do not index a data set to be used with WHERE processing if you expect to retrieve more than one-third of the total number of observations.

• Do not index unless the data set occupies at least three pages (shown by PROC CONTENTS or PROC DATASETS).

• Keep the number of indexes per data set to a minimum.

• Index data sets where the values of the key variables are uniformly distributed.

Each of these suggestions is addressed in detail in the following sections.

Select a Small Subset of Observations

If you use a WHERE statement to select observations from a SAS data set, having your data set indexed can be an advantage if your WHERE statement selects a small number of observations from the input data set, generally one-third or less. This guideline is based on the fact that processing a data set sequentially is often more efficient when a large percentage of the observations are selected. For example, compare the idea of an index with the card catalog system in a library. If you are going to choose 75 percent of all the books in the library, it would be faster to walk through the shelves and gather the books than to look for each title in the card catalog and return to the appropriate shelf multiple times.
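One way to check a planned WHERE condition against the one-third guideline is to compute the fraction of observations it selects before creating the index. This is only a sketch; the data set WORK.SALES, the condition, and the column name FRACTION_SELECTED are hypothetical.

```sas
/* Hypothetical data set */
data work.sales;
   input region $ amount;
   datalines;
W 10
N 20
S 30
E 40
N 50
N 60
;
run;

/* The comparison evaluates to 1 (true) or 0 (false), so its sum   */
/* is the number of observations the condition would select        */
proc sql;
   select sum(region = 'N') / count(*) as fraction_selected
      from work.sales;
quit;
```

If the reported fraction is well above one-third, a sequential pass is likely to be the better choice for that condition.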

Do Not Index Small Data Sets

You should not create an index unless the data set occupies at least three pages. This suggestion is based on the fact that the index file will contain all the values of the key variable as well as a record identifier. With a small data set, your index file could be almost as large as the data file itself. Also, it is usually just as efficient to make a sequential pass of the data as it would be to find the appropriate node in the index tree and then retrieve the observation(s).

Keep the Number of Indexes to a Minimum

Once an index is created, it is automatically maintained by the SAS System. Anytime you add or delete observations from the data set, the index structure must be changed to reflect the data set changes. Also, changing a value of a key index variable in the data set requires that the node in the index structure be deleted and then a new node added to represent the new value. With multiple indexes on one data set, the resource use costs can increase dramatically.

All indexes for a particular SAS data set are stored in the same index file, and the size of the file is determined by the number of indexes you have created. You should consider the size of the index file when deciding whether to create multiple indexes.

Index Uniformly Distributed Data

The costing algorithm employed to determine whether to use an index determines the minimum value and the maximum value of the key variable, and then looks at the selection criteria (the WHERE statement specified) to determine approximately how many observations will be selected. The algorithm assumes that the data are evenly distributed between the minimum and maximum. If this is not the case, the algorithm may decide to retrieve the observations via the index when a sequential pass of the data would in fact be more efficient.
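Before indexing a variable, you could inspect how its values are distributed. The following sketch uses PROC FREQ for that purpose; the data set WORK.SHIPMENTS and the variables CITY and WEIGHT are hypothetical.

```sas
/* Hypothetical data set in which one CITY value dominates */
data work.shipments;
   input city $ weight;
   datalines;
CARY 12
CARY 7
CARY 31
AUSTIN 5
DENVER 18
;
run;

/* ORDER=FREQ lists the most frequent values first, which makes a  */
/* skewed distribution easy to spot                                */
proc freq data=work.shipments order=freq;
   tables city / nocum;
run;
```

A key variable whose frequency table is dominated by one or two values is a poor candidate for an index under the guideline above.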

AN EXAMPLE

The following example illustrates some of the factors that should be considered when deciding whether to index a SAS data set.

Let us look at an application using a composite index in the CMS environment. The application is the generation of a report using data from the automotive industry. It prepares a report of vehicles tested at specific test sites across the country. You have approximately 12000 unique values of VEHICLE, vehicle identification number, that are uniformly distributed. The values of TESTSITE, city and state, are not uniformly distributed - the value CARY,NC accounts for approximately 2/3 of the data values for TESTSITE. Your final report will list about 34 observations.

In this example, a composite index was created using the variables TESTSITE and VEHICLE, in this order. Next a WHERE clause was used to subset the data using these two variables as selection criteria.

WHERE TESTSITE='value1' AND VEHICLE='value2';

The result of subsetting the data in this way was an increase in the VCPU and TCPU statistics (reported by the STIMER and STATS system options).

VCPU, virtual CPU time, represents the CPU time spent executing within your virtual machine. TCPU, total CPU time, represents the VCPU time plus CPU time spent executing CP systems devices on behalf of your job. The TCPU statistic reflects the I/O resources.

To explain this decrease in performance, you must look at several factors. First, the variable TESTSITE is not a good candidate for an index since one value represents so many of the observations of the data. The variable VEHICLE is a good candidate since its data are evenly distributed.

Secondly, only the first key variable in a composite index is used for WHERE clause optimization. This means that our composite index had approximately the same effect as a simple index on the variable TESTSITE. We have already stated that TESTSITE is not a good candidate for indexing.

To improve performance in this example, delete the composite index and create a simple index on the most discriminating variable. In our example, this would be VEHICLE. The SAS System will then retrieve the observations meeting the selection criteria for VEHICLE via the index. Next, it will sequentially process this subset looking for the appropriate values of TESTSITE.
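The change described above might look like the following sketch. The data set WORK.TESTS, the composite index name SITECAR, and the data values are hypothetical; only the variable names TESTSITE and VEHICLE come from the example.

```sas
/* Hypothetical stand-in for the test data */
data work.tests;
   length testsite $ 12 vehicle $ 8;
   input testsite $ vehicle $;
   datalines;
CARY,NC V00001
CARY,NC V00002
AUSTIN,TX V00003
CARY,NC V00004
;
run;

/* Original design: composite index with TESTSITE first */
proc datasets library=work nolist;
   modify tests;
   index create sitecar=(testsite vehicle);
quit;

/* Revised design: drop the composite index and create a simple    */
/* index on the more discriminating variable                       */
proc datasets library=work nolist;
   modify tests;
   index delete sitecar;
   index create vehicle;
quit;

/* The WHERE statement itself is unchanged */
data work.report;
   set work.tests;
   where testsite = 'CARY,NC' and vehicle = 'V00004';
run;
```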

CONCLUSION

Indexes can be very effective in some situations, but can actually degrade performance in others. You should evaluate your data and application carefully before you decide to index your SAS data sets. As with any performance feature, there are advantages and disadvantages to weigh. Your decision should be based on which resources are most important to conserve in your computing environment. If disk space is at a premium, then you should consider the fact that the index or indexes take extra disk space. On the other hand, if I/O time is important, then you should consider creating an index to reduce the time to retrieve observations from your SAS data sets. Keep in mind that you can only provide the index, not force the SAS System to use it.


