Lesson 8: Sorting SAS Data Sets

Summary

SAS® Programming 3: Advanced Techniques and Efficiencies

Copyright © 2010 SAS Institute Inc., Cary, NC, USA. All rights reserved.

Main Points

Understanding the SORT Procedure

Sorting data is useful for reordering data for reporting, reducing data retrieval time, and enabling BY-group processing. However, PROC SORT is resource-intensive, using considerable disk space, memory, I/O, and CPU time.

You can use options or techniques with PROC SORT to minimize resource usage.

The SAS sort is supported in all operating environments, so it can’t take advantage of any platform-specific sort enhancements.

PROC SORT executes in memory up to the limit imposed by the SORTSIZE= option. In fact, PROC SORT minimizes the use of external storage and tries to sort entirely in memory, if possible.

By default, PROC SORT executes in parallel using multiple threads. Taking advantage of threaded processing in SAS can help you reduce real (elapsed) time when you sort data. These are some useful terms related to threaded processing:

Term: Definition

thread: a single, independent flow of execution through a program or within a process

parallel processing: multiple units of work that the operating system schedules for concurrent execution

symmetric multiprocessing machines (SMPs): computers with multiple CPUs that share the same memory and a thread-enabled operating system; can spawn and process multiple threads simultaneously

You can determine how many CPUs are available in your SAS session by using a PROC OPTIONS step that specifies OPTION=CPUCOUNT. When you specify OPTION=CPUCOUNT, the SAS log displays the number of available processors.

PROC OPTIONS OPTION=CPUCOUNT; RUN;

This is the process that PROC SORT uses for parallel processing. This example uses four threads:

Steps in Parallel Processing Using PROC SORT

1. PROC SORT breaks the SAS data set into chunks by dividing the total number of observations by the total number of threads available to do the parallel processing.

2. PROC SORT creates the processing threads.

3. The threads read and process data:

   o Thread 1 reads and processes chunk 1.

   o Thread 2 reads and processes chunk 2.

   o Thread 3 reads and processes chunk 3.

   o Thread 4 reads and processes chunk 4.

4. PROC SORT collates the partial results.
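
The division in step 1 can be illustrated with a short DATA step. This is only a sketch of the arithmetic, not PROC SORT's internal logic: it assumes that the number of threads equals the CPUCOUNT= value and that the orion.order_fact data set from the sample code is available. The variable names are hypothetical.

```sas
/* Illustration only: estimate how many observations each thread would
   receive if the data set were divided evenly among the available CPUs. */
data _null_;
   /* NOBS= is assigned at compile time, so no observations are read */
   if 0 then set orion.order_fact nobs=total_obs;
   /* Assumption: one thread per CPU reported by the CPUCOUNT= option */
   n_threads = input(getoption('CPUCOUNT'), 32.);
   chunk_size = ceil(total_obs / n_threads);
   put total_obs= n_threads= chunk_size=;
   stop;
run;
```

The values are written to the SAS log; the actual chunking that PROC SORT performs might differ.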

Using threaded processing completes the sort in less real time than handling each task sequentially, although the CPU time is generally increased.

Other SAS tasks besides sorting can also exploit threading. These tasks include subsetting using WHERE expressions, filtering variables using DROP or KEEP statements or data set options, indexing, and summarizing data. In addition to PROC SORT, these Base SAS procedures are multithreaded: PROC MEANS, PROC SUMMARY, PROC REPORT, PROC SQL (using the GROUP BY and ORDER BY clauses), and PROC TABULATE. Many SAS/STAT procedures are also multithreaded.

When you benchmark using the threaded procedures, compare real time rather than CPU time. The back-end collating process to re-create the single data set might increase total CPU time while reducing real or elapsed time.
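
One possible way to run such a benchmark is sketched below. It assumes the orion.order_fact data set from the sample code; the output data set names are hypothetical. The FULLSTIMER system option writes real time and CPU time for each step to the SAS log so that the two runs can be compared.

```sas
/* Write detailed resource statistics (real time, CPU time) to the log */
options fullstimer;

/* Threaded sort */
proc sort data=orion.order_fact out=work.sorted_threads threads;
   by Order_Date;
run;

/* Non-threaded sort of the same data, for comparison */
proc sort data=orion.order_fact out=work.sorted_nothreads nothreads;
   by Order_Date;
run;
```

Compare the real time reported for the two steps. Running each step several times gives a more reliable picture than a single run.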

Controlling Threaded Processing in PROC SORT

You can enable or disable threaded sorting in two ways. You can use the THREADS | NOTHREADS system option, or you can specify the THREADS | NOTHREADS option in the PROC SORT statement. In both cases, the default is THREADS.

Specifying the THREADS | NOTHREADS option in a PROC statement overrides the THREADS | NOTHREADS system option.

The THREADS | NOTHREADS option interacts with the TAGSORT option. If you specify the TAGSORT option with PROC SORT, SAS disables threading. The TAGSORT option stores only the BY variables and the observation numbers, called tags, in temporary files. When the sorting process completes, PROC SORT uses the tags to retrieve observations from the input data set in sorted order.
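
A minimal sketch of a TAGSORT step follows. It assumes the orion.order_fact data set from the sample code; the output data set name is hypothetical.

```sas
/* TAGSORT keeps only the BY variable values and observation numbers
   in the temporary files; threading is not used for this step. */
proc sort data=orion.order_fact out=work.order_fact_sorted tagsort;
   by Order_Date;
run;
```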

To control the number of processors that are available for SAS to use, you can specify the CPUCOUNT= system option. The default CPU count is the actual number of CPUs available. Specifying a numeric value tells SAS how many processors to assume are available. If you specify a value that is higher than the number of CPUs that are actually available, overall performance might be reduced, because SAS might allocate more threads than there are processors. Your system administrator might limit the number of CPUs that are available for SAS processing, so ACTUAL might be lower than the total number of CPUs in the machine that SAS is using.

OPTIONS CPUCOUNT=ACTUAL | 1-1024;

Improving Sort Performance

When you use the SAS sort, a quick rule of thumb for sort space is four times the size of the SAS data set. Even when you sort in place (sort a data set back to the same name), you need enough space in the library for two copies of the data.

Sorting takes place in the PROC SORT utility work space. This work space is shared by memory and disk. But if you can sort the data all in memory, the sort runs faster, because you avoid writing and reading temporary utility swap files.

Determining how much sort space you need is not an exact science. The amount of space that the SAS sort needs depends on four conditions:

o The first is whether PROC SORT can use threading. Threaded sorts take less space than non-threaded sorts. Threaded sorts generally require three times the size of the SAS data set being sorted.

o The second condition concerns the data itself and has two parts: the length of the observations, and the number of variables in the BY statement and their storage lengths. These factors are important because the utility work space requires enough room to hold an entire observation and two copies of the BY variable values for every observation. The SAS sort routine uses the duplicate BY variable values to retrieve BY values quickly without having to reread the entire observation.

o The third condition is the operating environment where PROC SORT executes, which plays a big part in allocating space.

o The final condition is the library where PROC SORT writes the sorted data. You need enough space in the source library for one data set and enough space in the target library for one copy of the data set.

For more information about calculating sort space, see Calculating Sort Space in the appendix Details.
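
The rules of thumb above can be turned into a rough estimate with a query against the DICTIONARY.TABLES view. This is a sketch under assumptions: it relies on the FILESIZE column, which might not be present in every SAS release, and it uses the orion.order_fact data set from the sample code. The multipliers are the three-times (threaded) and four-times (quick rule of thumb) figures given in this lesson, not exact requirements.

```sas
proc sql;
   select libname,
          memname,
          filesize   format=comma20. label='Data set size (bytes)',
          filesize*3 format=comma20. label='Threaded estimate (bytes)',
          filesize*4 format=comma20. label='Rule-of-thumb estimate (bytes)'
      from dictionary.tables
      /* Library and member names are stored in uppercase */
      where libname='ORION' and memname='ORDER_FACT';
quit;
```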

Determining sort space requirements has no specific guidelines, because the required sort space depends on your data. However, if you don’t have enough memory or virtual memory allocated to PROC SORT, the procedure won’t have enough memory to divide the space for each thread.

To avoid this problem, you can use the SORTSIZE= option in the PROC SORT statement to specify the amount of memory that's available to PROC SORT. The SORTSIZE= option can also improve the sort performance by restricting the operating system’s swapping of memory to disk. The possible SORTSIZE= values depend on your operating environment.

SORTSIZE=n | nK | nM | nG | MAX | SIZE

The default SORTSIZE= value in the Windows operating environment is 64 megabytes.

A SORTSIZE= value as large as the required sort space ensures that the sort occurs in memory. This reduces processing time.

If PROC SORT needs more memory than you specify, it creates a temporary utility file to complete the sort. This increases processing time. For the multi-threaded SAS®9 sort, if the SORTSIZE= value is too small, the sort fails to complete at all.

For optimal performance, you should set the SORTSIZE= option to a value smaller than the available physical memory. This enables the programs and the operating environment to stay resident in memory.

You should investigate how changing the value of the SORTSIZE= option affects resources.

In some cases, using a host sort utility might be the most effective way to sort data. A host sort utility is the operating system's native sort utility, such as IBM's DFSORT, or a third-party sort utility such as SYNCSORT. Host sort utilities are available in the Windows, UNIX, and z/OS operating environments. Ask your system administrator whether a host sort utility is available at your site.

Generally, the SAS sort is more efficient for smaller data sets, because it is an in-memory sort, whereas a host sort is more efficient for larger data sets.

You can use several SAS system options to specify the sort utility that PROC SORT uses.

o The SORTPGM= option specifies whether PROC SORT uses the SAS sort utility or the host sort utility. Specifying SORTPGM=HOST in an OPTIONS statement always sorts using the host sort utility, and SORTPGM=SAS always uses the SAS sort utility. If you specify BEST, SAS chooses a sort utility based on two factors: the number of bytes being sorted and the value of the SORTCUTP= option.

o The SORTCUTP= option specifies the cutoff point between the SAS sort and the host sort. If the data set contains more bytes than the SORTCUTP= value, the host sort utility sorts the entire data set. The default values are 0 in the Windows and UNIX operating environments, which means the SAS sort is used, and four megabytes in the z/OS operating environment. To determine the optimal value of the SORTCUTP= option, you should specify the SORTPGM= option and benchmark a PROC SORT step with larger and larger data sets.

o The SORTPGM= option also interacts with the SORTNAME= option. If the value of the SORTPGM= option is BEST or HOST and you have multiple host sort utilities available, you can use the SORTNAME= option to specify which host sort utility PROC SORT uses.

Setting the Sort Indicator and the Validation Indicator

When SAS sorts a data set, it sets a sort indicator. When the sort indicator is YES and you try to re-sort the data by the same BY variables, SAS doesn't perform another sort. Even when PROC SORT creates a separate output data set, if the data is already sorted, the procedure only copies the data set.

At the bottom of PROC CONTENTS output, SAS prints sort information for the data set, including the BY variables used for the sort, a validation indicator for whether or not SAS validated the sort, and the collating sequence used to order the data.

You can set the sort indicator and the validation indicator in several ways.

If the input data is already in sorted order, you can specify the order by using the SORTEDBY= data set option. This option applies to the output data set. The BY clause indicates the data order, and _NULL_ removes any existing sort information from the descriptor portion of the data set.

(SORTEDBY=BY-clause | _NULL_ )

The SORTEDBY= option sets the sort indicator on the data set to YES and asserts that the data is ordered by the specified variables (Order_Date in the sample code). However, because SAS hasn't yet validated the data order, the validation indicator remains NO, and SAS has to check the order while processing the data set.

You can use two methods for asking SAS to validate that a data set really is sorted, sort the data set only if necessary, and set the validation indicator to YES:

o The first is the SORTVALIDATE system option. This option causes the SORT procedure to validate that a data set is sorted correctly when a user-specified sort indicator is set. If the data set is sorted correctly, SAS sets the validation indicator to YES. If the data set is not sorted correctly, SAS sorts the data set and then sets the validation indicator to YES.

OPTIONS SORTVALIDATE;

o The second way is using the PRESORTED option in the PROC SORT statement. The PRESORTED option is available beginning in SAS 9.2. This option tells PROC SORT to check the input data set to determine whether the observations are in order before sorting the data. If the data is already in order, specifying the PRESORTED option lets you avoid the cost of sorting the data set. The PRESORTED option validates the sequencing of the data, sorts the data if it is not sequenced properly, and sets both the sort indicator and the validation indicator to YES.

PROC SORT DATA=SAS-data-set PRESORTED;
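
The SORTVALIDATE method is not covered by the sample code, so here is a minimal sketch. It assumes that the work.january data set created in the SORTEDBY= sample already exists; the copy's name is hypothetical.

```sas
/* Ask PROC SORT to validate user-specified sort indicators */
options sortvalidate;

/* Assert (without proof) that the copy is ordered by Order_Date */
data work.january_copy(sortedby=Order_Date);
   set work.january;
run;

/* PROC SORT checks the order and sorts only if the data is not in order */
proc sort data=work.january_copy;
   by Order_Date;
run;

/* The sort information at the bottom of the output shows the indicators */
proc contents data=work.january_copy;
run;

/* Restore the NOSORTVALIDATE setting */
options nosortvalidate;
```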

Controlling the Sort Order

When you sort data, you can control the sort order in two ways: by specifying a collating sequence, and by specifying whether or not the observations in a BY group remain in the same order in the output data set. Controlling the order of observations is also a potential way to improve sort performance.

The character set determines the sort order of characters. By default, PROC SORT uses the ASCII collating sequence in the Windows and UNIX operating environments, and the EBCDIC collating sequence in the z/OS operating environment. To change the collating sequence, you can specify one collating option in the PROC SORT statement.

In addition, by default PROC SORT maintains the order of the observations within a BY group in the output data set. You can use the EQUALS | NOEQUALS option in the PROC SORT statement to specify the order of the observations within a BY group in the output data set. EQUALS preserves, in the output data set, the relative order that observations within each BY group had in the input data set. EQUALS is the default, but it’s more expensive in terms of CPU time, memory, and I/O. NOEQUALS does not guarantee the original order of observations within BY groups. However, NOEQUALS can save CPU time, memory, and I/O.

Both EQUALS and NOEQUALS guarantee the order of the data that you specify in the BY statement.
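
For comparison with the EQUALS sample, a NOEQUALS step might look like the following sketch. It assumes the orion.customer data set from the sample code; the output data set name is hypothetical.

```sas
/* NOEQUALS: the output is still ordered by Country, but the relative
   order of observations within each Country is not guaranteed. */
proc sort data=orion.customer
          out=work.customer_noequals noequals;
   by Country;
run;

proc print data=work.customer_noequals(obs=10);
   var Customer_ID Country;
   title 'With NOEQUALS Option';
run;

/* Clear the title so it does not carry over to later steps */
title;
```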

To detect and remove observations with duplicate BY values, you can use the NODUPKEY option in the PROC SORT statement. To specify the output data set where SAS writes the duplicate observations, you can use the DUPOUT= option. The DUPOUT= option is new in SAS 9.

PROC SORT DATA=SAS-data-set NODUPKEY DUPOUT=SAS-data-set;

To replace the default ASCII or EBCDIC collating sequence, you can specify one collating sequence option in the PROC SORT statement. You can specify REVERSE to reverse the default sequence. Or you can specify DANISH, FINNISH, NORWEGIAN, POLISH, or SWEDISH. You can also specify NATIONAL for a customized sequence. Finally, you can specify the SORTSEQ= option to specify a collating sequence, a translation table such as POLISH or SPANISH, an encoding, or the keyword LINGUISTIC.

PROC SORT DATA=SAS-data-set <collating-sequence-option>;

In SAS 9.2, you can use SORTSEQ=LINGUISTIC to specify linguistic collation, which sorts characters according to rules of a specified language. In turn, the setting of the SAS system option LOCALE determines the language.

Within SORTSEQ=LINGUISTIC, the NUMERIC_COLLATION=ON collating rule orders integer values within the text by their numeric values instead of by the characters used to represent the numbers.

You can also specify other collating rules for the LINGUISTIC option, including CASE_FIRST= and STRENGTH=. For more information about these collating rules, see Collating Rules in the appendix Details.
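
As one illustration of these other collating rules, the sketch below uses STRENGTH=PRIMARY, which is commonly used to ignore differences in case when ordering character values. It assumes the orion.customer data set from the sample code; the output data set name is hypothetical, and the appendix remains the reference for the exact behavior of each rule.

```sas
/* Linguistic sort that treats uppercase and lowercase letters alike */
proc sort data=orion.customer out=work.customer_nocase
          sortseq=linguistic(strength=primary);
   by Customer_Address;
run;
```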

Sample Code

Using the THREADS | NOTHREADS Option

options nothreads;

/* THREADS in the PROC SORT statement overrides the NOTHREADS system option */
proc sort data=orion.order_fact threads;
   by Order_Date;
run;

Using the CPUCOUNT= Option

options cpucount=5;

Using the SORTSIZE= Option

proc sort data=orion.order_fact sortsize=300M;
   by Order_Date;
run;

Using the SORTPGM=, SORTCUTP=, and SORTNAME= Options

options sortpgm=best sortcutp=40M sortname="syncsort";

Using the SORTEDBY= Option

filename M1 'mon1.dat'; * change the filepath as needed;

data january(sortedby=Order_Date);
   infile M1 dlm=',';
   input Customer_ID Order_ID Order_Type
         Order_Date : date9.
         Delivery_Date : date9.;
run;

proc contents data=january;
run;

Using the PRESORTED Option

proc sort data=orion.salesstaff presorted;
   by Emp_Hire_Date;
run;

Using the EQUALS | NOEQUALS Option

proc sort data=orion.customer
          out=customer_equals equals;
   by Country;
run;

proc print data=customer_equals(obs=10);
   var Customer_ID Country;
   title 'With EQUALS Option';
run;

Using the NODUPKEY and DUPOUT= Options

proc sort data=orion.salesstaff nodupkey
          out=oneemp
          dupout=extra;
   by Employee_ID;
run;

Using the SORTSEQ= Option with the NUMERIC_COLLATION=ON Collating Rule

proc sort data=orion.customer out=customer
          sortseq=linguistic(numeric_collation=on);
   by Customer_Address;
run;