Lesson 8: Sorting SAS Data Sets
Summary
SAS® Programming 3: Advanced Techniques and Efficiencies 1
Copyright © 2010 SAS Institute Inc., Cary, NC, USA. All rights reserved.
Main Points
Understanding the SORT Procedure
Sorting data is useful for reordering data for reporting, reducing data retrieval time, and
enabling BY-group processing. However, PROC SORT is resource-intensive, using
considerable disk space, memory, I/O, and CPU time.
You can use options or techniques with PROC SORT to minimize resource usage.
SAS supports PROC SORT in all operating environments, so PROC SORT can’t take
advantage of any platform-specific sort enhancements.
PROC SORT executes in memory up to the limit imposed by the SORTSIZE= option. In
fact, PROC SORT minimizes the use of external storage and tries to sort entirely in memory,
if possible.
By default, PROC SORT executes in parallel using multiple threads. Taking advantage of
threaded processing in SAS can help you reduce I/O when you sort data. These are some
useful terms related to threaded processing:
Term                        Definition
thread                      a single, independent flow of execution through a
                            program or within a process
parallel processing         multiple units of work that the operating system
                            schedules for concurrent execution
symmetric multiprocessing   computers with multiple CPUs that share the same
machines (SMPs)             memory and a thread-enabled operating system; can
                            spawn and process multiple threads simultaneously
You can determine how many CPUs are available in your SAS session by using a PROC
OPTIONS step that specifies OPTION=CPUCOUNT. When you specify
OPTION=CPUCOUNT, the SAS log displays the number of available processors.
PROC OPTIONS OPTION=CPUCOUNT; RUN;
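As a complement to the PROC OPTIONS step, the same value can be read in open code with the GETOPTION function. This is a minimal sketch; the macro variable name ncpu is hypothetical and not part of the lesson.

```sas
/* Write the current CPUCOUNT value to the SAS log */
%put NOTE: CPUCOUNT=%sysfunc(getoption(cpucount));

/* Store the value for later use; ncpu is a hypothetical macro variable name */
%let ncpu = %sysfunc(getoption(cpucount));
%put NOTE: This session reports &ncpu available CPU(s).;
```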
This is the process that PROC SORT uses for parallel processing. This example uses four
threads:
Steps in Parallel Processing Using PROC SORT
1. PROC SORT breaks the SAS data set into chunks by dividing the total
number of observations by the total number of threads available to do the
parallel processing.
2. PROC SORT creates the processing threads.
3. The threads read and process data:
Thread 1 starts reading and processing data chunk 1.
Thread 2 reads and processes chunk 2.
Thread 3 reads and processes chunk 3.
Thread 4 reads and processes chunk 4.
4. PROC SORT collates the partial results.
Using threaded processing completes the sort in less real time than handling each task
sequentially, although the CPU time is generally increased.
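The chunk, sort, and collate steps above can be imitated by hand to see the idea. The sketch below is only an illustration and runs sequentially; it is not how PROC SORT is implemented internally. It assumes the SASHELP.CLASS sample data set (19 observations) is available, and the chunk data set names are hypothetical.

```sas
/* Steps 1-3, imitated by hand: split the input into four chunks and sort each one. */
/* PROC SORT does this internally with threads; here the steps run one at a time.   */
proc sort data=sashelp.class(firstobs=1  obs=5)  out=work.chunk1; by Height; run;
proc sort data=sashelp.class(firstobs=6  obs=10) out=work.chunk2; by Height; run;
proc sort data=sashelp.class(firstobs=11 obs=15) out=work.chunk3; by Height; run;
proc sort data=sashelp.class(firstobs=16 obs=19) out=work.chunk4; by Height; run;

/* Step 4, collate: interleaving sorted inputs with SET and BY */
/* produces a single data set that is sorted by Height.       */
data work.class_sorted;
   set work.chunk1 work.chunk2 work.chunk3 work.chunk4;
   by Height;
run;
```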
Other SAS tasks besides sorting can also exploit threading. These tasks include subsetting
using WHERE expressions, filtering variables using DROP or KEEP statements or data set
options, indexing, and summarizing data. In addition to PROC SORT, these Base SAS
procedures are multithreaded: PROC MEANS, PROC SUMMARY, PROC REPORT, PROC
SQL (using the GROUP BY and ORDER BY clauses), and PROC TABULATE. Many
SAS/STAT procedures are also multithreaded.
When you benchmark using the threaded procedures, compare real time rather than CPU
time. The back-end collating process to re-create the single data set might increase total CPU
time while reducing real or elapsed time.
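One way to make this comparison is to turn on the FULLSTIMER system option and run the same sort with and without threading. This is a sketch that assumes the orion library from the sample code is assigned; the output data set names are hypothetical.

```sas
/* FULLSTIMER writes real time, CPU time, and memory statistics to the log */
options fullstimer;

/* Threaded sort: compare the "real time" line in the log */
proc sort data=orion.order_fact out=work.sorted_threads threads;
   by Order_Date;
run;

/* Non-threaded sort of the same data for comparison */
proc sort data=orion.order_fact out=work.sorted_nothreads nothreads;
   by Order_Date;
run;
```

Run each step several times and compare the real time values, because caching can make the first run unrepresentative.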
Controlling Threaded Processing in PROC SORT
You can enable or disable threaded sorting in two ways. You can use the THREADS |
NOTHREADS system option, or you can specify the THREADS | NOTHREADS option in
the PROC SORT statement. In both cases, the default is THREADS.
Specifying the THREADS | NOTHREADS option in a PROC statement overrides the
THREADS | NOTHREADS system option.
The THREADS | NOTHREADS option interacts with the TAGSORT option. If you specify
the TAGSORT option with PROC SORT, SAS disables threading. The TAGSORT option
stores only the BY variables and the observation numbers, called tags, in temporary files.
When the sorting process completes, PROC SORT uses the tags to retrieve observations from
the input data set in sorted order.
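A minimal sketch of a TAGSORT request follows, assuming the orion library from the sample code is assigned. The output data set name is hypothetical.

```sas
/* TAGSORT keeps only the BY values and observation numbers in temporary files. */
/* As described above, specifying it means the sort is not threaded.            */
proc sort data=orion.order_fact out=work.order_fact_tag tagsort;
   by Order_Date;
run;
```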
To control the number of processors that are available for SAS to use, you can specify the
CPUCOUNT= system option. The default CPU count is the actual number of CPUs
available. Specifying a numeric value for the CPUCOUNT= option can only decrease the
number of CPUs available to SAS. If the CPUCOUNT= value is larger than the number of
CPUs that your machine actually has, SAS cannot use more CPUs than exist. Specifying
such a value might reduce overall performance, because SAS may allocate more threads
than there are available processors. Your system administrator might limit the number of CPUs that are available for
SAS processing. So ACTUAL might be lower than the total number of CPUs in the machine
that SAS is using.
OPTIONS CPUCOUNT=ACTUAL | 1-1024;
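For example, the following sketch lowers the CPU count, confirms the setting in the log, and then restores the default. The value 2 is an arbitrary choice for illustration.

```sas
/* Tell SAS to assume 2 CPUs for thread-enabled procedures (arbitrary example value) */
options cpucount=2;

/* Confirm the current setting in the log */
proc options option=cpucount;
run;

/* Return to the number of CPUs that are actually available */
options cpucount=actual;
```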
Improving Sort Performance
When you use the SAS sort, a quick rule of thumb for sort space is four times the size of the
SAS data set. Even when you sort in place (sort a data set back to the same name), you need
enough space in the library for two copies of the data.
Sorting takes place in the PROC SORT utility work space. This work space is shared by
memory and disk. But if you can sort the data all in memory, the sort runs faster, because you
avoid writing and reading temporary utility swap files.
Determining how much sort space you need is not an exact science. The amount of space that
the SAS sort needs depends on four conditions:
o The first is whether PROC SORT can use threading. Threaded sorts take less space than
non-threaded sorts. Threaded sorts generally require three times the size of the SAS data
set being sorted.
o The second condition concerns the data itself and has two parts: the length of the
observations, and the number of variables in the BY statement and their storage lengths.
These factors are important because the utility work space requires enough room to hold
an entire observation and two copies of the BY variable values for every observation. The
SAS sort routine uses the duplicate BY variable values to retrieve BY values quickly
without having to reread the entire observation.
o The third condition is the operating environment where PROC SORT executes, which
plays a big part in allocating space.
o The final condition is the library where PROC SORT writes the sorted data. You need
enough space in the source library for one data set and enough space in the target library
for one copy of the data set.
For more information about calculating sort space, see Calculating Sort Space in the
appendix Details.
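As a rough starting point, the rule of thumb can be applied to the observation count and observation length that SAS stores for a data set. The sketch below assumes the orion library is assigned and uses the DICTIONARY.TABLES view in PROC SQL; the column aliases are hypothetical, and the result is only an estimate, not the calculation described in the appendix.

```sas
proc sql;
   /* NOBS * OBSLEN approximates the uncompressed data size in bytes */
   select memname,
          nobs,
          obslen,
          nobs*obslen as approx_bytes format=comma20.,
          /* rule of thumb from the text: four times the data set size */
          calculated approx_bytes * 4 as rule_of_thumb_bytes format=comma20.
      from dictionary.tables
      where libname='ORION' and memname='ORDER_FACT';
quit;
```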
Determining sort space requirements has no specific guidelines, because the required sort
space depends on your data. However, if you don’t have enough memory or virtual memory
allocated to PROC SORT, the procedure won’t have enough memory to divide the space for
each thread.
To avoid this problem, you can use the SORTSIZE= option in the PROC SORT statement to
specify the amount of memory that's available to PROC SORT. The SORTSIZE= option can
also improve the sort performance by restricting the operating system’s swapping of memory
to disk. The possible SORTSIZE= values depend on your operating environment.
SORTSIZE=n | nK | nM | nG | MAX | SIZE
The default SORTSIZE= value in the Windows operating environment is 64 megabytes.
A SORTSIZE= value as large as the required sort space ensures that the sort occurs in
memory. This reduces processing time.
If PROC SORT needs more memory than you specify, it creates a temporary utility file to
complete the sort. This increases processing time. For the multi-threaded SAS®9 sort, if the
SORTSIZE= value is too small, the sort fails to complete at all.
For optimal performance, you should set the SORTSIZE= option to a value smaller than the
available physical memory. This enables the programs and the operating environment to stay
resident in memory.
You should investigate how changing the value of the SORTSIZE= option affects resources.
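A simple way to investigate is to run the same sort with different SORTSIZE= values and compare the statistics in the log. This sketch assumes the orion library is assigned; the two sizes and the output data set names are arbitrary examples.

```sas
/* Report detailed resource usage in the log */
options fullstimer;

/* Show the session's current SORTSIZE= setting */
proc options option=sortsize;
run;

/* Same sort with a smaller and a larger memory limit (example values only) */
proc sort data=orion.order_fact out=work.sort_64m sortsize=64M;
   by Order_Date;
run;

proc sort data=orion.order_fact out=work.sort_300m sortsize=300M;
   by Order_Date;
run;
```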
In some cases, using a host sort utility might be the most effective way to sort data. A host
sort utility is the operating system's native sort utility, such as IBM's DFSORT, or a third-
party sort utility such as SYNCSORT. Host sort utilities are available in the Windows,
UNIX, and z/OS operating environments. Ask your system administrator whether a host sort
utility is available at your site.
Generally, the SAS sort is more efficient for smaller data sets, because it is an in-memory
sort, whereas a host sort is more efficient for larger data sets.
You can use several SAS system options to specify the sort utility that PROC SORT uses.
o The SORTPGM= option specifies whether PROC SORT uses the SAS sort utility or the
host sort utility. For example, an OPTIONS statement that specifies SORTPGM=HOST
always sorts using the host sort utility. If you specify BEST, SAS chooses a sort utility based on two
factors: the number of bytes being sorted and the value of the SORTCUTP= option.
o The SORTCUTP= option specifies the cutoff point between the SAS sort and the host
sort. If the data set contains more bytes than the SORTCUTP= value, the host sort utility
sorts the entire data set. The default values are 0 in the Windows and UNIX operating
environments, which means the SAS sort is used, and four megabytes in the z/OS
operating environment. To determine the optimal value of the SORTCUTP= option, you
should specify the SORTPGM= option and benchmark a PROC SORT step with larger
and larger data sets.
o The SORTPGM= option also interacts with the SORTNAME= option. If the value of the
SORTPGM= option is BEST or HOST and you happen to have multiple host sort utilities
available, you can use the SORTNAME= option to specify which host sort utility PROC
SORT uses.
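Before changing these options, it can help to see what the session currently uses. The sketch below lists the sort-related system options and then requests the host sort. It assumes a host sort utility is installed at your site; if none is available, leave SORTPGM= at its current value.

```sas
/* List the current values of the sort-related system options */
proc options group=sort;
run;

/* Always use the host sort utility (assumes one is available at your site) */
options sortpgm=host;
```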
Setting the Sort Indicator and the Validation Indicator
Even when PROC SORT creates a separate output data set, if the data is already sorted, the
procedure only copies the data set. When SAS sorts a data set, it sets a sort indicator. When
the sort indicator is YES and you try to re-sort the data by the same BY variables, SAS
doesn't perform another sort.
At the bottom of PROC CONTENTS output, SAS prints sort information for the data set,
including the BY variables used for the sort, a validation indicator for whether or not SAS
validated the sort, and the collating sequence used to order the data.
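To see this sort information, sort a data set and then run PROC CONTENTS on the result. This sketch assumes the orion library is assigned; the output data set name is hypothetical.

```sas
/* Sorting sets the sort indicator on the output data set */
proc sort data=orion.order_fact out=work.order_fact_sorted;
   by Order_Date;
run;

/* The sort information appears at the bottom of the PROC CONTENTS output */
proc contents data=work.order_fact_sorted;
run;
```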
You can set the sort indicator and the validation indicator in several ways.
If the input data is already in sorted order, you can specify the order by using the
SORTEDBY= data set option. This option applies to the output data set. The BY clause
indicates the data order, and _NULL_ removes any existing sort information from the
descriptor portion of the data set.
(SORTEDBY=BY-clause | _NULL_ )
The SORTEDBY= option sets the sort indicator on the data set to YES and asserts that the
data is ordered by the variables that you name (Order_Date in the sample code for this
lesson). However, because SAS hasn't yet validated the data order, it
has to check the order while processing the data set.
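The _NULL_ form can be applied to an existing data set as well. The sketch below uses the MODIFY statement in PROC DATASETS to remove sort information from WORK.JANUARY, the data set created in the sample code for this lesson; treat it as one possible approach rather than the lesson's own method.

```sas
/* Remove any existing sort information from the descriptor portion */
proc datasets library=work nolist;
   modify january(sortedby=_null_);
quit;

/* Confirm that the sort information is no longer reported */
proc contents data=work.january;
run;
```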
You can use two methods for asking SAS to validate that a data set really is sorted, sort the
data set only if necessary, and set the validation indicator to YES:
o The first is the SORTVALIDATE system option. This option causes the SORT procedure
to validate that a data set is sorted correctly when a user-specified sort indicator is set. If
the data set is sorted correctly, SAS sets the validation indicator to YES. If the data set is
not sorted correctly, SAS sorts the data set and then sets the validation indicator to YES.
OPTIONS SORTVALIDATE;
o The second way is to use the PRESORTED option in the PROC SORT statement. The
PRESORTED option is available beginning in SAS 9.2, and its syntax is simple.
This option tells PROC SORT to check the input data set to determine whether the
observations are in order before sorting the data. If the observations are already in
order, you avoid the cost of sorting the data set. The PRESORTED option is powerful. It
validates the sequencing of the data, sorts the data if it is not sequenced properly, and sets
both the sort indicator and the validation indicator to YES.
PROC SORT DATA=SAS-data-set PRESORTED;
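A sketch of the first method follows. It assumes WORK.JANUARY was created with the SORTEDBY= data set option, as in the sample code for this lesson, so that a user-specified sort indicator is already set.

```sas
/* Ask PROC SORT to validate user-specified sort indicators */
options sortvalidate;

/* PROC SORT checks the order, sorts only if needed, */
/* and sets the validation indicator to YES         */
proc sort data=work.january;
   by Order_Date;
run;

/* The validation indicator is reported in the sort information */
proc contents data=work.january;
run;
```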
Controlling the Sort Order
When you sort data, you can control the sort order in two ways: by specifying a collating
sequence, and by specifying whether or not the observations in a BY group remain in the
same order in the output data set. Controlling the order of observations is also a potential way
to improve sort performance.
The character set determines the sort order of characters. By default, PROC SORT uses the
ASCII collating sequence in the Windows and UNIX operating environments, and the
EBCDIC collating sequence in the z/OS operating environment. To change the collating
sequence, you can specify one collating option in the PROC SORT statement.
In addition, by default PROC SORT maintains the order of the observations within a BY
group in the output data set. You can also use the EQUALS | NOEQUALS option in the
PROC SORT statement to specify the order of the observations within a BY group in the
output data set. EQUALS preserves, in the output data set, the original order that
observations had within BY groups in the input data. EQUALS is the default, but it’s more expensive in
terms of CPU time, memory, and I/O. NOEQUALS does not guarantee the original order of
observations within BY groups. However, NOEQUALS can save CPU time, memory, and
I/O.
Both EQUALS and NOEQUALS guarantee the order of the data that you specify in the BY
statement.
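To see the difference, sort the same data both ways and compare the results. This sketch assumes the orion library is assigned; the output data set names are hypothetical. The two outputs are not guaranteed to differ, because NOEQUALS only removes the guarantee of the original order.

```sas
/* EQUALS (the default): original order is kept within each Country */
proc sort data=orion.customer out=work.customer_equals equals;
   by Country;
run;

/* NOEQUALS: order within each Country is not guaranteed */
proc sort data=orion.customer out=work.customer_noequals noequals;
   by Country;
run;

/* Report any observations whose positions differ between the two results */
proc compare base=work.customer_equals compare=work.customer_noequals;
run;
```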
To detect and remove observations with duplicate BY values, you can use the NODUPKEY
option in the PROC SORT statement. To specify the output data set where SAS writes the
duplicate observations, you can use the DUPOUT= option. The DUPOUT= option is new in
SAS 9.
PROC SORT DATA=SAS-data-set NODUPKEY DUPOUT=SAS-data-set;
To replace the default ASCII or EBCDIC collating sequence, you can specify one collating
sequence option in the PROC SORT statement. You can specify REVERSE to reverse the
default sequence. Or you can specify DANISH, FINNISH, NORWEGIAN, POLISH, or
SWEDISH. You can also specify NATIONAL for a customized sequence. Finally, you can
specify the SORTSEQ= option to specify a collating sequence, a translation table such as
POLISH or SPANISH, an encoding, or the keyword LINGUISTIC.
PROC SORT DATA=SAS-data-set <collating-sequence-option>;
In SAS 9.2, you can use SORTSEQ=LINGUISTIC to specify linguistic collation, which sorts
characters according to rules of a specified language. In turn, the setting of the SAS system
option LOCALE determines the language.
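The sketch below shows two of these choices: the REVERSE option, and linguistic collation with an explicit locale. It assumes the orion library is assigned. The output data set names are hypothetical, and the Swedish locale value sv_SE is only an example; without the LOCALE= rule, the LOCALE system option applies.

```sas
/* Reverse the default collating sequence for character BY variables */
proc sort data=orion.customer out=work.customer_reverse reverse;
   by Country;
run;

/* Linguistic collation using the rules of an explicitly named locale */
proc sort data=orion.customer out=work.customer_ling
          sortseq=linguistic(locale=sv_SE);
   by Customer_Address;
run;
```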
Within SORTSEQ=LINGUISTIC, the NUMERIC_COLLATION=ON collating rule orders
integer values within the text by their numeric values instead of by the characters used to
represent the numbers.
You can also specify other collating rules for the LINGUISTIC option, including
CASE_FIRST= and STRENGTH=. For more information about these collating rules, see
Collating Rules in the appendix Details.
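As a brief illustration of these collating rules, the sketch below requests a comparison that ignores case, and a second sort that places uppercase before lowercase. It assumes the orion library is assigned; the output data set names are hypothetical, and the appendix remains the reference for the full set of values.

```sas
/* STRENGTH=PRIMARY compares base characters only, so case differences are ignored */
proc sort data=orion.customer out=work.customer_nocase
          sortseq=linguistic(strength=primary);
   by Customer_Address;
run;

/* CASE_FIRST=UPPER sorts uppercase before lowercase */
proc sort data=orion.customer out=work.customer_upper
          sortseq=linguistic(case_first=upper);
   by Customer_Address;
run;
```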
Sample Code
Using the THREADS | NOTHREADS Option
options nothreads;
/* THREADS in the PROC SORT statement overrides the NOTHREADS system option */
proc sort data=orion.order_fact threads;
   by Order_Date;
run;
Using the CPUCOUNT= Option
options cpucount=5;
Using the SORTSIZE= Option
proc sort data=orion.order_fact sortsize=300M;
   by Order_Date;
run;
Using the SORTPGM=, SORTCUTP=, and SORTNAME= Options
options sortpgm=best sortcutp=40M sortname="syncsort";
Using the SORTEDBY= Option
filename M1 'mon1.dat'; * change the filepath as needed;
data january(sortedby=Order_Date);
   infile M1 dlm=',';
   input Customer_ID Order_ID Order_Type
         Order_Date : date9.
         Delivery_Date : date9.;
run;
proc contents data=january;
run;
Using the PRESORTED Option
proc sort data=orion.salesstaff presorted;
   by Emp_Hire_Date;
run;
Using the EQUALS | NOEQUALS Option
proc sort data=orion.customer
          out=customer_equals equals;
   by Country;
run;
proc print data=customer_equals(obs=10);
   var Customer_ID Country;
   title 'With EQUALS Option';
run;
Using the NODUPKEY and DUPOUT= Options
proc sort data=orion.salesstaff nodupkey
          out=oneemp
          dupout=extra;
   by Employee_ID;
run;
Using the SORTSEQ= Option with the NUMERIC_COLLATION=ON Collating Rule
proc sort data=orion.customer out=customer
          sortseq=linguistic(numeric_collation=on);
   by Customer_Address;
run;