SAUSAG 69 – 20 Feb 2014

Smarter Sorts: Beyond the Obvious

Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS)


Sorting – The Obvious First

Why sort?

“Data and information is almost always presented in a sorted or structured way”

Sorting – The Obvious First

    proc sort data=claims;
      by claim client;
    run;

It’s important to know your data:
• How many variables
• How many distinct data values for each

Sort puts your records in order, BY the values of the variables you list.

Sorting – Do You Need To?

    proc sort data=claims;
      by claim;
    run;
    proc tabulate ...;
      class claim;
      ...

An unnecessary SORT.

Some PROCs do their own sorting:
• TABULATE
• MEANS
• REPORT
• SQL (which can run out of memory for really big data sets)

Sorting – Do You Need To?

Only use PROC SORT before REPORT, TABULATE or MEANS if there’s another reason to have the data sorted later.

For PROC MEANS, substitute BY with CLASS, e.g.

    PROC MEANS NWAY; CLASS x y z;

is similar to

    PROC SORT; BY x y z;
    PROC MEANS; BY x y z;

and saves significant time by avoiding the SORT.
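The CLASS-versus-BY comparison can be tried side by side. The following is a minimal sketch only: it assumes the SASHELP.CLASS sample table is available and uses it as a stand-in for real data, and the output data set names (class_stats, class_sorted, by_stats) are made up for illustration.

```sas
/* CLASS approach: no prior sort is needed. NWAY keeps only the rows */
/* for the full sex*age combination, which matches the BY grouping.  */
proc means data=sashelp.class nway noprint;
  class sex age;
  var height;
  output out=class_stats mean=mean_height;
run;

/* BY approach: the input must be sorted first, costing an extra step. */
proc sort data=sashelp.class out=class_sorted;
  by sex age;
run;

proc means data=class_sorted noprint;
  by sex age;
  var height;
  output out=by_stats mean=mean_height;
run;
```

Both output tables should hold one mean per sex and age group; only the second route pays for a sort.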

Sort Only What You Need

Sort just the rows you want…

    proc sort data=claims out=Sorted_claims;
      where client =: 'A';
      by claim;
    run;

… and just the columns you want…

    proc sort data=claims(keep = c:) out=Sorted_claims;
      by claim;
    run;

Leaving out unwanted rows and columns can produce dramatic performance improvements.

Sorting – Proc Sort vs Proc SQL

    /* SORT Procedure */
    proc sort data=claims;
      by client claim;
    run;

    /* SQL Procedure (ORDER BY items are separated by commas) */
    proc sql;
      create table claims as
        select * from claims
        order by client, claim;
    quit;

Both will sort your data, with no significant performance difference. Choose according to clarity, functional requirement and efficiency. Make it as clear and simple as possible!

Sorted Status of a Data Set

    proc sort data=claims;
      by claim client;
    run;

Sort Information

    Sortedby       CLAIM CLIENT
    Validated      YES
    Character Set  ANSI

Sort status is saved as part of a SAS data set.

So SAS won’t waste time re-sorting if it’s already in the required order.
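One way to look at this stored sort information is PROC CONTENTS. A minimal sketch, assuming the SASHELP.CLASS sample table is available; the output name class_by_name is made up for illustration.

```sas
/* Sort a copy of the sample table so the sort indicator gets set. */
proc sort data=sashelp.class out=class_by_name;
  by name;
run;

/* The PROC CONTENTS listing includes a "Sort Information" section */
/* for data sets that carry a sort indicator.                      */
proc contents data=class_by_name;
run;
```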

Setting Sorted Status of a Data Set

    data client_claims (sortedby = client);
      merge clients claims;
      by client;
    run;

Sort Information

    Sortedby       CLIENT
    Validated      NO
    Character Set  ANSI

If you know a data set is sorted, say so with the SORTEDBY= option!

So SAS won’t waste time re-sorting later.

Presorted or Notsorted

    proc sort data=claims out=sorted presorted;
      by claim;
    run;

Use the PRESORTED option when the data is probably already sorted! SAS will check and only sort if necessary.

    proc print data=grouped_claims;
      by claim NOTSORTED;
    run;

There is no need to sort if the data is already grouped BY the required variable. It doesn’t matter that the groups are not in sorted order (you just have to say so with NOTSORTED).
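A small sketch of NOTSORTED in action. The table grouped_demo and its values are invented for illustration: the rows for each claim sit together, but the claims themselves are not in ascending order.

```sas
/* Hypothetical data: grouped by claim, but the groups are out of order. */
data grouped_demo;
  input claim $ amount;
  datalines;
C2 100
C2 250
C1 75
C1 30
C3 10
;
run;

/* Without NOTSORTED this BY statement would fail, because C1 follows C2. */
/* With NOTSORTED, each run of identical claim values is one BY group.    */
proc print data=grouped_demo;
  by claim notsorted;
run;
```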

Sorting and Maintaining Order

    proc sort data=claims;
      by claim;
    run;

By default, SAS maintains the original order of records within a BY group.

    proc sort data=claims noequals;
      by claim;
    run;

Using the NOEQUALS option means SAS won’t necessarily retain the original ordering.

This is more efficient, but it directly affects the results of using NODUPKEY.

Sorting Duplicates

    proc sort data=claims out=no_duplicates nodupkey;
      by claim;
    run;

    proc sort data=claims out=no_duplicates
              dupout=dups nodupkey;
      by claim;
    run;

NODUPKEY effectively keeps the first record of any duplicates.

DUPOUT= writes the duplicates to a separate table.
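To see which rows land where, here is a minimal sketch on invented data. The table claims_demo, its values and the output names are all hypothetical.

```sas
/* Hypothetical data: claims C1 and C2 each appear twice. */
data claims_demo;
  input claim $ client $;
  datalines;
C1 A
C2 B
C1 C
C3 D
C2 E
;
run;

/* With the default EQUALS behaviour, the first row read for each claim */
/* is kept in OUT=; the later rows with the same key go to DUPOUT=.     */
proc sort data=claims_demo out=no_dups_demo dupout=dups_demo nodupkey;
  by claim;
run;

/* no_dups_demo: C1 A, C2 B, C3 D      dups_demo: C1 C, C2 E */
```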

Separating Unique & Duplicate Rows

    proc sort data=claims out=sorted;
      by claim;
    run;

    data unique_claims dup_claims;
      set sorted;
      by claim;
      if first.claim and last.claim then output unique_claims;
      else output dup_claims;
    run;

It works, but needs an extra pass of the data.

Separating Unique & Duplicate Rows – the smarter way

    proc sort data=claims out=duplicates
              uniqueout=uniques nouniquekey;
      by claim;
    run;

NOUNIQUEKEY ensures no records with a unique key are written to the OUT= table…

…and the UNIQUEOUT= option directs the unique records to a separate table.
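A minimal sketch of the split on invented data. The table claims_demo2, its values and the output names are hypothetical; unlike NODUPKEY, every row of a repeated key goes to the OUT= table.

```sas
/* Hypothetical data: C1 and C2 are repeated, C3 occurs once. */
data claims_demo2;
  input claim $ client $;
  datalines;
C1 A
C2 B
C1 C
C3 D
C2 E
;
run;

/* One pass: rows whose key occurs more than once go to OUT=, */
/* rows whose key occurs exactly once go to UNIQUEOUT=.       */
proc sort data=claims_demo2 out=dups_demo2
          uniqueout=uniques_demo2 nouniquekey;
  by claim;
run;

/* dups_demo2: C1 A, C1 C, C2 B, C2 E      uniques_demo2: C3 D */
```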

Sorting – Case Insensitive

    proc sort data=names out=simply_sorted;
      by name;
    run;

Upper case letters sort before lower case letters in the ASCII collating sequence.

    data names2;
      set names;
      upcase_name = upcase(name);
    run;

    proc sort data=names2 out=upcase_sorted(keep=name);
      by upcase_name;
    run;

Creating an upper (or lower) case copy of the variable is the old solution.

Sorting – Case Insensitive - Smarter

    proc sort data=names out=linguistic_sorted sortseq=linguistic;
      by name;
    run;

The SORTSEQ= option specifies the collating sequence (ASCII, EBCDIC or other languages); alternatively, its LINGUISTIC value modifies the current collating sequence.

The effect is to make the sort case insensitive.
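A small sketch for trying this out. The table names_demo and its values are invented for illustration, and the exact tie-break between two spellings that differ only in case is not asserted here; it depends on the collation settings in effect.

```sas
/* Hypothetical mixed-case data. */
data names_demo;
  length name $ 20;
  input name $;
  datalines;
banana
Apple
cherry
apple
Banana
;
run;

/* Plain sort: all capitalised names come before all lower-case names. */
proc sort data=names_demo out=plain_demo;
  by name;
run;

/* Linguistic sort: names are ordered by letter first, so the two */
/* spellings of each word should end up next to each other.       */
proc sort data=names_demo out=linguistic_demo sortseq=linguistic;
  by name;
run;

proc print data=linguistic_demo;
run;
```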

Sorting – Case Insensitive – with SQL

    proc sql;
      create table sql_sorted as
        select * from names
        order by upcase(name);
    quit;

PROC SQL allows the use of functions in the ORDER BY (and other) clauses.

The result is different from PROC SORT using SORTSEQ=LINGUISTIC.

Sorting Out Spaces

    proc sort data=names out=simply_sorted;
      by name;
    run;

A standard sort is obviously no use.

    data names_temp;
      set names;
      temp_name = upcase(compress(name));
    run;

    proc sort data=names_temp out=temp_sorted(keep=name);
      by temp_name;
    run;

Creating another variable for sorting, without spaces, is the old solution.

Sorting Out Spaces

Proc SQL can do it too.

    proc sql;
      create table sql_sorted as
        select * from names
        order by upcase(compress(name));
    quit;

Proc SORT can too! The ALTERNATE_HANDLING=SHIFTED sub-option of the LINGUISTIC sortseq option effectively ignores spaces as well as being case insensitive.

    proc sort data=names out=alt_handling_sorted
              sortseq = linguistic(alternate_handling = shifted);
      by name;
    run;

Sorting by Numbers

    proc sort data=students out=simply_sorted;
      by student;
    run;

Sorting text with numeric prefixes e.g. student id and name …

… results in nothing useful!

Sorting by Numbers

An extra data step can create a numeric variable to sort with (as can SQL of course)

    data students_temp;
      set students;
      student_num = input(scan(student,1), 2.);
    run;

    proc sort data=students_temp out=temp_sorted(keep=student);
      by student_num;
    run;

    proc sql;
      create table sql_sorted as
        select * from students
        order by input(scan(student,1), 2.);
    quit;

Sorting by Numbers

The NUMERIC_COLLATION sub-option of the LINGUISTIC sortseq option sorts by the numeric values that prefix the variable values.

    proc sort data=students out=num_collation_sorted
              sortseq = linguistic(numeric_collation=on);
      by student;
    run;
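A minimal sketch comparing the two orderings. The table students_demo and its values are invented for illustration.

```sas
/* Hypothetical data: a numeric id followed by a name, in one text column. */
data students_demo;
  length student $ 20;
  input student $ 1-20;
  datalines;
10 Jones
2 Smith
1 Brown
21 Adams
;
run;

/* Plain character sort: compares character by character, */
/* so "10 Jones" is placed ahead of "2 Smith".            */
proc sort data=students_demo out=plain_students_demo;
  by student;
run;

/* Numeric collation: digit runs are compared as numbers, */
/* so "2 Smith" should be placed ahead of "10 Jones".     */
proc sort data=students_demo out=num_collation_demo
          sortseq=linguistic(numeric_collation=on);
  by student;
run;
```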

Questions? Did you learn something new from this presentation?
