identifying duplicates made easy - old.wuss.orgold.wuss.org/proceedings16/84_final_paper_pdf.pdf ·...

Identifying Duplicates Made Easy

Elizabeth Guerrero Angel, University of California, Davis, CA

Yunin Ludena, University of California, Davis, CA

ABSTRACT

Have you ever had trouble removing or finding the exact type of duplicate you want? SAS® offers several different ways to identify, extract, and/or remove duplicates, depending on exactly what you want. We will start by demonstrating perhaps the most commonly used method, PROC SORT, and the types of duplicates it can identify and how to remove, or store them. Then, we will present the other less commonly used methods which might give information that PROC SORT cannot offer, including the data step (FIRST./LAST.), PROC SQL, and PROC FREQ. The programming is demonstrated at a beginner’s level.

INTRODUCTION

Working with a wide variety of data, from statewide files with millions of records, to local data sets with thousands of observations, but easily over 10,000 raw variables, we have learned that identifying and removing duplicates is absolutely essential. It is also important to understand where these duplicates came from, whether it be a programming mistake or data entry duplication, and which of the duplicates the true value is. We will present in this paper four different ways of identifying duplicates. We will start with perhaps the most common method, PROC SORT, where we also will show how convenient it is to output and delete duplicates. Then, we will look at what the other methods have to offer to identify duplicates, including the PROC SORT/FIRST.ID and LAST.ID method, PROC SQL, and PROC FREQ. We will introduce the basics of each of these methods, along with our example data set, to better illustrate how these methods work.

THE BASIC PROC SORT TO IDENTIFY DUPLICATES

One of the main tasks performed by PROC SORT is rearranging data. PROC SORT orders observations by the values of one or more character or numeric variables in the BY statement. A BY statement tells SAS how the order or arrangement should be for the observations in a data set. The variables in the BY statement are called BY variables. The BY statement must appear in the SORT procedure. There are two keywords for the sort order that can be used in the BY statement: ASCENDING and DESCENDING. ASCENDING (lowest to highest values) is the default sort order and DESCENDING reverses the sort order (highest to lowest values). DESCENDING applies only to the variable which immediately follows. Missing values are treated as the lowest value possible for numeric and character variables.

By using the DATA= and OUT= options together, you can keep both the unsorted and the sorted versions of your data. PROC SORT may be the fastest way to output sorted data sets. If OUT= is not used, the original, unsorted data set is replaced by the sorted data.

Here is a basic form of this procedure:

proc sort data=example out=best;
   by <descending> variable-1 … variable-n;
run;

EXAMPLE

The following code creates the Example data set. Then, the Example data set is sorted by ascending ID values and within each value of ID by descending AGE value. The sorted data set is named Best.

data example;
   input ID fname $ lname $ age gender $;
   datalines;
1 John Smith 37 M
2 Adam Thompson 33 M
3 Sophia Rose 20 F
4 Caleb Guerrero 30 M
5 Maria Rose 27 F
2 Adam Thompson 42 M
3 Sophia Rose 20 F
;
run;

proc sort data=example out=best;
   by id descending age;
run;

proc print data=best noobs;
run;

Output 1 shows the result from PROC PRINT:

ID  Fname   Lname     Age  Gender
1   John    Smith     37   M
2   Adam    Thompson  42   M
2   Adam    Thompson  33   M
3   Sophia  Rose      20   F
3   Sophia  Rose      20   F
4   Caleb   Guerrero  30   M
5   Maria   Rose      27   F

Output 1. Output from PROC PRINT Statement for DATA

In Output 1, the “best” data set has two observations for ID 3 that are exactly the same across all variables, and for ID 2 the observations are the same for all variables except for age. Once the duplicates are identified, you are ready to remove them by adding the NODUP or the NODUPKEY options to the PROC SORT statement.

REMOVING DUPLICATES IN PROC SORT: NODUP VS NODUPKEY

The NODUP (NODUPRECS or NODUPLICATES) and NODUPKEY options compare consecutive observations in the sorted data set. If a match is found, the later observation is not written to the output data set, so the first observation of each set of duplicates is kept and the rest are eliminated. The two options differ in what they compare:

NODUP compares consecutive observations across all the variables in your data set. Nonconsecutive observations that are duplicates may not be detected.

NODUPKEY compares consecutive observations only across the BY variables in your data set.

The DUPOUT= option creates a data set with the observations that are not written to the sorted output data set, the duplicates.

proc sort data=example <nodup/nodupkey> out=justoneobs dupout=thedupsonly;
   by id;
run;

proc print data=justoneobs noobs; run;

Table 1 shows the different notes and the outputs from PROC PRINT using NODUP and NODUPKEY options:

NODUP

NOTE: 1 duplicate observations were deleted.

ID  Fname   Lname     Age  Gender
1   John    Smith     37   M
2   Adam    Thompson  33   M
2   Adam    Thompson  42   M
3   Sophia  Rose      20   F
4   Caleb   Guerrero  30   M
5   Maria   Rose      27   F

NODUPKEY

NOTE: 2 observations with duplicate key values were deleted.

ID  Fname   Lname     Age  Gender
1   John    Smith     37   M
2   Adam    Thompson  33   M
3   Sophia  Rose      20   F
4   Caleb   Guerrero  30   M
5   Maria   Rose      27   F

Table 1. NODUP and NODUPKEY: Notes from the Log and Outputs from PROC PRINT

Notice that the second observation for ID 3 was eliminated using NODUP because all the variable values were the same. However, the second observation for ID 2 was not eliminated because the age differs between these two observations. With NODUPKEY, on the other hand, the second observations for IDs 2 and 3 were both eliminated because only the BY variable, ID, is compared.

The better option for removing duplicates depends on the data and on what is important to keep. In Table 1, the correct age for ID 2 is 42. By using NODUPKEY instead of NODUP, you keep the first observation for ID 2 (age 33) and eliminate the second observation, which holds the correct age.
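One way around this trade-off is to order the data so that the preferred observation comes first within each ID, and only then apply NODUPKEY. The following is a minimal sketch, not a definitive recipe. It assumes the Example data set above is in the WORK library and that the highest age is the value to keep. The data set names Ordered, KeepPreferred, Dropped, and ExactUnique are our own illustrative choices. The last step is a separate idea: it uses BY _ALL_ so that exact duplicates become adjacent before they are compared, which addresses the nonconsecutive case noted for NODUP.

```sas
/* Step 1: put the preferred record (assumed here: highest age) first within each ID */
proc sort data=example out=ordered;
   by id descending age;
run;

/* Step 2: keep the first record per ID. EQUALS (the default) preserves the
   relative order of observations within each BY group from Step 1. */
proc sort data=ordered out=keeppreferred dupout=dropped nodupkey equals;
   by id;
run;

/* Separate idea: remove only exact duplicates by using every variable as a key */
proc sort data=example out=exactunique nodupkey;
   by _all_;
run;
```

With the Example data, we would expect KeepPreferred to hold one observation per ID, with age 42 for ID 2, and ExactUnique to drop only the repeated record for ID 3.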

FINDING THE DUPLICATES – OTHER METHODS

In addition to PROC SORT, there are other methods within SAS we can use to find duplicates. In some cases these other methods are a better route and provide more flexibility in deciding what to do with the duplicates.

PROC SORT AND THE DATA STEP

Sometimes we want to look at our duplicates with greater scrutiny, or perhaps select our observations based on certain criteria. This is when you will want to use this technique.

The first step is to sort your data by your ID, as well as by any other variables you want to look at. For example, if you want to keep the observations with the most complete data, include those other variables in your BY statement.

proc sort data=example;
   by ID descending age;
run;

Once the data is sorted by the ID variable and any other variables of interest, we will use the FIRST.ID and LAST.ID variables from SAS to check for duplicates. FIRST.ID has a value of 1 ("true") when the observation is the first occurrence of an ID value and 0 ("false") otherwise. Similarly, LAST.ID is 1 when the observation is the last occurrence of an ID value and 0 otherwise. Depending on what you want to do with the data, you have some options on how to run your data step. It is important to include the BY statement with your ID in the data step, because SAS creates FIRST.ID and LAST.ID only for BY variables.

data justoneobs2 thedupsonly2 obstocheck;
   set example;
   by ID;
   /* This line will only output the first occurrence of the ID in question */
   if first.ID then output justoneobs2;
   /* This line will only output the subsequent occurrences of the ID */
   if not first.ID then output thedupsonly2;
   /* This line will output all observations that have more than one occurrence
      of the ID. If the ID does not have a duplicate in the data set, it will
      not be output. */
   if first.ID = 0 or last.ID = 0 then output obstocheck;
run;

proc print data=justoneobs2 noobs; run;

Output 2 is the output of the first data set, “justoneobs2”.

ID  Fname   Lname     Age  Gender
1   John    Smith     37   M
2   Adam    Thompson  42   M
3   Sophia  Rose      20   F
4   Caleb   Guerrero  30   M
5   Maria   Rose      27   F

Output 2. Output from PROC PRINT Statement for DATA

The “justoneobs2” data set removes all the appropriate duplicates and includes just one observation per ID. However, if we had not included the DESCENDING keyword in the BY statement of the PROC SORT, we would have selected the incorrect duplicate for Adam Thompson. Let’s take a look at the other data sets we created.

proc print data=thedupsonly2 noobs; run;

Output 3 is the output for the data set, “thedupsonly2”.

ID  Fname   Lname     Age  Gender
2   Adam    Thompson  33   M
3   Sophia  Rose      20   F

Output 3. Output from PROC PRINT Statement for DATA

The “thedupsonly2” data set contains only the duplicates removed from the “justoneobs2” data set. Here we can see the two observations that were set aside, and this data set is worth reviewing to confirm which duplicates were the ones removed.

proc print data=obstocheck noobs; run;

Output 4 is the output for the data set, “obstocheck”.

ID  Fname   Lname     Age  Gender
2   Adam    Thompson  42   M
2   Adam    Thompson  33   M
3   Sophia  Rose      20   F
3   Sophia  Rose      20   F

Output 4. Output from PROC PRINT Statement for DATA

The data set “obstocheck” outputs all observations that have duplicates, including the correct observation and its corresponding duplicate. If the observation does not have a duplicate, it is not output into the data set. We can see that this data set would be useful in determining which observation among the duplicates is the correct one. For our ID 3, it does not matter, since the observations are identical, but for ID 2, we can see the error in age, and correctly choose the appropriate observation.
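The same FIRST./LAST. logic can also number the occurrences within each ID instead of splitting them into separate data sets. The following is a minimal sketch, assuming the Example data set is available in the WORK library. The names Sorted, Counted, dup_seq, and in_dup_group are hypothetical and chosen only for illustration.

```sas
proc sort data=example out=sorted;
   by id;
run;

data counted;
   set sorted;
   by id;
   /* Restart the counter at the first occurrence of each ID */
   if first.id then dup_seq = 0;
   /* Sum statement: dup_seq is retained from one observation to the next */
   dup_seq + 1;
   /* 1 when the ID occurs more than once, 0 when it is unique */
   in_dup_group = not (first.id and last.id);
run;
```

Keeping everything in one data set with a flag may be convenient when we want to review the duplicates alongside the unique records before deciding what to delete.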

A SIMPLE USE FOR PROC SQL

While, at first, PROC SQL may seem intimidating, using its power to identify duplicates is simple and easy to learn. Using this method, PROC SQL will use the COUNT summary function to count the number of times each ID appears in the data set.

proc sql;
   create table FINDDUPS as
   select *, count(ID) as ID_COUNT
   from example
   group by ID;
quit;

proc print data=FINDDUPS noobs; run;

The CREATE TABLE statement creates our output data set. Without this statement, SAS will print the results of the query, but there will not be a data set resulting from it. The SELECT statement creates our variable of interest (using the COUNT function), while also selecting the other variables we want to be included as output. In this case, we use the asterisk, which includes all variables from the data set. The FROM clause tells the SQL procedure which data set to pull the data from. The GROUP BY clause is vital, as it allows us to count within the same ID. Below is the output of our code.

ID  Fname   Lname     Age  Gender  ID_COUNT
1   John    Smith     37   M       1
2   Adam    Thompson  42   M       2
2   Adam    Thompson  33   M       2
3   Sophia  Rose      20   F       2
3   Sophia  Rose      20   F       2
4   Caleb   Guerrero  30   M       1
5   Maria   Rose      27   F       1

Output 5. Output from PROC PRINT Statement for DATA

We can easily determine from the table that the observations with duplicates have an ID_COUNT value of 2 or more.
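If only the duplicated IDs are of interest, the query can filter on the count directly. The sketch below is one possible variation, assuming the Example data set is in the WORK library. The table name DUPSONLY_SQL and the ORDER BY choice are our own assumptions, not part of the method above.

```sas
proc sql;
   create table dupsonly_sql as
   select *, count(*) as id_count
   from example
   group by id
   having count(*) > 1   /* keep only IDs that occur more than once */
   order by id, age desc;
quit;
```

Because the query selects detail columns together with a summary function, SAS will typically write a note to the log about remerging summary statistics with the original data. That is expected for this kind of query.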

PROC FREQ CAN HELP TOO

Perhaps not the most commonly used method, PROC FREQ can offer a simple, straightforward method to identify duplicates.

proc freq data=example noprint;
   table ID / out=FreqDups (keep=ID count);
run;

proc print data=FreqDups noobs; run;

Using the OUT= option, the frequency table is written to a data set containing the ID variable and the COUNT variable. The COUNT variable tells us how many times each unique ID appears in the original data set. Below is the output for our code:

ID  Count
1   1
2   2
3   2
4   1
5   1

Output 6. Output from PROC PRINT Statement for DATA

One major downside to using PROC FREQ to find duplicates is that, although it is easy and straightforward, the output data set contains only the counts. To see exactly which observations from our data set are the duplicates, it is necessary to merge the resulting data set with our original data.
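A minimal sketch of that merge follows, assuming the Example data set is in the WORK library. The names Example_Sorted and Dups_With_Detail are hypothetical. The original data must be sorted by ID before the merge, while the PROC FREQ output is already in ascending ID order.

```sas
proc sort data=example out=example_sorted;
   by id;
run;

proc freq data=example_sorted noprint;
   tables id / out=freqdups (keep=id count);
run;

/* Attach each ID's frequency to its detail rows and keep only repeated IDs */
data dups_with_detail;
   merge example_sorted freqdups;
   by id;
   if count > 1;
run;
```

The subsetting IF keeps every observation whose ID occurs more than once, so the result should resemble the “obstocheck” data set with the COUNT variable added.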

CONCLUSION

These methods are some basic techniques to keep in mind when looking at data. Depending on the size and complexity of the data, some methods may be more insightful and practical than others. PROC SORT has proven, in most cases, to be the easiest to use and fastest procedure to find and eliminate duplicates. However, among the multiple ways available to identify duplicates, we’ve found PROC SQL especially useful when dealing with very large data sets, since it can, for example, point out where some data has been entered even more than twice.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Elizabeth Guerrero Angel
University of California, Davis
1 Shields Ave.
Davis, CA 95616
(530) 752-8863
[email protected]

Yunin Ludena
University of California, Davis
1616 Da Vinci Court
Davis, CA 95616
(530) 752-9321
[email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.