Chapter 1: Introduction to SAS (2/2)
Junshu Bao
University of Pittsburgh
Table of contents
1.5 Modifying SAS Data
1.6 Proc Step
1.7 Global Statements
* More about Modifying and Combining Data Sets
1.8 SAS Graphics
1.5 Modifying SAS Data
Creating and Modifying Variables
The assignment statement can be used both to create new variables and to modify existing ones. The basic form is
variable = expression;
For example,
weightloss=startweight-weightnow;
startweight=startweight*0.4536;
SAS has the usual set of arithmetic operators: +, -, / (divide), * (multiply), and ** (exponentiation), plus various arithmetic, mathematical and statistical functions.
Assignment Statements
Here are examples of basic types of assignment statements:
Type of expression           Assignment statement
numeric constant             NewVar = 10;
character constant           NewVar = 'ten';
a variable                   NewVar = OldVar;
a function of variable(s)    NewVar = function(OldVariable);
Whether the variable NewVar is numeric or character depends on the expression that defines it.
Example: Survey of Home Gardeners
Gardeners were asked to estimate the number of pounds they harvested for four crops: tomatoes, zucchini, peas, and grapes.
Gregor  10  2 40    0
Molly   15  5 10 1000
Luther  50 10 15   50
Susan   20  0  .   20
The following program reads the data and then modifies the data.
DATA homegarden;
INFILE 'c:\MyRawData\Garden.dat';
INPUT Name $ 1-7 Tomato Zucchini Peas Grapes;
Zone = 14;
Type = 'home';
Zucchini = Zucchini * 10;
Total = Tomato + Zucchini + Peas + Grapes;
PerTom = (Tomato / Total) * 100;
RUN;
See SAS program and output.
Missing Values
- The result of an arithmetic operation performed on a missing value is itself a missing value.
- Missing values for numeric variables are represented by a period.
- A numeric variable can be set to a missing value by an assignment statement such as:
age = .;
- A missing value may be assigned to a character variable as follows:
team = ' ';
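The first point matters in the garden example above, where Susan's value for peas is missing, so her Total computed with + is missing too. The following is a minimal sketch of the difference between + and the SUM function. SUM is a standard SAS function that is not covered on these slides, and the data set name missingdemo and the values are made up for illustration.

```sas
/* Sketch only: data set name and values are assumptions, not from the slides */
DATA missingdemo;
   Peas = .;                          /* numeric missing value              */
   Grapes = 20;
   PlusTotal = Peas + Grapes;         /* missing: + propagates missing      */
   SumTotal = SUM(Peas, Grapes);      /* 20: SUM skips missing arguments    */
RUN;

PROC PRINT DATA = missingdemo;
RUN;
```

Which behavior is wanted depends on the analysis: a missing total flags incomplete records, whereas SUM quietly treats the missing crop as contributing nothing.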
Using SAS Functions
SAS has hundreds of functions in general areas including:
Character        Character String Matching   Date and Time
Distance         Financial                   Descriptive Statistics
Macro            Mathematical                Probability
Random Number    State and Zip Code          Variable Information
For example,
AvgScore = MEAN(Scr1, Scr2, Scr3, Scr4, Scr5);
DayEntered = DAY(Date);
Type = UPCASE(Type);
- The MEAN function returns the mean of non-missing arguments.
- The DAY function returns the day of the month.
- The UPCASE function transforms the variable values to uppercase.
* SAS is case sensitive when it comes to variable values; a 'd' is not the same as 'D'.
Using IF-THEN Statements
Frequently, you want an assignment statement to apply to some observations, but not all. This is called conditional logic and you do it with IF-THEN statements:
IF condition THEN action;
Example: IF Model='Mustang' THEN Make='Ford';
Here are the basic comparison operators:
Symbolic     Mnemonic   Meaning
=            EQ         equals
^= and ~=    NE         not equal
>            GT         greater than
<            LT         less than
>=           GE         greater than or equal
<=           LE         less than or equal
The IN operator also makes comparisons. Here is an example:
IF Model IN ('Corvette', 'Camaro') THEN Make='Chevrolet';
DO-END Keywords
A single IF-THEN statement can only have one action. If you add the keywords DO and END, then you can execute more than one action. The basic form is as follows:
IF condition THEN DO;
action1;
action2;
END;
For example,
IF Model='Mustang' THEN DO;
Make='Ford';
Size='compact';
END;
Specifying Multiple Conditions
You can also specify multiple conditions with the keywords AND and OR:
IF condition1 AND condition2 THEN action;
For example,
IF Model='Mustang' AND Year<1975 THEN Status='classic';
Like the comparison operators, AND and OR may be symbolic or mnemonic:
Symbolic   Mnemonic   Meaning
&          AND        all comparisons must be true
| or !     OR         at least one comparison must be true
Example
The following data about used cars contain values for model, year, make, number of seats, and color:
Corvette 1955 . 2 black
XJ6 1995 Jaguar 2 teal
Mustang 1966 Ford 4 red
Miata 2002 . . silver
CRX 2001 Honda 2 black
Camaro 2000 . 4 red
We will fill in missing data, and create a new variable, Status.
DATA sportscars;
INFILE 'c:\MyRawData\UsedCars.dat';
INPUT Model $ Year Make $ Seats Color $;
IF Year < 1975 THEN Status = 'classic';
IF Model = 'Corvette' OR Model = 'Camaro' THEN Make = 'Chevy';
IF Model = 'Miata' THEN DO;
Make = 'Mazda';
Seats = 2;
END;
RUN;
Grouping Observations with IF-THEN/ELSE
One common use of IF-THEN statements is for grouping observations. By adding the keyword ELSE to your IF statements, you can tell SAS that these statements are related.
IF-THEN/ELSE logic takes this basic form:
IF condition1 THEN action1;
ELSE IF condition2 THEN action2;
ELSE IF condition3 THEN action3;
... ...
ELSE action;
The last ELSE statement contains just an action. An ELSE of this kind becomes a default which is automatically executed for all observations failing to satisfy any of the previous IF statements. For example,
IF Cost = . THEN CostGroup = 'missing';
ELSE IF Cost < 2000 THEN CostGroup = 'low';
ELSE IF Cost < 10000 THEN CostGroup = 'medium';
ELSE CostGroup = 'high';
* SAS considers missing values to be smaller than non-missing values.
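The note on missing values is the reason the Cost = . test comes first in the example above. A small sketch of what a bare Cost < 2000 comparison does with a missing value follows. The data set name costdemo, the helper variable NaiveLow and the Cost values are all made up for illustration.

```sas
/* Sketch only: names and data values are assumptions, not from the slides */
DATA costdemo;
   INPUT Cost;
   LENGTH CostGroup $ 7;              /* room for 'missing', the longest value */
   IF Cost = . THEN CostGroup = 'missing';
   ELSE IF Cost < 2000 THEN CostGroup = 'low';
   ELSE IF Cost < 10000 THEN CostGroup = 'medium';
   ELSE CostGroup = 'high';
   /* A comparison evaluates to 1 (true) or 0 (false); a missing Cost */
   /* counts as smaller than 2000, so NaiveLow is 1 for the first row */
   NaiveLow = (Cost < 2000);
   DATALINES;
.
1500
4000
25000
;
RUN;

PROC PRINT DATA = costdemo;
RUN;
```

If the first IF were dropped, the observation with a missing cost would fall into the 'low' group.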
Simplifying Programs with Arrays
When the same operation is to be carried out on several variables, it is often convenient to use an array and an iterative do loop in combination.
For example, suppose you have 20 variables, q1 to q20, for which "not applicable" has been coded -1, and you wish to set those to missing values. You might do it as follows:
array qall{20} q1-q20;
do i = 1 to 20;
if qall{i} = -1 then qall{i} = . ;
end;
The array statement defines an array by specifying the name of the array, 'qall' here, the number of variables to be included in it in braces and the list of variables to be included.
* All the variables in the array must be of the same type, that is, all numeric or all character.
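The array statements above only make sense inside a data step. A minimal complete sketch is given here, using three questions instead of twenty to keep the data short. The data set name survey, the id variable and the data values are made up for illustration.

```sas
/* Sketch only: names and data values are assumptions, not from the slides */
DATA survey;
   INPUT id q1-q3;
   ARRAY qall{3} q1-q3;
   DO i = 1 TO 3;
      IF qall{i} = -1 THEN qall{i} = .;   /* recode "not applicable" */
   END;
   DROP i;             /* otherwise the loop counter is kept as a variable */
   DATALINES;
1 4 -1 2
2 -1 -1 5
;
RUN;

PROC PRINT DATA = survey;
RUN;
```

The array itself is not stored in the data set; it is only a temporary way of referring to q1, q2 and q3 by position during the data step.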
Deleting Variables
Variables may be removed from the data set being created by using the drop and keep statements.
- The drop statement names a list of variables that are to be excluded from the data set. For example:
data gradebook_final;
set gradebook;
drop quiz5;
run;
- The keep statement names a list of variables that are to be the only ones retained in the data set. For example:
data gradebook_final;
set gradebook;
keep quiz1 quiz2 quiz3 quiz4;
run;
Deleting Observations
It may be necessary to delete observations from the data set, either because they contain errors or because the analysis is to be carried out on a subset of the data.
- Deleting erroneous observations is best done by using the if-then statement with the delete statement. For example,
if weightloss>startweight then delete;
- In the case above, it would also be useful to write out a message giving more information about the observation that contains the error.
if weightloss>startweight then do;
put 'Error in weight data' idno= startweight= weightloss=;
delete;
end;
The put statement writes text (in quotes) and the values of variables to the log.
Subsetting Data Sets
When the analysis is to be carried out on a subset of the data, the observations of interest can be selected with the subsetting if statement in a data step. Only observations for which the condition is true are written to the new data set.
For example,
data women;
set bodyfat;
if sex = 'F';
run;
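The subsetting if and the delete statement are two ways of saying the same thing: one names the observations to keep, the other the observations to drop. A sketch of the two side by side is given here. The three rows of bodyfat data are made up so that the example runs on its own, and the data set name women2 is hypothetical.

```sas
/* Sketch only: a made-up stand-in for the bodyfat data set */
data bodyfat;
   input age pctfat sex $;
   datalines;
23 9.5 M
39 31.4 F
58 33.8 F
;
run;

/* Subsetting if: keep observations for which the condition is true */
data women;
   set bodyfat;
   if sex = 'F';
run;

/* Same result with delete: drop observations that fail the condition */
data women2;
   set bodyfat;
   if sex ne 'F' then delete;
run;
```

Both women and women2 contain the two observations with sex equal to 'F'.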
1.6 Proc Step
Proc Statement
- Once data have been read into a SAS data set, SAS procedures can be used to analyze the data.
- The proc step is a block of statements that specify the data set to be analyzed, the procedure to be used and any further details of the analysis.
- The proc statement names the procedure to be used and may also specify options for the analysis.
The most important option is the data= option, which names the data set to be analyzed. If the option is omitted, the procedure uses the most recently created data set.
Var Statement
The var statement specifies the variables that are to be processed by the proc step. For example,
proc print data = SlimmingClub;
var name team weightloss;
run;
restricts the printout to the three variables mentioned, whereas the default would be to print all variables.
Where Statement
The where statement selects the observations to be processed. The keyword where is followed by a logical condition, and only those observations for which the condition is true are included in the analysis.
proc print data = SlimmingClub;
where weightloss>0;
run;
only prints out observations with positive weight loss.
By Statement
The by statement is used to process the data in groups.
- The observations are grouped according to the values of the variable named in the by statement, and a separate analysis is conducted for each group.
- The data set must first be sorted on the by variable.
proc sort data=SlimmingClub;
by team;
proc means;
var weightloss;
by team;
run;
Class Statement
The class statement is used with many procedures to name variables that are to be used as classification variables, or factors.
The variables named may be character or numeric variables and will typically contain a relatively small range of discrete values. For example,
proc logistic data=ghq;
class sex;
model cases/total=sex ghq;
run;
1.7 Global Statements
Global Statements (1) Title
Global statements may occur at any point in a SAS program and remain in effect until reset. The title statement is a global statement and provides a title that will appear on each page of printed output and each graph until reset. An example would be
title 'Analysis of Slimming Club Data';
- The text of the title must be enclosed in quotes.
- Multiple lines of titles can be specified with the title2 statement for the second line, title3 for the third line, and so on up to 10.
- The title statement is synonymous with title1.
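Because titles stay in effect until reset, it is easy to carry an old title into later output by accident. The sketch below sets two title lines, prints a listing, and then clears the titles. The three rows of SlimmingClub data and the text of the second title are made up for illustration.

```sas
/* Sketch only: a made-up stand-in for the SlimmingClub data set */
data SlimmingClub;
   input name $ team $ weightloss;
   datalines;
Ann red 2.5
Bob blue -0.5
Cy red 1.0
;
run;

title 'Analysis of Slimming Club Data';
title2 'Listing of all members';     /* second title line */
proc print data = SlimmingClub;
run;

title;    /* a title statement with no text cancels all title lines */
```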
Global Statements (2) Comments
Comment statements are global statements in the sense that they can occur anywhere. There are two forms of comment statement.
- The first form begins with an asterisk and ends with a semicolon, for example,
* this is a comment;
- The second form begins with /* and ends with */:
/* this is also a
comment
*/
Comments may appear on the same line as a SAS statement, for example
bmi=weight/height**2; /* Body Mass Index */
Global Statements (3) Options
The options statement is used to set SAS system options. Most of these can be safely left at their default values. Some useful options are:
- Nocenter aligns the output at the left, rather than centering it on the page.
- Nodate suppresses printing of the date and time on the output.
- Pageno=n sets the page number for the next page of output. Alternatively, nonumber turns page numbering off.
For example
options nodate nocenter nonumber;
* More about Modifying and Combining Data Sets
Concatenating Data Sets - Adding Observations
The set statement can be used to concatenate or stack the data sets one on top of the other.
This is useful when you want to combine data sets with all or most of the same variables but different observations. The basic form is:
data new-dataset;
set dataset1 dataset2;
run;
- The number of observations in the new data set will equal the sum of the number of observations in the old data sets.
- If one of the data sets has a variable not contained in the other data sets, then the observations from the other data sets will have missing values for that variable.
Example: Stacking Data Sets
The Fun Times Amusement Park has two entrances where they collect data about their customers.
South Entrance Data:
Entrance  Pass Number  Size of Party  Age
S         43           3              27
S         44           3              24
S         45           3              2
North Entrance Data:
Entrance  Pass Number  Size of Party  Age  Parking Lot
N         21           5              41   1
N         87           4              33   3
N         65           2              67   1
N         66           2              7    1
Note that the north entrance data set has one more variable, parking lot number. The south entrance only has one parking lot, so no lot number is recorded there.
Example: Stacking Data Sets (cont.)
Suppose we would like to combine the data of the two entrances and create a new variable, AmountPaid, which tells how much each customer paid based on their age.
DATA both;
SET southentrance northentrance;
IF Age = . THEN AmountPaid = .;
ELSE IF Age < 3 THEN AmountPaid = 0;
ELSE IF Age < 65 THEN AmountPaid = 35;
ELSE AmountPaid = 27;
RUN;
See SAS program and output.
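The referenced program is not reproduced in this transcript. A self-contained sketch that reads the two tables shown earlier with DATALINES and then stacks them is given here. The variable names PassNumber, PartySize and Lot are assumptions; only Entrance, Age and AmountPaid follow the slides.

```sas
/* Sketch only: variable names other than Age and AmountPaid are assumptions */
DATA southentrance;
   INPUT Entrance $ PassNumber PartySize Age;
   DATALINES;
S 43 3 27
S 44 3 24
S 45 3 2
;
RUN;

DATA northentrance;
   INPUT Entrance $ PassNumber PartySize Age Lot;
   DATALINES;
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
;
RUN;

/* Stack the two data sets and price each ticket by age */
DATA both;
   SET southentrance northentrance;
   IF Age = . THEN AmountPaid = .;
   ELSE IF Age < 3 THEN AmountPaid = 0;
   ELSE IF Age < 65 THEN AmountPaid = 35;
   ELSE AmountPaid = 27;
RUN;

PROC PRINT DATA = both;
RUN;
```

The combined data set has seven observations, and Lot is missing for the three south entrance observations because that data set has no such variable.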
Merging Data Sets - Adding Variables (1)
Data for a study may arise from more than one source, or at different times, and need to be combined.
- For matching purposes, you will want to have a common variable or several variables which taken together uniquely identify each observation. If the data are not already sorted, use the sort procedure to sort all data sets by the common variables.
- The basic form is as follows:
proc sort data=dataset1;
by variable-list;
proc sort data=dataset2;
by variable-list;
data new-dataset;
merge dataset1 dataset2;
by variable-list;
* If the two data sets have variables with the same names (other than the by variables), then the variables from the second data set will overwrite any variables having the same name in the first data set.
Example: Belgian Chocolatier
A Belgian chocolatier keeps track of the number of each type of chocolate sold each day.
- The code number for each chocolate and the number of pieces sold that day are kept in a file.
- In a separate file she keeps the names and descriptions of each chocolate as well as the code number.
In order to print the day's sales along with the descriptions of the chocolates, the two files must be merged together using the code number as the common variable.
See SAS program and output.
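The referenced program is not reproduced in this transcript. As a rough sketch of the sort-then-merge pattern, the example below uses invented code numbers, names and sales figures. The data set and variable names (sales, descriptions, chocolates, CodeNum, PiecesSold, Name) are assumptions as well, and the description text is left out to keep the input simple.

```sas
/* Sketch only: all names and data values are assumptions */
DATA sales;
   INPUT CodeNum $ PiecesSold;
   DATALINES;
C865 15
K086 9
A536 21
;
RUN;

DATA descriptions;
   INPUT CodeNum $ Name $;
   DATALINES;
A536 Walnoot
C865 Vanille
K086 Zoute
;
RUN;

/* Both data sets must be sorted by the common variable before merging */
PROC SORT DATA = sales;
   BY CodeNum;
RUN;

PROC SORT DATA = descriptions;
   BY CodeNum;
RUN;

DATA chocolates;
   MERGE sales descriptions;
   BY CodeNum;
RUN;

PROC PRINT DATA = chocolates;
RUN;
```

Each observation in chocolates carries the code number, the pieces sold and the matching name.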
One-to-Many Match Merge
Sometimes you need to combine two data sets by matching one observation from one data set with more than one observation in another.
Suppose you had data for every state in the U.S. and wanted to combine it with data for every county. This would be a one-to-many match merge.
The statements for a one-to-many match merge are identical to those for a one-to-one match merge:
data new-dataset;
merge dataset1 dataset2;
by variable-list;
- The order of the data sets in the merge statement does not affect the matching. In other words, a one-to-many merge will match the same observations as a many-to-one merge.
- Before you merge two data sets, they must be sorted by one or more common variables.
- You cannot do a one-to-many merge without a by statement.
Example: One-to-Many Match Merge
A distributor of athletic shoes is putting all its shoes on sale at 20 to 30% off the regular price. The distributor has two data sets:
- Data set 1: information about each type of shoe. It contains one record for each shoe with values for style, type of exercise (running, walking, or cross-training), and regular price.
- Data set 2: discount factor. It contains one record for each type of exercise and its discount.
To find the sale price, we need to merge the two data sets and calculate a new price after the discount.
See SAS program and output.
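The referenced program is not reproduced in this transcript. A sketch of a one-to-many merge along these lines is given here, with invented shoes, prices and discounts. All data set and variable names (shoes, discount, prices, Style, ExerciseType, RegularPrice, Adjustment, NewPrice) are assumptions.

```sas
/* Sketch only: all names and data values are assumptions */
DATA shoes;
   INPUT Style $ ExerciseType $ RegularPrice;
   DATALINES;
MaxFlite running 142.99
ZipFit running 83.99
Zoom walking 68.99
;
RUN;

DATA discount;
   INPUT ExerciseType $ Adjustment;
   DATALINES;
running 0.30
walking 0.20
;
RUN;

PROC SORT DATA = shoes;
   BY ExerciseType;
RUN;

PROC SORT DATA = discount;
   BY ExerciseType;
RUN;

/* One discount record is matched to every shoe of that exercise type */
DATA prices;
   MERGE discount shoes;
   BY ExerciseType;
   NewPrice = ROUND(RegularPrice * (1 - Adjustment), 0.01);
RUN;

PROC PRINT DATA = prices;
RUN;
```

The single running record in discount is repeated for both running shoes, which is what makes this a one-to-many merge.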
Tracking and Selecting Observations
When you combine two data sets, you can use in= options to track which of the original data sets contributed to each observation in the new data set.
For example, the data step below creates a data set named both by merging two data sets state and county. Then the in= options create two variables named InState and InCounty.
data both;
merge state (in=InState) county (in=InCounty);
by StateName;
SAS gives the in= variables a value of 0 or 1. A value of 1 means that the data set contributed to the current observation, and a value of 0 means no contribution.
* You can use this in= variable to subset data sets.
Example: The IN= Option
A sporting goods manufacturer wants to send a sales rep to contact all customers who did not place any orders during the third quarter of the year. The company has two data files:
- Data file 1: customer information
- Data file 2: orders placed during the third quarter
To compile a list of customers without orders, you merge the two data sets using the IN= option, and then select customers who had no observations in the orders data set.
See SAS program and output.
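The referenced program is not reproduced in this transcript. A sketch of the idea is given here, with made-up customers and orders that are already in CustomerID order. The names customers, orders, noorders, CustomerID, Name, Amount and Recent are assumptions.

```sas
/* Sketch only: all names and data values are assumptions */
DATA customers;
   INPUT CustomerID Name $;
   DATALINES;
101 Murphy
102 Sun
103 Alvarez
;
RUN;

DATA orders;
   INPUT CustomerID Amount;
   DATALINES;
101 38.50
103 112.00
103 17.25
;
RUN;

/* Recent is 1 when the orders data set contributed to the observation */
DATA noorders;
   MERGE customers orders (IN = Recent);
   BY CustomerID;
   IF Recent = 0;        /* subsetting if: keep customers with no orders */
RUN;

PROC PRINT DATA = noorders;
RUN;
```

Only customer 102 appears in noorders. The in= variable itself is temporary and is not written to the new data set.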
Selecting Observations with the WHERE= Option
The where= data set option is the most flexible of all ways to subset data. You can use it in data steps or proc steps. The basic form of a where= option is:
where = (condition)
- If used in a set or merge statement, the where= option will be applied to the data set that is being read. For example,
data gone;
set animals (where = (Status = 'Extinct'));
- If used in a data statement, the where= option will be applied to the data set that is being written. For example,
data uncommon (where = (Status IN ('Endangered', 'Threatened')));
set animals;
Example: WHERE= Option
The following data contain information about the Seven Summits, the highest mountains on each continent. Each line of data includes the name of a mountain, its continent, and its height in meters.
Kilimanjaro Africa 5895
Vinson Massif Antarctica 4897
Everest Asia 8848
Elbrus Europe 5642
McKinley North America 6194
Aconcagua South America 6962
Kosciuszko Australia 2228
We will create two data sets named "tallpeaks" (above 6000 meters) and "American".
See SAS program and output.
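The referenced program is not reproduced in this transcript. One possible sketch is given here: the data are read with column input, because two of the names and continents contain blanks, and both subsets are then written from a single data step with where= options on the data statement. The intermediate data set name sevensummits, the variable names, the column positions and the use of the CONTAINS operator are assumptions.

```sas
/* Sketch only: data set name, variable names and column layout are assumptions */
DATA sevensummits;
   INPUT Name $ 1-14 Continent $ 15-28 Height;
   DATALINES;
Kilimanjaro   Africa        5895
Vinson Massif Antarctica    4897
Everest       Asia          8848
Elbrus        Europe        5642
McKinley      North America 6194
Aconcagua     South America 6962
Kosciuszko    Australia     2228
;
RUN;

/* where= on the data statement filters each data set as it is written */
DATA tallpeaks (WHERE = (Height > 6000))
     american (WHERE = (Continent CONTAINS 'America'));
   SET sevensummits;
RUN;
```

With these data, tallpeaks holds Everest, McKinley and Aconcagua, and american holds McKinley and Aconcagua.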
1.8 SAS Graphics
SAS Graphics
When the SAS/GRAPH module has been licensed, there are a number of ways of producing high-quality graphical output. Three main approaches:
- Graphical options within a statistical procedure
- Traditional graphics procedures (gplot, gchart, etc.)
Graphics procedures that existed in versions of SAS prior to 9.2.
- Statistical graphics procedures (sgplot, sgpanel, sgscatter and sgrender)
New graphics procedures which can produce a wide range of attractive graphics.
We will focus on the statistical graphics procedures for now. The specific graphical options that are available within statistical procedures will be dealt with in later chapters.
xy Plots - Proc sgplot
An xy plot is one in which the data are represented in two dimensions defined by the values of two variables. For example, to create a scatterplot,
proc sgplot data=bodyfat;
scatter y=pctfat x=age;
run;
The syntax is straightforward:
- A scatter statement is used to tell SAS to create a scatterplot.
- In the scatter statement, both the x and y variables are specified explicitly.
For different types of plot, a statement other than scatter is used. See the next slide.
Types of xy Plots
Type of Plot                                                Plotting Statement
Scatter plot - data values are plotted                      scatter
Line plot - data values are joined with lines               series
Step plot - data values joined with stepped lines           step
Needle plot - vertical line joins the value to the x axis   needle
Regression plot - a scatter plot with a regression line     reg
Locally weighted regression                                 loess
Penalized B-splines                                         pbspline
* For line plots and step plots the points will be plotted in the order in which they occur in the data set, so sort the data by the x axis variable first.
* A common variant of the xy plot distinguishes groups in the data by using different symbols/lines. This is done with the group=var option. For example:
scatter y=pctfat x=age/group=sex;
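The two notes above can be combined in one short sketch: sort by the x axis variable, then draw one line per group with a series statement. The five rows of bodyfat data are made up so that the example runs on its own.

```sas
/* Sketch only: a made-up stand-in for the bodyfat data set */
data bodyfat;
   input age pctfat sex $;
   datalines;
58 33.8 F
23 9.5 M
39 31.4 F
45 27.4 M
27 17.8 M
;
run;

/* Line plots join points in data set order, so sort by the x variable first */
proc sort data=bodyfat;
   by age;
run;

proc sgplot data=bodyfat;
   series y=pctfat x=age / group=sex;   /* one line per value of sex */
run;
```

Without the sort step the lines would zigzag back and forth across the age axis.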
Overlaying Plots
It is often useful to combine the information from two or more plots by overlaying them. Sgplot does this automatically. For example, a plot to compare the fits from linear regression and locally weighted regression could be produced as follows:
proc sgplot data=bodyfat;
reg y=pctfat x=age;
loess y=pctfat x=age/nomarkers;
run;
The nomarkers option is specified to prevent the data points from being plotted twice, as sgplot uses different plotting symbols for each.