Transcript

Week 8 Introduction to SAS – The DATA Step

week 08 8.1

Unit 4

SAS for Data Management

Week 8: Introduction to SAS – The Data Step

Welcome. As mentioned in the introduction to this unit (click on the Unit 4 tab), the two principal

building blocks of a SAS program are the DATA step and the PROC step. This

reading is a detailed introduction to the DATA step. The emphasis is on using the

DATA step for purposes of reading, displaying, and writing data. Not described, but

possible, is use of the DATA step to accomplish other tasks, such as simulations.

The latter is beyond the scope of this course.

Goals of Week 8: Introduction to SAS – The Data Step

1. To understand the nature of, and purposes of, the DATA step;

2. To be able to read data into SAS from a variety of platforms (instream, external

file, other SAS data set);

3. To appreciate, and be competent in, the formatting of data for ease of readability;

4. To be able to view data;

5. To be able to write SAS data out to a variety of platforms;

6. To be familiar with the SAS viewtable feature and to appreciate that this is not recommended for use in data editing; and

7. To appreciate, and be competent in, the minimization of SAS storage of data.

Week 8 Outline – Introduction to SAS: The Data Step

Section Topic

1. How SAS Represents Data

2. How to Input Data Instream (the CARDS statement)

3. How to Input Data Stored in Text Format (INFILE and INPUT)

4. How to Input Another SAS Data Set (the LIBNAME statement)

5. More on LIBNAME and LIBREF

6. How to Read and Write From One or More SAS Data Sets to Another (the SET statement)

7. Writing Data to ASCII from SAS (the FILE and PUT statements)

8. Data Input/Output from ASCII to ASCII

9. The INPUT Command
   a. List input
   b. Character ($) and Imbedded Blanks (&)
   c. Column or Formatted Input
   d. Easy Column Input Using the At Symbol (@)

10. Advanced INPUT Features
   a. Reading Data With Multiple Lines Per Record (# and Slash)
   b. Reading Multiple Records from the Same Line of Data
   c. Reading Varying Numbers of Lines per Record

11. How to Handle Missing Values
   a. SAS Missing Value Codes
   b. The MISSING Statement
   c. The INVALIDDATA Option

12. How to Describe SAS Data Sets
   a. How to Label Variables
   b. How to Label a Data Set
   c. The PROC CONTENTS Procedure
   d. How to Use FORMAT to Document Variable Values
   e. Using the VIEWTABLE

13. Minimizing the Space Taken by a SAS Data Set

1. How SAS Represents Data

SAS represents data in tabular or rectangular form, where each column represents a

field or variable, which must be named, and each row represents a record or

observation. Observations are numbered sequentially. When data is sorted on

some field, such as age, the observations will be renumbered sequentially after

sorting. The observation number is not stored with the data, but is printed or

displayed as a convenience.

Typical Listing of Data in SAS

Listing from the PRINT procedure, displayed in HTML table format:

Obs  sid  age  height
1    1    17   56
2    2    26   62
3    3    41   60
4    4    29   66

View of Data using SAS VIEWTABLE:

Obs  sid  age  height
1    1    17   56
2    2    26   62
3    3    41   60
4    4    29   66

The DATA step is the most common method of data input or output from the SAS

system. The DATA step consists of several SAS statements, where the particular

statements required depend upon the source of data input. All data steps begin with

the keyword DATA.

2. How to Input Data Instream (the CARDS statement)

When you have a small amount of data that can be entered directly by typing it in

within a program, you may choose instream data entry using the CARDS statement.

This is most common when trying a small example or testing out a new program.

The following example creates a temporary SAS dataset called A1 with 3 variables

and 4 observations.

DATA A1;                  /* A1 is name of new dataset */
   INPUT SID AGE HEIGHT;  /* INPUT specifies variable names */
   CARDS;                 /* CARDS indicates data follows */
1 17 56
2 26 62
3 41 60
4 29 66
;                         /* The semicolon indicates end of data */
RUN;                      /* RUN indicates end of data step */

Notice the provision of /* comments */ to explain the meaning of the code.

• The DATA statement names the dataset to be created.

• The INPUT statement names the variables or fields that are to be read.

• The CARDS statement indicates that data lines follow, and the semicolon (;) on

the line after the data, indicates the end of the data lines.

• A RUN statement is used at the end of each DATA or PROC step in SAS so

that the group of statements will be executed. This is optional if the data step is

followed by another data step or proc step – but you must have it at the end of a

program or the last step will not be executed.
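As a small sketch of how these statements fit together, the step below reads the same four observations and then lists them. It uses DATALINES, which current SAS releases accept as a synonym for CARDS. The PROC PRINT step and its title are illustrative additions and are not part of the original example.

```sas
DATA A1;                       /* temporary data set, as in the example above */
   INPUT SID AGE HEIGHT;       /* three numeric variables, list input */
   DATALINES;                  /* synonym for CARDS */
1 17 56
2 26 62
3 41 60
4 29 66
;
RUN;

PROC PRINT DATA=A1;            /* illustrative: list the four observations */
   TITLE1 'Listing of temporary data set A1';
RUN;
```

Because each step ends with its own RUN statement, both the DATA step and the PROC PRINT step execute when the program is submitted.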

3. How to Input Data Stored in Text Format (the INFILE and INPUT statements)

More commonly data is read in from other sources, such as ASCII data files, or from

other SAS data files rather than appearing instream in the program. The basic

syntax of a DATA step when reading the data from an ASCII file is as follows:

DATA NEW1;                      /* NEW1 is the name of the new SAS data set */
   INFILE 'C:\TEMP\RAW.DTA';    /* specifies the file RAW.DTA on C:\TEMP */
   INPUT VAR1 VAR2;             /* specifies names for variables */
RUN;

The INFILE statement identifies an ASCII data file, on any disk drive or directory, by specifying the appropriate path. The path and filename must be enclosed in quotes (single or double). Many options are available to tailor the

INFILE statement to a particular data set. For example, the number of columns to be

read can be controlled with a linesize or logical record length specification on the

INFILE statement. For more details see the SAS Language Guide or SAS HELP.

Following the INFILE statement in SAS will be an INPUT statement that specifies the

correspondence between variable names assigned in SAS and columns in the ASCII

data file. This is where variable names are assigned. This statement will be

discussed in more detail later.
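The options mentioned above might be used as in the following sketch. The file path is the same hypothetical one used earlier, and the numeric values chosen for LRECL= and LINESIZE= are assumptions made only for illustration; check the INFILE documentation for the settings appropriate to a given file.

```sas
DATA NEW1;
   INFILE 'C:\TEMP\RAW.DTA'   /* hypothetical path, as in the example above */
          LRECL=200           /* assumed logical record length of the file */
          LINESIZE=80;        /* assumed: read only the first 80 columns */
   INPUT VAR1 VAR2;           /* variable names are assigned here */
RUN;
```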

4. How to Input Another SAS Data Set (the LIBNAME Statement)

When the data file to be input is itself a SAS data file, the DATA step takes on a

slightly different form. A SAS data file already has the columns identified with

variable names, and so the INPUT statement is not needed. The following example

reads a previously stored SAS data file called example3, and creates a temporary

SAS data file called A2.

LIBNAME SDATA 'C:\TEMP';    /* specifies location of SAS data files */
DATA A2;                    /* names new dataset to be created */
   SET SDATA.EXAMPLE3;      /* names SAS dataset to be read */
   ...                      /* ( other SAS statements here) */
RUN;

• The LIBNAME statement assigns a “nickname” (SAS calls this the libref) to a path (the drive and directory) where the SAS data set is stored or is to be saved. Consider the libname statement

LIBNAME sdata 'c:\temp';

Here sdata is the libref, that is, the nickname for the path c:\temp

• The SET statement names the SAS data set that is to be read in.

• When a single level name (single word, no dot ‘.’ followed by an extension) is

used in creating a dataset, as A2 in this example, it is saved as a "working"

(meaning temporary) data set while you are running the SAS system. Thus,

as soon as you close SAS the "working" data sets are erased. Working data

sets are stored in the SAS WORK library. You can view active SAS libraries in

the Explorer Window:

• To save a SAS data set as a permanent data set – one that will be there after

you exit from the SAS software – a two level (libref.dsn) name must be given

in the DATA statement. This example saves a copy of a temporary SAS

dataset.

o The first part of the name (the library reference or libref) matches

exactly the nickname (which points to the path comprised of drive and

subdirectory) assigned in a LIBNAME statement.

In order to create a permanent (saved) SAS data set, you need to run the following

lines in a SAS Program Editor window.

LIBNAME IN 'A:\HW3';
DATA IN.A2;
   SET A2;
RUN;

o Stored or permanent SAS data files all have an automatic filename

extension added. You will see this extension when you look at the file

in the Windows Explorer or My Computer. This extension is assigned

by SAS, and is not specified in any SAS statements. In version 8, the

extension added is .sas7bdat .

The location, or path (disk drive and directory), of SAS data files is specified in a LIBNAME statement. If you double-click on the icon of a saved SAS data file in Windows Explorer, SAS will open and the data file will be displayed in VIEWTABLE format.

In the LIBNAME, DATA and SET statements above:

• 'A:\HW3' is the location, in single quotes, of the physical directory where you would like to save the permanent SAS data set.

• IN.A2 is the name you would like to call your permanent SAS data set. The libref (IN) before the dot (.) must match the name you wrote on the LIBNAME statement.

• A2, on the SET statement, is the name of the temporary SAS data set that you want to save.

Icon and name for saved V8 SAS data set, as seen in Windows Explorer.

DO NOT change the name of a SAS data file in Windows Explorer or My Computer.

Information on the external file name is saved within the file. If you rename

A2.sas7bdat to be A3.sas7bdat you will get an error message when you try to open

or use the file in SAS.

5. More on LIBNAME and LIBREF

You can think of the directories on hard disk or floppy disks as libraries for storing

data. The LIBNAME statement is simply a pointer, an instruction that says “I’m

pointing to” a location. The location that is pointed to is a directory and subdirectory

path address that is contained in single or double quotes (I recommend double quotes). It gives a convenient way of indicating a code word or library reference

(SDATA and IN, in the above examples) that refers to a specific location (library) for

reading and/or storing SAS data files.

Libname IN "z:\bigelow\consulting\jurgens 2003\sasdata";

• LIBNAME is informing SAS that an address (where stuff can be found) is

being provided.

• Here it is given the nickname (libref) IN.

• "z:\bigelow\consulting\jurgens 2003\sasdata" is the actual directory and

subdirectory path location.

• Libraries can also be defined from the toolbar.

Using the new library button lets you define the LIBREF (or code word for that

library), the ENGINE (or data format) and the PATH (drive and directory).

• TIP: The advantage of using a libname statement within a program is that

the definition of the library becomes part of your program, and will be re-

defined each time the program is run. If you use the toolbar to set your library,

you must remember to set up your libraries each time you re-open SAS.

New library button:

You must have a separate library defined for each version (engine) of SAS

Older versions of SAS stored data in different formats. SAS refers to these as

“engines”. For example, version 6.12 of SAS used a default extension of .SD2.

Earlier DOS versions (6.04) of SAS used the extension .SSD . If you know you are

reading SAS data files that were saved with an earlier version of SAS, you must have

these data sets stored in a different directory or subdirectory from V8 SAS data files.

A separate LIBNAME statement must be used for each (sub)-directory.

For example, the following lines could be used to read an old SAS data set (version

6.12), and save a copy of it in the new SAS (version 8.2) format:

LIBNAME OLD V612 'C:\OLDSAS';   /* Old uses v612 engine, .sd2 format */
LIBNAME NEW V8 'C:\TEMP';       /* New v8 engine, .sas7bdat */
DATA NEW.D1;
   SET OLD.D1;
RUN;

Two libname statements are used to name two directories. The first, called OLD, contains the file D1.SD2 in version 6.12 format. The new data set, D1.sas7bdat, will

be saved in the C:\TEMP directory. The “engine” or version of SAS that created the

data set (in this example, they are v612 and v8) can be named before the path

specification on the libname statement. If you are unsure of the engine, it is not

required, as long as only one type of SAS file can be found in that directory.

Take care that data stored by older versions of SAS or other formats that will

be used in SAS, are stored in separate directories, otherwise you will get an

error message indicating that the data cannot be read.

Do not use the engine names for library names.

Note that the SAS engine names begin with V for version. Therefore, avoid using a

library name such as Vnnn, where nnn is a number. A list of engine names can be

found in the “new library” window.

6. How to Read and Write Data from One or more SAS Data Sets to Another

(the SET statement)

When data is already in SAS format, use a SET statement after the DATA statement

to point to the SAS data set you are reading from.

The next example reads two SAS data files, and concatenates them, storing the

result as a single new SAS data set in the same directory. If you want to store the

new data file in a different location (directory), a separate libname statement is

required.

LIBNAME SDATA 'C:\TEMP\';         /* specifies location of SAS data */
DATA SDATA.NEW1;                  /* creates a file named NEW1.SAS7BDAT on C:\TEMP */
   SET SDATA.TEST1 SDATA.TEST2;   /* concatenates files TEST1 and TEST2 */
   …                              /* other SAS instructions would go here */
RUN;

The SET statement in the DATA step can list a single SAS data file, or many files.

Various options are available using the SET statement to help tailor how the two files

will be combined. The SET statement may also be replaced by a MERGE statement

when data records are to be combined on a record-by-record basis. Each of these

applications will be discussed in greater detail in a later section.
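To see concatenation on a small scale, the following sketch builds two temporary data sets instream and stacks them with SET. The data sets and their values are made up for illustration, and single-level names are used so that no LIBNAME statement is needed.

```sas
DATA TEST1;                 /* hypothetical first data set */
   INPUT SID AGE;
   DATALINES;
1 17
2 26
;
RUN;

DATA TEST2;                 /* hypothetical second data set */
   INPUT SID AGE;
   DATALINES;
3 41
4 29
;
RUN;

DATA NEW1;
   SET TEST1 TEST2;         /* all rows of TEST1 followed by all rows of TEST2 */
RUN;
```

NEW1 then holds four observations, with the two from TEST1 first.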

7. Writing Data to ASCII Files from SAS (the FILE and PUT statements)

It is also possible to create ASCII files from SAS datasets. This can be useful for

transferring data into other programs for specific applications. Creation of ASCII

output data files from SAS data sets makes use of a combination of the LIBNAME

and SET statements and a FILE statement. Data are specified for output using a

PUT statement with the following syntax:

LIBNAME OLD 'C:\TEMP';       /* specifies location of SAS data files */
DATA _NULL_;                 /* uses a special SAS name that will not be saved */
   SET OLD.EX5;              /* specifies the SAS dataset EX5.sas7bdat on C:\TEMP
                                that will be read in the DATA step */
   FILE 'C:\TEMP\EX5.DTA';   /* names the ASCII data file to be created */
   PUT VAR1 VAR2;            /* specifies the variables that are to be written
                                to the data file EX5.DTA */
RUN;

• The FILE statement is the counterpart of the INFILE statement. Use FILE to

write data to an ASCII file, and use INFILE to read data from an ASCII or text

file.

• The PUT statement corresponds to the INPUT statement. PUT names the

SAS variables to be ‘put’ or written into the ASCII file; INPUT names the

variables to be read from an ASCII file.

• Since the purpose of the DATA step is to create an ASCII file, there is no need

to create another SAS data file – hence the dummy name _NULL_ is used.

This name is a special SAS name, used when you want to process data, but

do not want to create a new SAS data set.
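A self-contained version of this pattern might look like the sketch below. The temporary data set EX5 and its three records are invented for illustration, and the output path is assumed to exist and be writable.

```sas
DATA EX5;                      /* hypothetical temporary data set */
   INPUT VAR1 VAR2;
   DATALINES;
1 10
2 20
3 30
;
RUN;

DATA _NULL_;                   /* process data without creating a SAS data set */
   SET EX5;                    /* read each observation of EX5 in turn */
   FILE 'C:\TEMP\EX5.DTA';     /* assumed output location */
   PUT VAR1 VAR2;              /* list output: values separated by a blank */
RUN;
```

One line is written to the ASCII file for each observation read by the SET statement.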

8. Data Input/Output from ASCII to ASCII

SAS can also be used for processing data, even when you don’t plan to create or

save a SAS data set. An ASCII data set can be read in, computations made (new

variables created), or variables reformatted, and a new ASCII file written that can be

used in another application.

For example, you may prefer to use the graphics or analysis features of another

software package, but find it easier to manipulate data (e.g., create or modify

variables, change the data file structure) in SAS, and then use the data in another

program.

DATA _NULL_;
   INFILE 'C:\TEMP\EX1.DTA';   /* Names ASCII file to read in */
   FILE 'C:\TEMP\EX2.DTA';     /* Names ASCII file to be created */
   INPUT GRP X Y Z;            /* Names variables to read in */
   TOTAL = SUM(X,Y,Z);         /* New var TOTAL sums X, Y and Z */
   PUT GRP 1-3                 /* PUT tells SAS to write out data */
       X 5-6                   /* e.g. X is written out to columns 5-6 */
       Y 8-9                   /* e.g. Y is written out to columns 8-9 */
       Z 11-12
       TOTAL 14-16;
RUN;

Data in file EX1.DTA that looked like:

11 25 32 21
146 29 71 13
24 5 9 22

Would look like the following in file EX2.DTA:

 11 25 32 21  78
146 29 71 13 113
 24  5  9 22  36

9. The INPUT Command

Variable names are assigned to values in data sets using an INPUT statement.

There are four ways in which values can be associated with variables. These are

• list (free-format) input

• column input (formatted input with data in specified columns)

• named input of data

• formatted input, including INFORMAT statements.

Refer to the SAS Language Manual for more details, and SAS Language and

Procedures for more examples.

a. List Input

Warning!! List input should not be used as the routine method of data input

unless missing values are appropriately handled on the input (ASCII) data file.

One of the simplest forms of data input is list input or free-format. This method of

input is appropriate for reading small data sets, or creating test data. One or more

blank spaces or other delimiters on a record must separate values of variables to be

input. A delimiter is a defined marker that separates the value for one variable from

another. A blank space is a commonly used delimiter. Other commonly used

delimiters are commas or tabs. By default, when list input is used, SAS assumes a

blank space as the delimiter. To read data with a different delimiter, such as a

comma, use the DELIMITER option on the INFILE statement. The following

example uses list input to read three variables from each line.

Note that columns do not necessarily line up for each variable, when the number of

digits varies from record to record.

DATA A1;
   INPUT SID AGE HEIGHT;
   CARDS;
1 7 40
2 26 64
3 41 60
14 29 66
;
RUN;
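For data separated by commas rather than blanks, the DELIMITER option mentioned above could be applied as in this sketch. The values are the same as in the blank-delimited example, rewritten with commas for illustration; an INFILE statement that names DATALINES (or CARDS) is how INFILE options are applied to instream data.

```sas
DATA A1;
   INFILE DATALINES DELIMITER=',';   /* commas, not blanks, separate values */
   INPUT SID AGE HEIGHT;
   DATALINES;
1,7,40
2,26,64
3,41,60
14,29,66
;
RUN;
```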

TIP: Each line (or set of lines) must have a complete set of the values in order

to maintain the correct sequence of variables and values. When all the variables

are not found on a given record (some missing values), the next record is read with

values assigned consecutively. If the height 64 were missing on the second data

line, the value ‘3’ would be read in from the next line as the second height, and then

the next line, starting with SID 14 would be read as the 3rd subject.

TIP: A single blank space as a missing value results in a mismatch, which reads

in values from the wrong place, and results in both incorrect values as well as missed

observations.

To avoid this problem it is necessary to use the MISSOVER option on the

INFILE statement. When MISSOVER is specified the pointer will not move to a new

line to continue reading data but will assign a SAS missing value. However if the age

value were missing on a line, the value for height would be read in as AGE, unless

there is some place-holder, to indicate a missing value.

For SAS, a period or dot, '.' is used to indicate a missing numeric value.

This is why list input should not be used as the routine method of data input

unless missing values are appropriately handled on the input (ASCII) data file.

Some examples illustrating problems and solutions with missing data and list input are given in the program listinput.sas, which follows.

*******************************************************************;
***                                                             ***;
***  Project: BE 691F SAS example                               ***;
***  Date:    15 OCT 2000                                       ***;
***  Prog:    Penny Pekow                                       ***;
***  File:    listinput.sas                                     ***;
***  RE:      LIST input/ missover                              ***;
*******************************************************************;
***  Input:   instream data                                     ***;
*******************************************************************;

** CORRECT - complete data, simple list input ********************;
DATA A1;
   INPUT SID AGE HEIGHT;
   CARDS;
1 17 56
2 26 62
3 41 60
4 29 66
;
RUN;
Proc print data=a1;
   title1 'complete data';
run;

** WRONG - missing data last column, not dealt with **************;
DATA A2;
   INPUT SID AGE HEIGHT;
   CARDS;
1 17 56
2 26
3 41 60
4 29 66
;
RUN;
Proc print data=a2;
   title1 'missing ht on line 2';
run;

** CORRECT - missing data in last column: using missover *********;
DATA A3;
   infile cards missover;   /* use infile statement to use missover option */
   INPUT SID AGE HEIGHT;
   CARDS;
1 17 56
2 26
3 41 60
4 29 66
;
RUN;
Proc print data=a3;
   title1 'missover option used: missing ht on line 2';
run;

** WRONG - missing age in middle of line, missover used **********;
DATA A4;
   infile cards missover;   /* use infile statement to use missover option */
   INPUT SID AGE HEIGHT;
   CARDS;
1 17 56
2    62
3 41 60
4 29 66
;
RUN;
Proc print data=a4;
   title1 'missed age on line 2: ht value read as age';
run;

** CORRECT - missing data: using DOT placeholder *****************;
DATA A5;
   infile cards missover;   /* use infile statement to use missover option */
   INPUT SID AGE HEIGHT;
   CARDS;
1 17 56
2 . 62
3 41 60
4 29 66
;
RUN;
Proc print data=a5;
   title1 'missing age: . placeholder used';
run;

b. Character Variables ($) and Imbedded Blanks (&)

There are two special codes to be used on the INPUT statement, associated with list

input. The dollar sign special code ($) is used after the variable name to indicate that

character data is to be read – SAS assumes numeric data by default – and the

ampersand special code (&) is used when character variables have single imbedded

blanks. If a single imbedded blank occurs in a character variable, two blanks must be

used to separate this variable from the next variable (that is, the delimiter must be 2

blanks). The example below illustrates the use of these special codes in list or free-

format input statements.

DATA NEW1;
   INPUT SID FNAME $ LNAME $ STREET & $15.;
   CARDS;
001 Mary Bako  162 Pond St.
202 Sally Jones  447 Lake Drive
370 Peter McArthur  16 Newberry Rd.
;
RUN;

• The example reads an ID variable, first name, last name, and street address

using list or free format input.

• The dollar sign ($) is used to indicate character data for names and addresses.

• Since imbedded blanks occur within street addresses, this variable name is

followed by the special character "&".

• In addition, for character variables, by default, only the first 8 characters will be

read, unless otherwise specified.

• In this example, fifteen characters are to be read for the STREET variable, as

indicated by ‘$15.’ .

• Also note, in the data, a double blank space precedes the street address as the

delimiter.

List input must be used when the values to be read are separated by blanks or other

delimiters, but the columns vary from line to line, as in the following data:

1 12 3
2 100 14
3 31 16

In this case it is not possible to specify a particular column for reading the third

variable.

c. Column or Formatted Input

The most common form of input is column or formatted input. Column input

associates the variables with values by specifying the column where the data is

stored. Columns are indicated immediately after the variable name. As in list input, a

dollar sign ($) after the variable name is used to define a character variable. Column

input should be used when possible in all routine data input applications, since errors

due to misalignment of variables are minimized. Column input must be used when

no spaces or other delimiters are used between values, or when numeric data are

recorded without explicit inclusion of a decimal point, and values after the decimal

point occur. When this occurs, the number of digits that should be placed after the

decimal point can be specified immediately following the column specification. An

example of data input using column format is given next.

DATA NEW1;
   INPUT HID 1-5
         HT 7-9 .1
         WT 10-12
         ADDRESS $ 14-25;
   CARDS;
30192 665125 53 South Maple
42389 740180 114 Pondview
;
RUN;

• Four variables (HID, HT, WT and ADDRESS) are read for two subjects in this example.

• HT is read from columns 7 to 9, and written in SAS with 1 column after the

decimal point.

• WT is read from columns 10 to 12. Values of height and weight read for the

first subject are HT=66.5, WT=125, while values read for the second subject

are HT=74.0, WT=180.

• Note that the ampersand (&) isn’t necessary for an embedded blank in the

address field when column input is used because the columns, including the

space, are specified.

d. Easy Column Input Using the At Symbol (@)

A useful alternative form for column input is available that is easier to read. An @

symbol is used to indicate the beginning column for reading a variable, followed by

the variable name, with the number of columns and format for the variable indicated

immediately after the name. The INPUT statement that reads the same data as above using these features is:

DATA NEW1;
   INPUT @1 HID 5.
         @7 HT 3.1
         @10 WT 3.
         @14 ADDRESS $12.;
   CARDS;
30192 665125 53 South Maple
42389 740180 114 Pondview
;
RUN;

• The above input statement says to start at column 1 and read 5 columns for

HID.

• Then start at column 7 and read 3 columns for HT, writing the data with 1

column after the decimal point.

• WT is read starting in column 10, for 3 columns (nothing after the decimal

point), and ADDRESS is read as character data, for 12 columns starting with

column 14.

• It is not necessary to put each variable on a new line, though this improves

readability, which is advantageous for proofreading, as well as documentation.

Although this form of input requires more lines in a SAS program, the

documentation feature makes the extra lines worthwhile.

• This type of input statement is also used when reading data with a particular or

unusual format. The most common instance is with reading date values. SAS

offers a wide array of choices for formatting dates (see DATE FORMATS in

the SAS Language Guide), and for reading them in (see DATE INFORMATS).

The next example reads in dates that are stored in MM/DD/YY format.

INPUT @10 DOB MMDDYY8.;

This statement would read dates from a file, starting at column 10, taking 8

columns (6 numbers plus 2 slashes) in MMDDYY type format, such as 03/18/92 for

March 18, 1992.
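A minimal sketch of this kind of date input follows, using two made-up records in which the date begins in column 10. The data set name DATES, the SID values, and the DATE9. display format are illustrative choices rather than part of the original notes. How a two-digit year is expanded to four digits depends on the YEARCUTOFF system option in effect.

```sas
DATA DATES;                        /* hypothetical data set name */
   INPUT @1 SID 3.                 /* subject id in columns 1-3 */
         @10 DOB MMDDYY8.;         /* date in columns 10-17, e.g. 03/18/92 */
   FORMAT DOB DATE9.;              /* display the stored value as a date */
   DATALINES;
101      03/18/92
102      11/02/87
;
RUN;

PROC PRINT DATA=DATES;             /* illustrative: list the dates as formatted */
RUN;
```

SAS stores a date as a number of days, so without the FORMAT statement the listing would show DOB as a plain number rather than a readable date.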

10. Advanced INPUT Features

Many special features can be used with column input to make input statements

shorter, or tailored to particular applications. It is also possible to mix the ways in

which data are read in a single input statement. Some of these features are

illustrated in a few more examples. Refer to the SAS Language Manual

for others. The examples that follow illustrate (a) reading data for one observation

from multiple lines, (b) reading multiple records from one given line of data, and (c)

reading variable numbers of lines per record.

a. Reading Data With Multiple Lines per Record (# and Slash)

In theory, the data for each record could span as many columns as you like, so that the length of a line of data would be unlimited. In practice this is not possible. While SAS allows data to be input from very long data lines (up to

32767 columns), many other application programs restrict the number of columns

that can be used. For example, EpiInfo 6.04 writes data out to 80 characters per line,

and uses multiple lines per record, as needed. Printers are also restricted

(depending on the font) to less than 160 columns per line (for 8.5 inch paper).

Historically, when data were input via physical cards, line length was restricted to 80

columns corresponding to keypunch columns on the cards.

For these reasons – the restrictions imposed by other software – it is generally a

good idea to keep line length less than 140 columns, though this is not strictly

necessary. When many, many variables are recorded per subject and the number of

columns needed exceeds some limit, then additional variables are entered on

subsequent lines. Many lines can be used for recording variables for a particular

record.

To input data from such records into SAS, the line number is simply noted with a #

symbol prior to reading the variables on the line. A simple example illustrating the

syntax follows:

DATA NEW1;
   INPUT #1 @1 HID 5.
            @7 HT 3.1
            @10 WT 3.
         #2 @1 LNAME & $10.
            FNAME & $
            @40 STNO 4.
            @45 STNAME $10.;
   CARDS;
23901 684145
Jovanovic  Mary                        69   North St.
45392 735199
Mc Alligator  John Paul                1239 Smith Ave.
38389 770201
Xzavior-McCullagh  Nancy               37   Northwestern Ave.
;
RUN;
PROC PRINT DATA=NEW1;
   VAR HID HT WT LNAME FNAME STNO STNAME;
   TITLE1 'Ex: entering multiple lines w/ character truncation';
RUN;

• Variables for HID, height and weight are read from the first line.

• Variables for last name, first name, street number, and street name are read

from the second line.

• Although there are six lines of data, only three records are created, since there

are two lines per record.

• This example combines fixed (column) and free (list) format, since the columns

used for the first name differ depending on the last name length.

• Single imbedded blanks are permitted in the last name and first name by

inclusion of the symbol "&". The first name is separated from the last name

by two blanks to indicate a new variable. The number of columns retained in

the variable for last name is specified as 10, while the number of columns

retained for the first name is not specified (and therefore has the default value

of 8 columns). The listing of the data that results follows.

Example of entering multiple lines with character truncation

OBS   HID     HT    WT   LNAME        FNAME      STNO   STNAME
1     23901   68.4  145  Jovanovic    Mary         69   North St.
2     45392   73.5  199  Mc Alligat   John Pau   1239   Smith Ave.
3     38389   77.0  201  Xzavior-Mc   Nancy        37   Northweste

Another option for reading from multiple lines per record is to use a slash (/) in the

input statement to indicate that variables following the slash are to be read from the

next line. It isn’t as easy to proofread, since the current line number as well as the

total number of lines per record is not specified explicitly. The above data could also

be read as:

DATA NEW1;
  INPUT @1 ID 5. @7 HT 3.1 @10 WT 3.
      / @1 LNAME & $10. FNAME & $ @40 STNO 4. @45 STNAME $10.;
  CARDS;
23901 684145
Jovanovic  Mary                        69   North St.
45392 735199
Mc Alligator  John-Paul                1239 Smith Ave.
38389 770201
Xzavior-McCullagh  Nancy               37   Northwestern Ave.
;
RUN;

b. Reading Multiple Records From the Same Line of Data

When testing programs, or entering small data sets for analysis, data for multiple

records may be recorded on the same line. To read such data, the current line read

by the INPUT statement is held by using the trailing @@ symbol.

For example, suppose the variables for subject's identification (SID), subject's age

(AGE), pulse (PULSE) and years of education (EDUC) are recorded for 9 subjects on

three lines of data. The following example illustrates how the trailing @@ can be

used to read these data.


DATA NEW1;
  INPUT SID AGE 2. PULSE 2. EDUC 2. @@;
  CARDS;
01 221604 02 242216 03 332112
04 594007 05 153308 06 402311
07 232614 08 333016 09 302717
;
PROC PRINT DATA=NEW1;
  VAR SID AGE PULSE EDUC;
  TITLE1 'Example of reading multiple records per line';
RUN;

The output from this program follows:

Example of reading multiple records per line

OBS    SID    AGE    PULSE    EDUC
  1      1     22       16       4
  2      2     24       22      16
  3      3     33       21      12
  4      4     59       40       7
  5      5     15       33       8
  6      6     40       23      11
  7      7     23       26      14
  8      8     33       30      16
  9      9     30       27      17

A total of nine records are read from the three lines of data. Since SID is read in free format, the INPUT statement will automatically go to the next value (or next line) when searching for the next record's SID.


One other time saving feature can be illustrated in this example. When several

variables have the same fixed format, the format can be specified for the set of

variables by enclosing the set of variables in parentheses, and the common format in

parentheses. For example, the same data input would have resulted for the previous

example if the INPUT statement had read:

INPUT SID (AGE PULSE EDUC) (2.) @@;

c. Reading Varying Numbers of Lines per Record.

In some applications, different numbers of lines of data will be recorded for different

subjects. This situation will commonly arise when the number of variables recorded

in a questionnaire is so large that there are multiple lines per record. For some

subjects data may be reported only for variables in the first line, with no data for

subsequent lines (i.e., when large sections are blank due to skip patterns). In these

settings, rather than artificially padding the number of lines with missing values, fewer

lines may be recorded.

As a simple example, consider the data given below:

101   John Massey          1
101   114 Plumb St.        2
101   643-2373             3
103   Peter Black          1
103    67 Newberry Ct.     2
104   Jane Newperson       1
104  1782 Blackthorn Rd.   2
104   545-2223             3
105   Jake Wanderer        1
109   Sam Slipper          1
109    33 Hawthorne Ct.    2


These data contain information on five subjects, with the subject's name on the first

line, address on the second line, and phone number (if available) on the third line.

For ID=101 and ID=104, all data are reported. For ID=103 and ID=109, only name

and address are reported, and for ID=105 only name is reported.

The first variable on each line of data identifies the subject, while the last variable in

each line identifies the line number for the subject. The data can be input by using a

trailing @ in SAS, where the trailing @ holds the current line of data until a

subsequent input statement has been given.

DATA NEW1;
  INFILE 'C:\TEMP\EX2.DTA';
  INPUT @28 RECNO 1. @;   * @ holds the line for next input statement;
  IF RECNO=1 THEN INPUT @1 ID 3. @7 FNAME $ LNAME & $10.;
  ELSE IF RECNO=2 THEN INPUT @1 ID 3. @6 STNO 4. @11 STNAME & $10.;
  ELSE IF RECNO=3 THEN INPUT @1 ID 3. @7 PHONE $8.;
RUN;

PROC PRINT DATA=NEW1;
  TITLE1 'LISTING OF DATA: Varying lines per record';
RUN;


There are several features of the program that will be discussed in more detail later,

but are useful to note.

• In order to decide which line (and which format) is appropriate for a particular

line of data, the variable RECNO is read and the line held for subsequent

operation.

• An IF-THEN statement is used next. IF the line number matches a particular

value, THEN a particular input statement is used.

• An ELSE IF statement follows, since the next input statement could only be

used if the first IF condition was not met.

The output from this program follows:

LISTING OF DATA: Varying lines per record

OBS   ID    FNAME   LNAME       STNO   STNAME       PHONE
  1   101   John    Massey       114   Plumb St.    643-2373
  2   103   Peter   Black         67   Newberry C
  3   104   Jane    Newperson   1782   Blackthorn   545-2223
  4   105   Jake    Wanderer       .
  5   109   Sam     Slipper       33   Hawthorne
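Each execution of a DATA step like the one above reads one line and writes one observation, so getting a single observation per subject, as in the listing, needs one further step. The following is a minimal sketch of one way to do that, and not necessarily the approach used to produce the listing. It reads the same eleven lines instream, then uses an UPDATE statement with an empty master data set so that non-missing values are carried forward within each ID. The data set names LINES and SUBJ are hypothetical, and the sketch assumes the lines are already sorted by ID.

```sas
/* Step 1: one observation per line of data (LINES is a hypothetical name) */
DATA LINES;
  INPUT @28 RECNO 1. @;    /* trailing @ holds the line */
  IF RECNO=1 THEN INPUT @1 ID 3. @7 FNAME $ LNAME & $10.;
  ELSE IF RECNO=2 THEN INPUT @1 ID 3. @6 STNO 4. @11 STNAME & $10.;
  ELSE IF RECNO=3 THEN INPUT @1 ID 3. @7 PHONE $8.;
  DROP RECNO;
  CARDS;
101   John Massey          1
101   114 Plumb St.        2
101   643-2373             3
103   Peter Black          1
103    67 Newberry Ct.     2
104   Jane Newperson       1
104  1782 Blackthorn Rd.   2
104   545-2223             3
105   Jake Wanderer        1
109   Sam Slipper          1
109    33 Hawthorne Ct.    2
;
RUN;

/* Step 2: collapse to one observation per ID.                     */
/* The master data set is empty (OBS=0); missing values in the     */
/* transaction data set do not overwrite values already read.      */
DATA SUBJ;
  UPDATE LINES(OBS=0) LINES;
  BY ID;
RUN;

PROC PRINT DATA=SUBJ;
  TITLE1 'One observation per subject';
RUN;
```

With this approach a subject who has only a name line, such as ID=105, keeps missing values for the address and phone variables.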


11. How to Handle Missing Values

One of the real strengths of SAS is its flexibility in the handling of missing values.

Almost all collections of data have some missing values or values that are so

obviously invalid or out of range that they must be replaced with missing values. In

some cases data are not actually missing but are merely not applicable for all cases.

It is often advantageous to keep track of all of these situations and be able to

differentiate among them as, at times, this difference will have an impact on the total

number of subjects used to compute proportions.

a. SAS Missing Value Codes

When reading data into a SAS data set from an ASCII file or another format (e.g.,

Excel or Access), missing data can be represented for both numeric and character

data as either a blank or a single period (.) in the ASCII, Excel or Access file. When

reading an ASCII file using LIST input a period must be used, or else the next value

after the blank will be read in, and all subsequent values, at least for that line, will be

misread. An example was given earlier, in the section on LIST input. When using

COLUMN input, the columns may simply be left blank (or a period can be used).

Blank columns will be read into SAS as missing values in column input.


In SAS data sets, missing character values are represented by blanks ( ), and

missing numeric values are represented by a period (.). Therefore, other

missing value conventions must be reassigned to SAS recognizable missing

values prior to their use in SAS.

Example illustrating the “9”, “99”, “999” practice: The values 9, 99, or 999 are often used to designate missing values in data entry. SAS does not recognize these as missing; they must be recoded to a SAS missing value code so that they will not be used in computations (unless specifically requested; more on this later). These recodings are accomplished using programming statements when the data are read into a SAS data set, e.g.,

IF VAR1=9 THEN VAR1=.;
IF AGE=99 THEN AGE=.;

The above lines would replace all values of 9 for VAR1 and 99 for AGE with the SAS missing value ‘.’.
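When many variables share the same missing-value code, the recoding can be written once with an array instead of one IF statement per variable. This is a minimal sketch under assumed names: the data set SURVEY and the variables Q1-Q3 are hypothetical, with 9 standing for missing on the questions and 99 for missing on AGE.

```sas
/* SURVEY and Q1-Q3 are hypothetical names used for illustration */
DATA SURVEY;
  INPUT ID Q1 Q2 Q3 AGE;
  ARRAY Q{3} Q1-Q3;
  DO I = 1 TO 3;
    IF Q{I} = 9 THEN Q{I} = .;   /* recode 9 to SAS missing */
  END;
  IF AGE = 99 THEN AGE = .;      /* recode 99 to SAS missing */
  DROP I;
  CARDS;
1 2 9 1 34
2 9 9 3 99
3 1 2 2 57
;
RUN;

/* N and NMISS show how many values were treated as missing */
PROC MEANS DATA=SURVEY N NMISS MEAN;
  VAR Q1-Q3 AGE;
RUN;
```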

SAS actually offers a variety of missing value designations

Believe it or not, you may want to keep track of the different reasons for missingness

(for example - “unknown”, “refused”, “skipped” are three different data entry

scenarios that yield a missing value). To illustrate, suppose you wish to distinguish between refusals (coded as 7), not applicable (coded as 8) and missing (coded as 9). The following statements could be used after an input statement:


IF VAR1=7 THEN VAR1=.R;
ELSE IF VAR1=8 THEN VAR1=.N;
ELSE IF VAR1=9 THEN VAR1=.M;

• The SAS special missing value ‘R’ is assigned to refusals, originally entered

as “7”

• The SAS special missing value ‘N’ is assigned to the not applicable , originally

entered as “8”

• The SAS special missing value ‘M’ is assigned to the missing values, originally

entered as “9”

This might be handy later if you want to identify refusers, or in getting a count of

refusals, or, if you want to treat these as missing values for computational purposes.

TIP: The special missing values are stored in the data set and print as a letter

without the accompanying ‘.’; however in programming statements you must

refer to them by preceding the letter with a period (e.g., .R or .N).
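To see how special missing values can be counted by reason while still being left out of computations, here is a small sketch. The data set name RESP and the data values are made up for illustration. The MISSING option on the TABLES statement asks PROC FREQ to list each missing code as its own row, while PROC MEANS computes the mean from the non-missing values only.

```sas
/* RESP and its values are hypothetical */
DATA RESP;
  INPUT VAR1 @@;
  IF VAR1=7 THEN VAR1=.R;        /* refused        */
  ELSE IF VAR1=8 THEN VAR1=.N;   /* not applicable */
  ELSE IF VAR1=9 THEN VAR1=.M;   /* missing        */
  CARDS;
1 2 7 8 9 2 7 1
;
RUN;

/* Count each reason for missingness separately */
PROC FREQ DATA=RESP;
  TABLES VAR1 / MISSING;
RUN;

/* All three codes are excluded from the mean */
PROC MEANS DATA=RESP N NMISS MEAN;
  VAR VAR1;
RUN;
```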

SAS orders missing value types. The possible codes for missing numeric values in SAS, from smallest to largest, are:

._   .   .A   .B   .C   and so forth   .Z

Note: SAS treats the missing value “._” as the smallest and “.Z” as the largest.


We will see later that SAS offers you choices in the handling of missing values, such

as whether or not they appear in frequency tables, and whether or not they are

included in the computation of totals and percentages.

b. The MISSING Statement

Sometimes missing numeric data will be provided to you as letters, rather than as

periods or blanks. The result is a mixture of numeric and character entries in the

same field. This will cause an error (“invalid data” ) unless it is properly handled.

Use the MISSING Statement to manage missing numeric data that has been entered

using a letter.

In particular, take care to place a MISSING statement before an input statement so

that SAS will read these as missing values rather than as invalid numeric data. In the

following example, R and N will be treated in the SAS dataset as missing values.

DATA TEMP;
  MISSING R N;
  INPUT AGE;
  CARDS;
12
R
19
N
;
RUN;


c. INVALIDDATA option

The INVALIDDATA option is a great device! It allows you to detect invalid data and

provides you with a means of distinguishing it from actual missing data.

• It functions by creating a code (one that you’ve specified) when invalid data

appears in an input line; this can be displayed on the output of your SAS run.

• This proves handy in correcting invalid data.

• Note: INVALIDDATA appears on the OPTIONS statement, not as part of the

particular data step. Following is an example.

OPTIONS INVALIDDATA = 'X';
DATA TEMP;
  MISSING R N;
  INPUT AGE;
  CARDS;
12
R
19
N
3N
;
PROC PRINT DATA=TEMP;
RUN;

In this example, the value ‘3N’ is invalid; it does not conform to either of the missing

value codes nor to valid numeric data format. Use of the INVALIDDATA option

results in the replacement of the ‘3N’ with an ‘X’. Actually the ‘X’ replaces any invalid

data. The print out would look like the following.

OBS    AGE
  1     12
  2      R
  3     19
  4      N
  5      X

Thus, you know to review the data for observation #5.


d. How to Compute with Missing Values

As mentioned previously, SAS treats missing values as ordered and has a defined

ordering system. A few additional remarks.

• In SAS missing values are considered to have values less than all possible

numeric values (even negative ones). Thus, .Z < −∞

• Tip: When creating new variables from variables that have missing values,

take care!! For example in creating age groups from a variable AGE,

representing age in years, the following statements would include those with

missing AGE in the youngest age group:

Example of flawed SAS code:

IF AGE< 20 THEN AGEGR=1;
ELSE IF 20<=AGE< 40 THEN AGEGR=2;
ELSE IF 40<=AGE THEN AGEGR=3;

The problem here is that a missing value of age will be incorrectly binned into AGEGR=1, instead of retained as missing in the creation of AGEGR.

• To avoid this problem, give the first condition a lower bound, for example IF 0<=AGE<20, so that those with missing AGE will also have missing AGEGR, the variable for age group.

• To check for all missing values use:

IF VAR1 <= .Z THEN ...   (note the period before the Z)

• For example, to delete all observations with missing values for VAR1 use:

IF VAR1 <= .Z THEN DELETE;
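Putting these points together, a version of the age grouping that leaves AGEGR missing when AGE is missing might look like the sketch below. The data set name AGES and the four data lines are hypothetical. The missing check comes first, so no later condition can pick up a missing value.

```sas
/* AGES and its data lines are hypothetical */
DATA AGES;
  INPUT SID AGE;
  IF AGE <= .Z THEN AGEGR = .;        /* any kind of missing age */
  ELSE IF AGE < 20 THEN AGEGR = 1;
  ELSE IF AGE < 40 THEN AGEGR = 2;
  ELSE AGEGR = 3;
  CARDS;
1 15
2 .
3 27
4 62
;
RUN;

PROC PRINT DATA=AGES;
  TITLE1 'Age groups with missing AGE kept missing';
RUN;
```

Here subject 2 has a missing AGE and therefore a missing AGEGR, while subjects 1, 3 and 4 fall in groups 1, 2 and 3.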


In the same way, missing character values (blanks) precede all other characters in

alphabetic sorting, and must be handled appropriately.

• Missing character values can be referred to by enclosing a space in single

quotes. For example, if the variable SEX is represented by F and M, with a

blank for a missing value, all those with missing data for SEX could be deleted

with the statement:

IF SEX=' ' THEN DELETE;

• For further information see the chapter on Missing Values in the SAS Language

Guide, or SAS HELP.
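As a sketch of how the numeric and character checks can be combined, the example below deletes any observation with a missing SEX or a missing VAR1. It assumes a SAS release that provides the MISSING function, which returns true for a blank character value and for any numeric missing value, including special ones such as .R. The data set name KEEPERS and the data lines are hypothetical. In list input, a period stands for a missing character value.

```sas
/* KEEPERS and its data lines are hypothetical */
DATA KEEPERS;
  MISSING R;                       /* allow R as a numeric missing code */
  INPUT SID SEX $ VAR1;
  IF MISSING(SEX) OR MISSING(VAR1) THEN DELETE;
  /* Equivalent test without the function:      */
  /* IF SEX=' ' OR VAR1 <= .Z THEN DELETE;      */
  CARDS;
1 F 12
2 . 30
3 M .
4 M 25
5 F R
;
RUN;

PROC PRINT DATA=KEEPERS;
  TITLE1 'Observations with complete SEX and VAR1';
RUN;
```

Only subjects 1 and 4 remain in KEEPERS.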


12. Describing SAS Data Sets

SAS data sets are stored in a special (SAS-specific) format; thus, they cannot be

read directly by other programs.

• Once a data set has been stored in SAS format, some associations are

established. For example, variable names can be automatically associated

with labels and variable values can be automatically associated with the

appropriate word descriptions (more on this later).

• The advantage to its storage in SAS format is that when you refer to a SAS

data set in the SAS system it is not necessary to keep track of the variable

format and columns. The SAS system does that for you, when you refer to the

variable by name.

• A SAS data set cannot be viewed or printed from a text editor such as Notepad,

or from a word processor. (To get a data listing the print procedure, PROC

PRINT, must be used, as illustrated in the above examples. This procedure

will be discussed in more detail later).

Thus, SAS has documentation features that permit the attachment of labels and

formats to variables, which can be stored with the data. These formats and labels

are then used in printed output created by SAS procedures, rather than the variable

names or values.


a. How to Label Variables

A variable label is a descriptive phrase that characterizes the variable, thus permitting

more readable output where needed.

• It can be as simple as a less abbreviated name for the variable, or can contain

information on the units of measurement or codes.

• A label can be up to 256 characters in length (40 characters in SAS Version 6). Note, however, that labels are often truncated to 16 characters on printed output. (This means put the important information at the beginning of the label.)

• A label statement can be included anywhere in a DATA step.

• Tip: Define labels for all variables in SAS data sets.

• The syntax for a LABEL statement is:

LABEL PID   = 'Patient ID Number'
      AGE   = 'Age In Years'
      HT    = 'Height In Inches'
      SMOKE = 'Smoker: 1=y/0=n'
      ;

• The keyword LABEL is followed by a variable name, an equal sign, and the

variable label enclosed in single quotes (or double quotes when the label

contains a single quote or apostrophe), and so on, for each variable to be

labeled. A semi-colon to end the label statement follows the last label.

Multiple labels can be listed on a single line – the label statement ends only

with the semi-colon.


• Common errors in using label statements include:

Missing a quote

Improper handling of a label that contains a single quote

Leaving off the semi-colon after the last label

• Tip: Line up the equal signs, listing a single variable and label per line. This

makes proofreading and error checking easier.

• Labels will automatically appear on output for many SAS procedures, and can

be used optionally with other procedures, as in the example below.
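The handling of a label that contains an apostrophe, and the optional use of labels in PROC PRINT, can be sketched as follows. The data set KIDS and the variable MOMAGE are hypothetical names. The label containing the apostrophe is enclosed in double quotes, and the LABEL option on the PROC PRINT statement requests labels as column headings.

```sas
/* KIDS and MOMAGE are hypothetical names */
DATA KIDS;
  INPUT PID MOMAGE;
  LABEL PID    = 'Patient ID Number'
        MOMAGE = "Mother's Age In Years"   /* double quotes around an apostrophe */
        ;
  CARDS;
101 34
102 29
;
RUN;

/* Without the LABEL option, PROC PRINT heads columns with variable names */
PROC PRINT DATA=KIDS LABEL;
  TITLE1 'Labels used as column headings';
RUN;
```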

b. How to Label a SAS Data Set

This is done in the DATA statement itself, when naming the data set, as in the

example that follows.

LIBNAME OLD 'C:\TEMP\';
*********************************;
** Create a SAS data set NEW1   ;
*********************************;
DATA NEW1;
  INPUT SID 3. AGE 2. SEX $6. HT 3.1;
  CARDS;
00124Male  702
00329Female695
00435  Male745
;
RUN;
*********************************;
** Print the data set          *;
*********************************;
PROC PRINT DATA=NEW1;
  TITLE1 'Listing of Class Demographic Information';
RUN;


********************************;
* Store data with labels       *;
********************************;
DATA OLD.CLASS1(LABEL='CLASS DEMOGRAPHIC INFO');
  SET NEW1;
  LABEL SID='Student*ID*Number'
        AGE='Age on*Sept 1'
        SEX='Sex'
        HT='Ht in*Inches';
RUN;
************************************;
** Print the data set with labels  *;
************************************;
PROC PRINT DATA=OLD.CLASS1 SPLIT='*';
  TITLE2 'Using Labels in place of variable names';
RUN;
********************************;
* Get data structure           *;
********************************;
PROC CONTENTS DATA=OLD.CLASS1;
  TITLE1 'STRUCTURE OF CLASS DEMOGRAPHIC DATA SET';
RUN;

• Note - Although the LABELs were not created in the original SAS data set, they

were created and saved in the SAS data set CLASS1. The printed output

from the PRINT procedure follows:

Listing of Class Demographic Information

OBS    SID    AGE    SEX       HT
  1      1     24    Male      70.2
  2      3     29    Female    69.5
  3      4     35    Male      74.5


• Note - The second time the data is printed, the option SPLIT=‘*’ was used on

the PROC PRINT statement. This is a print option that indicates that labels

should be used in place of variable names to head columns, and that the

labels should be split into lines at the character indicated within quotes: ‘*’.

This is the reason that asterisks were used in creating the labels. The width of

the column for printing will depend on either the space required for printing the

data, or the space required for printing the variable name or label. For this

reason, long variable names should be avoided, and split characters should be

used in variable labels.

Listing of Class Demographic Information
Using Labels in place of variable names

       Student
         ID       Age on              Ht in
OBS    Number     Sept 1    Sex       Inches
  1       1         24      Male       70.2
  2       3         29      Female     69.5
  3       4         35      Male       74.5


c. The PROC CONTENTS Procedure

It is often useful to have a summary of the variables that are contained in the data

set. The SAS system has a special procedure called PROC CONTENTS that

summarizes information about SAS data sets. An example was given above, to

display the variables in the SAS data set CLASS1 (stored as class1.sas7bdat). The output from the PROC

CONTENTS follows:

STRUCTURE OF CLASS DEMOGRAPHIC DATA SET

The CONTENTS Procedure

Data Set Name: OLD.CLASS1                        Observations:         3
Member Type:   DATA                              Variables:            4
Engine:        V8                                Indexes:              0
Created:       16:42 Sunday, October 15, 2000    Observation Length:   32
Last Modified: 16:42 Sunday, October 15, 2000    Deleted Observations: 0
Protection:                                      Compressed:           NO
Data Set Type:                                   Sorted:               NO
Label:         CLASS DEMOGRAPHIC INFO

-----Engine/Host Dependent Information-----

Data Set Page Size:         4096
Number of Data Set Pages:   1
First Data Page:            1
Max Obs per Page:           126
Obs in First Data Page:     3
Number of Data Set Repairs: 0
File Name:                  C:\TEMP\class1.sas7bdat
Release Created:            8.0000M0
Host Created:               WIN_98

-----Alphabetic List of Variables and Attributes-----

#    Variable    Type    Len    Pos    Label
--------------------------------------------------------
2    AGE         Num       8      8    Age on*Sept 1
4    HT          Num       8     16    Ht in*Inches
3    SEX         Char      6     24    Sex
1    SID         Num       8      0    Student*ID*Number

Tip: In printing this out in MS Word, I selected the font SAS MONOSPACE - cb


• The PROC CONTENTS procedure lists the number of observations in the data

set, the data set label, the variable names, their type, length, position, and

labels.

• Under the Alphabetic List of Variables and Attributes, the # indicates the

ordered position of each variable in the SAS data set. SID is first; AGE is

second, and so on.

• The variable positions (Pos) correspond to the starting position for each

variable in the SAS data set, in bytes. The POSITION option on the PROC

CONTENTS statement can be used to give a second listing of the variables,

ordered by position rather than alphabetically. The first variable's starting

position is 0. Remaining variables are in starting positions formed by

cumulating the length of variables in bytes, following the order of input in the

SAS DATA step. Note that a length of 8 bytes is the default length

assigned to all numeric variables.


d. How to Use FORMAT to Document Variable Values

Formats enable descriptive labels to be substituted for numeric codes that represent

nominal or ordinal data.

• Formats can be created in many procedures and used for the resulting output of

that one procedure, or they can be created with PROC FORMAT and stored in

a format library. Formats can be permanently assigned to variables in a data

step, or can be assigned for use during a particular procedure, only.

• For example, for a survey where SEX of the respondent was coded as

(1=male, 2=female), and where several variables were coded as (0=no,

1=yes), the documentation might be coded in SAS as follows (more on this

later).

PROC FORMAT;
  VALUE SEXFMT 1='1.Male'
               2='2.Female';
  VALUE YNFMT  0='No'
               1='Yes';
RUN;

• This code represents a dictionary. When the dictionary is requested (using a FORMAT statement), SAS output will be more readable.

How to Use PROC FORMAT

• PROC FORMAT is used to define formats (with format names up to 8

characters in length) that assign labels or formats to values.

• The keyword VALUE is followed by the format name, followed by the code=,

and the label in single quotes.


• A semi-colon follows the final format label (and ONLY the final format label).

Any number of formats can be defined in PROC FORMAT at one time.

• If the variable containing the codes is a character variable, you must create a

character format by beginning the format name with a dollar sign ($) and

enclosing both the code and the label in quotes. (See example).

• On the Proc Format statement, CNTLOUT=filename is used to define the data

set where the format names and values will be saved. In the example below,

a format data set called FMT1.sas7bdat will be saved on C:\TEMP, that

contains the format information. Once formats are saved they can be used on

another occasion by the statements:

PROC FORMAT CNTLIN=SDATA.FMT1; RUN;

How to Use CNTLOUT and CNTLIN in PROC FORMAT

• CNTLOUT writes or saves a SAS format data file

• CNTLIN reads in a SAS format data file that has been saved previously.

• If neither is used, the formats created are only available for use during the SAS

session.


**********************************************;
** example to create formats and apply them **;
**********************************************;
LIBNAME SDATA 'C:\TEMP';

** create and save formats in c:\temp\fmt1.sas7bdat **;
PROC FORMAT CNTLOUT=SDATA.FMT1;
  VALUE SEXFMT 1='1.Male'
               2='2.Female';
  VALUE YNFMT  0='No'
               1='Yes';
  VALUE $CODEFMT 'A'='Always'
                 'B'='Sometimes'
                 'C'='Rarely'
                 'D'='Never'
                 ;
RUN;

** create test data with sex, a yes/no and letter coded variables **;
DATA TEST1;
  INPUT SEX YN CVAR $;
  CARDS;
1 0 A
1 1 B
2 0 C
2 1 D
;
RUN;

** PRINT DATA WITHOUT FORMATS **;
PROC PRINT DATA=TEST1;
  TITLE1 'UNFORMATTED LISTING OF TEST1';
RUN;

** ASSIGN FORMATS AND STORE DATA **;
DATA SDATA.TEST2;
  SET TEST1;
  FORMAT SEX SEXFMT. YN YNFMT. CVAR $CODEFMT.;
RUN;

** PRINT FORMATTED DATA, AND GET STRUCTURE **;
PROC PRINT DATA=SDATA.TEST2;
  TITLE1 'FORMATTED VERSION OF TEST DATA';
RUN;
PROC CONTENTS DATA=SDATA.TEST2;
RUN;


The formats are assigned to variables in a data step or in a procedure, by use of a

FORMAT statement. The keyword FORMAT is given, followed by the variable

name(s) followed by the format name ending with a period (.) to indicate a

format. The above example assigns the formats in a second data step, but they can

be assigned in the original data step also.

The resulting output is given below.

UNFORMATTED LISTING OF TEST1

Obs    SEX    YN    CVAR
  1      1     0    A
  2      1     1    B
  3      2     0    C
  4      2     1    D

FORMATTED VERSION OF TEST DATA

Obs    SEX         YN     CVAR
  1    1.Male      No     Always
  2    1.Male      Yes    Sometimes
  3    2.Female    No     Rarely
  4    2.Female    Yes    Never

Tip: Create and save a separate format program for your study. As more

formats are required during the course of a project, add to this format program, and

rerun it to update the format data file. In this way you have a single, complete file of

variable formats used in your project.
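A format can also be attached for the duration of a single procedure, leaving the stored data set unchanged. The sketch below illustrates this under assumed names: the data set TEST3 and its values are hypothetical, and YNFMT is defined again here so that the example runs on its own.

```sas
PROC FORMAT;
  VALUE YNFMT 0='No'
              1='Yes';
RUN;

/* TEST3 is a hypothetical data set of 0/1 codes */
DATA TEST3;
  INPUT YN @@;
  CARDS;
0 1 1 0 1
;
RUN;

/* The FORMAT statement applies only within this procedure */
PROC FREQ DATA=TEST3;
  TABLES YN;
  FORMAT YN YNFMT.;
RUN;

/* No format was stored with TEST3, so the raw codes print here */
PROC PRINT DATA=TEST3;
RUN;
```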


ALERT: Once formats have been assigned to variables in a saved data set, you

cannot access the data without the format file. As the program tries to read the

data file, it will look for the assigned formats, and if they cannot be found an error

message will be given in the log. So your format file must be available, along with

the data file if you have assigned formats in a data step that creates a stored file.

For more information on creating and using formats see the chapter on PROC

FORMAT in the SAS Procedures Guide.
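If the format file is not at hand, one possible workaround is the NOFMTERR system option, which tells SAS to continue and display the stored values with a default format rather than stop with a format error. The sketch below assumes the SDATA library, the TEST2 data set, and the FMT1 format data set from the example above.

```sas
LIBNAME SDATA 'C:\TEMP';

/* Option 1: carry on without the user-defined formats */
OPTIONS NOFMTERR;
PROC PRINT DATA=SDATA.TEST2;
  TITLE1 'TEST2 printed without its user-defined formats';
RUN;

/* Option 2: restore the formats from the saved format data set, */
/* then return to the default behaviour                          */
PROC FORMAT CNTLIN=SDATA.FMT1;
RUN;
OPTIONS FMTERR;
PROC PRINT DATA=SDATA.TEST2;
  TITLE1 'TEST2 printed with its formats restored';
RUN;
```

The first option is useful for a quick look at the raw codes. Keeping the format file with the data remains the safer practice.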

e. Using the Viewtable

The SAS Viewer is a wonderful feature of the SAS program that allows you to view

your SAS data in tabular (spreadsheet) display.

• TIP: Use this to view your data but not to manipulate it.

• You can open, view and edit a SAS data file from the Windows Explorer:

• select your SAS data set, and

• double-click to open it.


How to View Your Data from the SAS Explorer Window

• Double-click the library icon where your file is stored

• Double-click to open your SAS file

Temporary data files are stored in the Work library

Icons for SAS data files appear with the first part of the name only in the SAS Explorer. The “.sas7bdat” is not shown.


• Your data will appear in spreadsheet format, with rows representing records, and

columns fields.

There are a number of options on the toolbar for viewing the data.

• The Table Attributes and Column Attributes views show the same

information on variables that is displayed by using PROC Contents, and allow

you to modify these. Variables can be hidden from view (such as confidential

information), and the displays can be printed. This adds to the ease of

exploring your data, as well as giving additional display options for

documentation purposes. You can add new records and edit and delete

records directly in the Viewtable when in edit mode.

• The forms view displays the data one observation at a time, similar to forms in

Access.

[Figure: Viewtable toolbar, with buttons labeled Tables View, Forms View, Browse Mode, Edit Mode, and Attributes]

• TIP: Do NOT use the viewtable for editing data.

Any editing you can do in this view can also be accomplished through programming statements. Your program then serves as documentation of the changes you have made in your data. This documentation does not occur when you edit using the Viewtable.
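As a small illustration of editing through programming statements, the sketch below corrects one value in a DATA step and writes the result to a new data set, so that the original data and a record of the change are both kept. The data set names CLASS and CLASS_FIXED and the correction itself are hypothetical.

```sas
/* CLASS is a hypothetical data set standing in for your own data */
DATA CLASS;
  INPUT SID AGE;
  CARDS;
1 24
3 29
4 35
;
RUN;

/* Documented correction: suppose the age for SID=3 was entered wrongly */
DATA CLASS_FIXED;
  SET CLASS;
  IF SID = 3 THEN AGE = 30;
RUN;

PROC PRINT DATA=CLASS_FIXED;
  TITLE1 'Class data after documented correction';
RUN;
```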


13. Minimizing the Space Taken by SAS Data Sets

SAS data sets will always be larger than their corresponding ASCII data files.

This is because SAS data sets save information concerning the variables and

formats, and generally store variables using more bytes per variable than an ASCII

file. In the example in the previous section, the ASCII data set for the three subjects

was contained in 49 bytes whereas the SAS data set CLASS1.sas7bdat required

4096 bytes. It is possible to create SAS data sets so as to use less space. However,

SAS data sets will always be larger than their corresponding ASCII data files.

Reasons for the extra space requirement of a SAS Data Set

• Unlike ASCII, the SAS program includes in its storage of a data set information

on record format, variable names, and variable labels.

• The manner in which variables are stored in the SAS data set is itself more

space consuming.

Large data sets lead to slow read-write operations and slower processing. Although

the variable names and labels carry with them a fixed overhead, the additional size of

the data set due to storage of the data should be kept to a minimum.


Using the minimal length for each variable will minimize the space required for storing

the data set. The type of variable considered, as indicated in the Table below,

dictates this minimal length.

• For character variables, the length in bytes is equal to the number of columns, so the minimum length is 1 byte, for a 1 column variable, and the maximum allowed is 32,767 bytes (200 bytes in SAS Version 6). Character data is stored with 1 byte per column: 1 byte is needed for the ASCII character code for each letter, number or symbol.

• Numeric data is stored, not as the ASCII representation of the number, but using binary code. The largest whole number that can be stored in 1 byte is 255 (recall 1 byte = 2^3 = 8 bits, and 2^8 - 1 = 255). To store numeric data in SAS, the minimum length allowed is 3 bytes, and the maximum is 8 bytes. (Actually, there is a way to specify “double precision” when greater accuracy in computation is needed, but I’ve never had to use it.)

• By default, SAS will assign a variable length of 8 bytes to all numeric variables, so specifying shorter lengths where they are adequate can make a significant difference in the size of the SAS data set.

Table - Criteria for choosing variable type and length to minimize storage:

Measurement Scale          Type         Length      Maximum Value            Power of 2
Nominal                    Character    1-32,767
Ordinal                    Numeric      3           8,192                    13
Interval/Ratio             Numeric      4           2,097,152                21
  (whole numbers)                       5           536,870,912              29
                                        6           137,438,953,472          37
                                        7           35,184,372,088,832       45
                                        8           9,007,199,254,740,992    53
Interval/Ratio             Numeric      8
  (decimals, negatives)


• The length of character variables should be defined by the longest possible

value that you want to include.

• For numeric variables there are some choices as represented above. For

whole numbers, the length can be defined by the maximum value that the

variable can take. If the maximum value is less than 8,192 then a length of 3

is adequate, and so on. For example, where values of a variable are really

codes that represent categories, a length of 3 will always be adequate.

• When a variable can take on fractional values, a length of 8 is always required,

or else there will be problems with truncation/round-off errors.
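The maximum values in the table can be reproduced with a short DATA step. In the table, the power of 2 for each length equals 8 times the length minus 11 (13 for length 3, up to 53 for length 8). The sketch below simply evaluates that pattern and writes the results to the log. It assumes a host such as Windows that uses the numeric representation behind the table. The variable names LEN, POWER and MAXINT are arbitrary.

```sas
/* Largest whole number that can be stored exactly at each numeric length */
DATA _NULL_;
  DO LEN = 3 TO 8;
    POWER  = 8*LEN - 11;     /* pattern read off the table above */
    MAXINT = 2**POWER;
    PUT LEN= POWER= MAXINT= COMMA26.;
  END;
RUN;
```

A variable whose largest value is at or below MAXINT for a given LEN can be stored at that length without losing whole-number accuracy.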

Example - This example illustrates the impact of setting character and numeric

variable lengths. The following program reads a subset of variables into two data

sets. In the first, SAS is allowed to set the variable length by default, and declare all

variables as numeric. In the second data set, we read the same data using character

variables for nominal data, and minimal length appropriate for numeric variables.

Note that the minimal length for numeric variables is length=3. The statements to

read the data using default length are as follows:

* First data set. SAS is allowed to set lengths using defaults *;
data old.new1;
  input @1  HID      7.  @8  SID      1.  @9  SEGMENT  2.
        @11 CINTID   2.  @13 CNEWPHON 7.  @20 CLVN     1.
        @21 CLV_Day  6.  @27 CL_HR    2.  @29 CL_MI    2.
        @31 CL_AP    1.  @32 CL_PV    1.  @33 COUTCODE 1.
        @34 CFS      2.  @36 CE_HR    2.  @38 CE_MI    2.
        @40 C01      1.  @41 C02A     2.  @43 C02B     2.
        @45 C02C     2.  @47 C02D     2.  @49 C03A     1.
        @50 C03B     1.  @51 C03C     1.  @52 C03D     1.
        @53 C03E     1.  @54 C03F     1.  @55 C03G     1.;
  cards;
123438741239874238147231908472319084713209487312098432
123712934730246716566743285621989823406213498213482137
123466458750457864879437852374276161823989238198293883
583849238735783566743783287923409234098213672134902134
838393023623460769857786952396851234786243763248796224
;
run;
proc contents data=old.new1;
  title1 'Data without length statement';
run;

Here, the original ASCII data was stored in 281 bytes, while the resulting SAS data

set NEW1.sas7bdat requires 16384 bytes.

In creating the second data set, a second input specification is used in which nominal variables (codes) are defined as character variables (so that a length of 1 byte can be used), and numeric variables are given a length of 3 bytes unless they require more.

• Since HID, the ID variable, occupies 7 columns, it can take values greater than 2,097,152, the largest integer that a length of 4 stores exactly (see table above), so its length was set at 5.

• A LENGTH statement allows the specification of a length for each variable, by

giving the variable name followed by the length, or a list of variables separated

by spaces, followed by a length. A $ is used to indicate a character length.

• To reset the default length for variables not specifically named, use DEFAULT=

followed by a length on the length statement. An example, assigning a

character length of 15 to LNAME, a length of 5 to the two variables SID and

HOSPID, and the default to all other variables is:

LENGTH LNAME $ 15 SID HOSPID 5 DEFAULT=3;

• Input statements for the same data used above with the addition of a length

statement are:

* Second data set. SAS uses the length statement provided *;
data old.new2;
   length hid 5 default=3;
   input @1  HID 7.        @8  SID $1.     @9  SEGMENT $2. @11 CINTID $2.
         @13 CNEWPHON $7.  @20 CLVN $1.    @21 CLV_D 6.    @27 CL_HR 2.
         @29 CL_MI 2.      @31 CL_AP $1.   @32 CL_PV $1.   @33 COUTCODE $1.
         @34 CFS $2.       @36 CE_HR 2.    @38 CE_MI 2.    @40 C01 1.
         @41 C02A 2.       @43 C02B 2.     @45 C02C 2.     @47 C02D 2.
         @49 C03A $1.      @50 C03B $1.    @51 C03C $1.    @52 C03D $1.
         @53 C03E $1.      @54 C03F $1.    @55 C03G $1.;
cards;
123438741239874238147231908472319084713209487312098432
123712934730246716566743285621989823406213498213482137
123466458750457864879437852374276161823989238198293883
583849238735783566743783287923409234098213672134902134
838393023623460769857786952396851234786243763248796224
;
proc contents data=old.new2;
   title1 'Data with length formats';
run;

The resulting SAS data set requires 8192 bytes, about half the size of the file which used the defaults. While the saving in space may not seem great, imagine the effect when you read in data for several hundred or several thousand subjects.
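One way to compare the two data sets side by side is sketched below. It assumes that the libref OLD is still assigned and that both DATA steps above have been run. It uses the DICTIONARY.TABLES view in PROC SQL, where OBSLEN is the number of bytes stored per observation.

```sas
* Sketch only. Assumes libref OLD is assigned and NEW1 and NEW2 exist *;
proc sql;
   title1 'Observation length with and without a LENGTH statement';
   select memname, nobs, nvar, obslen
      from dictionary.tables
      where libname = 'OLD' and memname in ('NEW1', 'NEW2');
quit;
title1;   /* clear the title so it does not carry over to later output */
```

The OBSLEN column makes the per-record saving visible even when the data sets hold only a few observations.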

Tips on Using the LENGTH Statement.

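• For numeric variables, the LENGTH statement may be specified prior to the INPUT statement, or after the INPUT statement. The length of a character variable is fixed the first time the variable appears, so for character variables the LENGTH statement must come before the INPUT statement.

• When the LENGTH statement is specified prior to the INPUT statement, the lengths it assigns take precedence over the lengths that the informats in the INPUT statement would otherwise imply.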

• For example, the numeric variable HID will have length=5, even though its informat reads 7 columns. The DEFAULT=3 option on the LENGTH statement sets the length for all numeric variables not otherwise specified.

• To reset the default, the LENGTH statement must appear before the INPUT statement; otherwise the numeric variables are created with the standard default length of 8.

• All variables that are measured on a nominal scale (in this example) are read as character variables.

• If you are trying to fit data onto a disk, use of the length statement can

sometimes make it happen.
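The tips above can be checked with a short sketch. The data set name WORK.TIPS_DEMO and the single data line are hypothetical, chosen only for illustration. The LENGTH statement is placed before the INPUT statement so that DEFAULT=3 applies to the numeric variables that INPUT creates.

```sas
* Sketch only. WORK.TIPS_DEMO and its data line are hypothetical *;
data work.tips_demo;
   length hid 5 default=3;   /* must precede INPUT for DEFAULT= to apply */
   input @1  hid   7.        /* read from 7 columns, stored in 5 bytes   */
         @8  cl_hr 2.        /* numeric, takes the DEFAULT=3 length      */
         @10 sid   $1.;      /* nominal code read as character, 1 byte   */
   datalines;
1234387101
;
run;

proc contents data=work.tips_demo;
   title1 'Lengths set by LENGTH before INPUT';
run;
```

The Len column of the PROC CONTENTS listing shows the stored length of each variable, which can be compared with the column widths used on the INPUT statement.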
