Lesson 2 • Topic - Reading in data – Chapter 2 (Little SAS Book)

Upload: krikor

Post on 04-Jan-2016


TRANSCRIPT

Page 1: Lesson 2

Lesson 2

• Topic - Reading in data

– Chapter 2 (Little SAS Book)

Page 2: Lesson 2

Raw Data

Data Step

• Read in Data

• Process Data (Create new variables)

• Output Data (Create SAS Dataset)

PROCs

• Analyze Data Using Statistical Procedures

Page 3: Lesson 2

Raw Data Sources

• You type it in the SAS program

• Text file

• Spreadsheet (Excel)

• Database (Access, Oracle)

• SAS dataset

Page 4: Lesson 2

Data in Text Files

• Delimited data – variables are separated by a special character (e.g. a comma)

• Fixed position – data is organized into columns

Text files are simple character files that you can create or view in a text editor like Notepad. They can also be created as “dumps” from spreadsheet files like Excel.

Page 5: Lesson 2

Data delimited by commas (.csv file)

C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140

• Note: Missing data is identified by multiple commas.

Page 6: Lesson 2

Column Data

C084138093143
D089150091140
A078116100162
A      086155
C081145086140

• Note: Missing data values are blank.

Page 7: Lesson 2

INFILE and INPUT Statements

When you write a SAS program to read in raw data, you’ll use two key statements:

• The INFILE statement tells SAS where to find the data and how it is organized.

• The INPUT statement tells SAS which variables to read in.

Page 8: Lesson 2

Program 1

* List Directed Input: Reading data values separated by spaces;
DATA bp;
  INFILE DATALINES;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;
TITLE 'Data Separated by Spaces';
PROC PRINT DATA=bp;
RUN;

Obs  clinic  dbp6  sbp6  dbpbl  sbpbl
1    C       84    138   93     143
2    D       89    150   91     140
3    A       78    116   100    162
4    A       .     .     86     155
5    C       81    145   86     140

Page 9: Lesson 2

PARTIAL SAS LOG

1  DATA bp;
2    INFILE DATALINES;
3    INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
4    DATALINES;

NOTE: The data set WORK.BP has 5 observations and 5 variables.
NOTE: DATA statement used:
      real time 0.39 seconds
      cpu time 0.03 seconds

Page 10: Lesson 2

* List Directed Input: Reading .csv files;

DATA bp;
  INFILE DATALINES DLM = ',' DSD ;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140
;
TITLE 'Reading in Data using the DSD Option';
PROC PRINT DATA=bp;
RUN;

With the DSD option, consecutive commas indicate missing data.

Page 11: Lesson 2

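* List Directed Input: Reading data values separated by tabs (.txt files);

DATA bp;
  INFILE DATALINES DLM = '09'x DSD;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C	84	138	93	143
D	89	150	91	140
A	78	116	100	162
A			86	155
C	81	145	86	140
;
TITLE 'Reading in Data separated by a tab';
PROC PRINT DATA=bp;
RUN;

Note: the values in the data lines are separated by tab characters ('09'x). In the fourth record, consecutive tabs mark the two missing values.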

Page 12: Lesson 2

* Column Input: Data in fixed columns;

DATA bp;
  INFILE DATALINES ;
  INPUT clinic $ 1-1 dbp6 2-4 sbp6 5-7 dbpbl 8-10 sbpbl 11-13 ;
  DATALINES;
C084138093143
D089150091140
A078116100162
A      086155
C081145086140
;
Title 'Reading in Data using Column Input';
PROC PRINT DATA=bp;
RUN;

Note: missing data is blank

Page 13: Lesson 2

* Reading data using Pointers and Informats;

DATA bp;
  INFILE DATALINES ;
  INPUT @1 clinic $1. @2 dbp6 3. @5 sbp6 3. @8 dbpbl 3. @11 sbpbl 3. ;
  DATALINES;
C084138093143
D089150091140
A078116100162
A      086155
C081145086140
;
Title 'Reading in Data using Point/Informats';
PROC PRINT DATA=bp;
RUN;

Informat names must include a period (for example, 3. or $1.).

Page 14: Lesson 2

Program 2

* Reading data from an external file;

DATA bp;
  INFILE 'C:\SAS_Files\bp.csv' DSD FIRSTOBS = 2;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ;
RUN;
TITLE 'Reading in Data from an External File';
PROC PRINT DATA=bp;
RUN;

Content of bp.csv

clinic,dbp6,sbp6,dbpbl,sbpbl
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140

Page 15: Lesson 2

PARTIAL SAS LOG

7  DATA bp;
8    INFILE 'C:\SAS_Files\bp.csv' DSD FIRSTOBS=2 ;
9    INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ;

NOTE: The infile 'C:\SAS_Files\bp.csv' is:
      File Name=C:\SAS_Files\bp.csv,
      RECFM=V,LRECL=256

NOTE: 5 records were read from the infile 'C:\SAS_Files\bp.csv'.
      The minimum record length was 10.
      The maximum record length was 16.

NOTE: The data set WORK.BP has 5 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time 0.10 seconds
      cpu time 0.01 seconds

Page 16: Lesson 2

* Using PROC IMPORT to read in data ;
* Can skip data step;

PROC IMPORT DATAFILE='C:\SAS_Files\bp.csv'
     OUT = bp
     DBMS = csv
     REPLACE ;
  GETNAMES = yes;
  GUESSINGROWS = 9999;
RUN;

TITLE 'Reading in Data Using PROC IMPORT';

PROC PRINT DATA=bp;
PROC CONTENTS DATA=bp;
RUN;

GETNAMES = yes uses the first row for variable names.

Page 17: Lesson 2

The CONTENTS Procedure

Data Set Name  WORK.BP    Observations  5
Member Type    DATA       Variables     5

Alphabetic List of Variables and Attributes

#  Variable  Type  Len  Format   Informat
1  Clinic    Char  1    $1.      $1.
2  DBP6      Num   8    BEST12.  BEST32.
4  DBPBL     Num   8    BEST12.  BEST32.
3  SBP6      Num   8    BEST12.  BEST32.
5  SBPBL     Num   8    BEST12.  BEST32.

Page 18: Lesson 2

SOME INFILE OPTIONS

• OBS= - number of the last record to read (limits how many observations are read)

• FIRSTOBS= - start reading from this record

• MISSOVER and PAD - used to read in data with short records

• TERMSTR= - used to read files created on a different operating system (different end-of-line characters)

• LRECL= - needed when you have data with long records (> 256 characters)
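As a rough illustration of the short-record case, the sketch below reads in-stream data in which the second record is missing its last value. The data set name short and the three data lines are made up for this example. With MISSOVER, SAS should set sbp6 to missing for the short record instead of moving to the next line to look for a value.

```sas
/* Sketch only: data set name and data lines are hypothetical */
DATA short;
  /* MISSOVER: if a record ends early, remaining variables are set to missing */
  INFILE DATALINES MISSOVER;
  INPUT clinic $ dbp6 sbp6;
  DATALINES;
C 84 138
D 89
A 78 116
;
RUN;

TITLE 'Short record read with MISSOVER';
PROC PRINT DATA=short;
RUN;
```

Without MISSOVER, list input continues onto the next line to find a value for sbp6, so the third record would be used up by the second observation.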

Page 19: Lesson 2

* Problem when reading past default logical record length;

DATA temp;
  INFILE '\...\tomhs.data' OBS=6 ;
  INPUT @260 jntpain 2. ;
TITLE 'Data not read in correctly because variable is past LRECL ';
PROC PRINT;

Obs  jntpain
1    .
2    .
3    .

NOTE: Invalid data for jntpain in line 2
NOTE: SAS went to a new line when INPUT statement reached past the end of a line

Only 3 observations appear even though OBS=6, because each observation used up two lines of the file.

Page 20: Lesson 2

* Add LRECL option to fix problem ;

DATA temp;
  INFILE '\…\tomhs.data' OBS=6 LRECL=500;
  INPUT @260 jntpain 2. ;

TITLE 'Data read in correctly using LRECL option';

PROC PRINT;

Obs  jntpain
1    1
2    1
3    1
4    1
5    1
6    2

Page 21: Lesson 2

Reading Special Data

Data         Description           Informat
04/11/1982   Date                  mmddyy10.
59,365       Comma in number       comma6.
086-59-9054  Long (>8) characters  $11.
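To see what these informats do, the sketch below converts the three example values with the INPUT function and writes the results to the log. The variable names d, n and s are made up for illustration. SAS stores a date as a number of days since January 1, 1960, which is why a FORMAT statement is needed to print it as a date.

```sas
/* Sketch only: d, n and s are hypothetical variable names */
DATA _NULL_;
  d = INPUT('04/11/1982', mmddyy10.);  /* date is stored as a day count */
  n = INPUT('59,365', comma6.);        /* comma is removed, result is numeric */
  s = INPUT('086-59-9054', $11.);      /* all 11 characters are kept */
  PUT d= n= s=;                        /* raw stored values */
  PUT d= mmddyy10.;                    /* same date, displayed with a format */
RUN;
```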

Page 22: Lesson 2

* Reading special data with fixed position data;

DATA info;
  INFILE DATALINES;
  INPUT @1 ssn $11. @13 taxdate mmddyy10. @25 income comma6. ;
  DATALINES;
086-59-9054 04/12/2001  59,365
405-65-0987 03/15/2002  26,925
212-44-9054 04/15/2003  44,999
;
TITLE 'Variables with Special Formats';
PROC PRINT DATA=info;
  FORMAT taxdate mmddyy10.;
RUN;

Obs  ssn          taxdate     income
1    086-59-9054  04/12/2001  59365
2    405-65-0987  03/15/2002  26925
3    212-44-9054  04/15/2003  44999

Page 23: Lesson 2

* Reading special data with list input using colon modifier;
* DATALINES4 is used because the data lines contain semicolons;

DATA info;
  INFILE DATALINES4 DLM=';';
  INPUT ssn : $11. taxdate : mmddyy10. income : comma6. ;
  DATALINES4;
086-59-9054;04/12/2001;59,365
405-65-0987;03/15/2002;26,925
212-44-9054;04/15/2003;44,999
;;;;
TITLE 'Variables with Special Formats';
PROC PRINT DATA=info;
  FORMAT taxdate mmddyy10.;
RUN;

Obs  ssn          taxdate     income
1    086-59-9054  04/12/2001  59365
2    405-65-0987  03/15/2002  26925
3    212-44-9054  04/15/2003  44999

Page 24: Lesson 2

Summary of Ways of Reading in Data

You may not have a choice - data may come to you in a certain format

• List input - data is separated by a delimiter; must read in all variables.

• Column input - data is in fixed columns; must know where each variable starts and ends; can read in selected variables

• Pointers and Informats - alternative to column input; most flexible; must be used for special data

• PROC IMPORT
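One point from the summary, that column input can read selected variables, can be sketched with the fixed-column blood pressure data used earlier. Here only clinic and sbpbl are read; the data set name bp_subset is made up for this example.

```sas
/* Sketch only: bp_subset is a hypothetical data set name */
DATA bp_subset;
  INFILE DATALINES;
  /* Read column 1 and columns 11-13; columns 2-10 are skipped */
  INPUT clinic $ 1-1 sbpbl 11-13;
  DATALINES;
C084138093143
D089150091140
A078116100162
A      086155
C081145086140
;
RUN;

TITLE 'Reading selected variables with Column Input';
PROC PRINT DATA=bp_subset;
RUN;
```

The same selection could be written with pointers and informats as INPUT @1 clinic $1. @11 sbpbl 3. ;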

Page 25: Lesson 2

Exercise 2

• See exercise 2 in course notes