[statistics and computing] r for sas and spss users || generating data

12

Generating Data

Generating data is far more important to R users than it is to SAS or SPSSusers. As we have seen, many R functions are controlled by numeric, character,or logical vectors. You can generate those vectors using the methods in thischapter, making quick work of otherwise tedious tasks.

What follows are some of the ways we have used vectors as arguments toR functions.

� To create variable names that follow a pattern like q1, q2, etc. using thepaste function For an example, see Chap. 7.

� To create a set of index values to select variables, as with mydata[ ,3:6].Here the colon operator generates the simple values 3, 4, 5, 6. The methodsI present in this chapter can generate a much wider range of patterns.

� To provide sets of column widths when reading data from fixed-width textfiles (Sect. 6.6.1). Our example used a very small vector of value widths,but the methods in this chapter could generate long sets of patterns toread hundreds of variables with a single command.

� To create variables to assist in reshaping data, see Sect. 10.17.

Generating data is also helpful in more general ways for situations similarto those in other statistics packages.

� Generating data allows you to demonstrate various analytic methods, as Ihave done in this book. It is not easy to find one data set to use for all ofthe methods you might like to demonstrate.

� Generating data lets you create very small examples that can help youdebug an otherwise complex problem. Debugging a problem on a largedata set often introduces added complexities and slows down each new at-tempted execution. When you can demonstrate a problem with the small-est possible set of data, it helps you focus on the exact nature of theproblem. This is usually the best way to report a problem when request-ing technical assistance. If possible, provide a small generated examplewhen you ask questions to the R-help e-mail support list.

DOI 10.1007/978-1-4614-0685-3_12, © Springer Science+Business Media, LLC 2011, Statistics and Computing,R.A. Muenchen, R for SAS and SPSS Users 401

402 12 Generating Data

� When you are designing an experiment, the levels of the experimentalvariables usually follow simple repetitive patterns. You can generate thoseand then add the measured outcome values to them later manually. Withsuch a nice neat data set to start with, it is tempting to collect data inthat order. However, it is important to collect it in random order wheneverpossible so that factors such as human fatigue or machine wear do not biasthe results of your study.

Some of our data generation examples use R’s random number generator.It will give a different result each time you use it unless you use the set.seedfunction before each function that generates random numbers.

12.1 Generating Numeric Sequences

SAS generates data using DO loops in a data step. SPSS uses input programsto generate data. You can also use SPSS’s strong links to Python to generatedata. R generates data using specialized functions. We have used the simplestone: the colon operator. We can generate a simple sequence with

> 1:10

[1] 1 2 3 4 5 6 7 8 9 10

You can store the results of any of our data generation examples in a vectorusing the assignment operator. So we can create the ID variable we used with

> id <- 1:8

> id

[1] 1 2 3 4 5 6 7 8

The seq function generates sequences like this, too, and it offers moreflexibility. What follows is an example:

> seq(from = 1, to = 10, by = 1)

[1] 1 2 3 4 5 6 7 8 9 10

The seq function call above has three arguments:

1. The from argument tells it where to begin the sequence.2. The to argument tells it where to stop.3. The by argument tells it the increments to use between each number.

The following is an example that goes from 5 to 50 in increments of 5:

12.2 Generating Factors 403

> seq(from = 5, to = 50, by = 5)

[1] 5 10 15 20 25 30 35 40 45 50

Of course, you do not need to name the arguments if you use them in order.So you can do the above example using this form, too.

> seq(5, 50, 5)

[1] 5 10 15 20 25 30 35 40 45 50

12.2 Generating Factors

The gl function generates levels of factors. What follows is an example thatgenerates the series 1, 2, 1, 2, etc.:

> gl(n = 2, k = 1, length = 8)

[1] 1 2 1 2 1 2 1 2

Levels: 1 2

The gl function call above has three arguments.

1. The n argument tells it how many levels your factor will have.2. The k argument tells it how many of each level to repeat before incre-

menting to the next value.3. The length argument is the total number of values generated. Although

this would usually be divisible by n*k, it does not have to be.

To generate our gender variable, we just need to change k to be 4:

> gl(n = 2, k = 4, length = 8)

[1] 1 1 1 1 2 2 2 2

Levels: 1 2

There is also an optional label argument. Here we use it to generateworkshop and gender, complete with value labels:

> workshop <- gl(n = 2, k = 1, length = 8,

+ label = c("R", "SAS") )

> workshop

[1] R SAS R SAS R SAS R SAS


Levels: R SAS

> gender <- gl(n = 2, k = 4, length = 8,

+ label = c("f", "m") )

> gender

[1] f f f f m m m m

Levels: f m

12.3 Generating Repetitious Patterns (Not Factors)

When you need to generate repetitious sequences of values, you are oftencreating levels of factors, which is covered in the previous section. However,sometimes you need similar patterns that are numeric, not factors. The rep

function generates these:

> gender <- rep(1:2, each = 4, times = 1)

> gender

[1] 1 1 1 1 2 2 2 2

The call to the rep function above has three simple arguments.

1. The set of numbers to repeat. We have used the colon operator to generatethe values 1 and 2. You could use the c function here to list any set ofnumbers you need. Note that you can use any vector here and it will repeatit, the numbers need not be sequential.

2. The each argument tells it often to repeat each number in the set. Herewe need four of each number.

3. The times argument tells it the number of times to repeat the (set byeach) combination. For this example, we only needed one.

Note that while we are generating the gender variable as an easy example,rep did not create gender as a factor. To make it one, you would have to usethe factor function. You could instead generate it directly as a factor usingthe more appropriate gl function.

Next, we generate the workshop variable by repeating each number in the1:2 sequence only one time each but repeat that set four times:

> workshop <- rep(1:2, each = 1, times = 4)

> workshop

[1] 1 2 1 2 1 2 1 2

12.4 Generating Values for Reading Fixed-Width Files 405

By comparing the way we generated gender and workshop, the meaning ofthe rep function’s arguments should become clear.

Now we come to the only example that we actually needed for the rep

function in this book – to generate a constant! We use a variation of thisin Chap. 15, “Traditional Graphics,” to generate a set of zeros for use on ahistogram:

> myZeros <- rep(0, each = 8)

> myZeros

[1] 0 0 0 0 0 0 0 0

12.4 Generating Values for Reading Fixed-Width Files

In Section6.6 we read a data file that had the values for each variable fixed inspecific columns. Now that we know how to generate patterns of values, wecan use that knowledge to read far more complex files.

Let us assume we are going to read a file that contains 300 variables wewant to name x1, y1, z1, x2, y2, z2, . . . , x100, y100, and z100. The values ofx, y, and z are 8, 12, and 10 columns wide, respectively. To read such a file,our first task is to generate the prefixes and suffixes for each of the variablenames:

> myPrefixes <- c("x", "y", "z")

> mySuffixes <- rep(1:300, each = 3, times = 1)

> head(mySuffixes, n = 20)

[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7

You can see that we have used the rep function in the same way as wediscussed in the previous section. Now we want to combine these parts. Wecan do so using the paste function:

> myVariableNames <- paste(myPrefixes, mySuffixes, sep = "")

> head(myVars, n = 20)

[1] "x1" "y1" "z1" "x2" "y2" "z2" "x3" "y3" "z3" "x4" "y4" "z4"

[13] "x5" "y5" "z5" "x6" "y6" "z6" "x7" "y7"

You might think that you would need to have the same number of prefixand suffix pieces to match, but R will recycle a shorter vector over and over tomake it match a longer vector. Now we need to get 300 values of our variablewidths:


> myWidthSet <- c(8, 12, 10)

> myVariableWidths <- rep(myWidthSet, each = 1, times = 100)

> head(myRecord, n = 20)

[1] 8 12 6 8 12 6 8 12 6 8 12 6 8 12 6 8 12 6 8 12

To make the example more realistic, let us assume there is also an IDvariable in the first five columns of the file. We can add the name and thewidth using the c function:

myVariablenames <- c("id", myVariableNames)

myVariableWidths <- c( 5 , myVariableWidths)

Now we are ready to read the file as we did back in Sect. 6.6:

myReallyWideData <- read.fwf(

file = "myReallyWideFile.txt,

width = myVariableWidths,

col.names = myVariableNames,

row.names = "id")

12.5 Generating Integer Measures

R’s ability to generate samples of integers is easy to use and is quite differentfrom SAS’s and SPSS’s usual approach. It is similar to the SPSSINC TRANSextension command. You provide a list of possible values and then use thesample function to generate your data. First, we put the Likert scale values1, 2, 3, 4, and 5 into myValues:

> myValues <- c(1, 2, 3, 4, 5)

Next, we set the random number seed using the set.seed function, so youcan see the same result when you run it:

> set.seed(1234) # Set random number seed.

Finally, we generate a random sample size of 1000 from myValues using thesample function. We are sampling with replacement so we can use the fivevalues repeatedly:

> q1 <- sample(myValues, size = 1000, replace = TRUE)

Now we can check its mean and standard deviation:

12.5 Generating Integer Measures 407

> mean(q1)

[1] 3.029

> sd(q1)

[1] 1.412854

To generate a sample using the same numbers but with a roughly normaldistribution, we can change the values to have more as we reach the center.

> myValues <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

> set.seed(1234)

> q2 <- sample( myValues, 1000, replace = TRUE)

> mean(q2)

[1] 3.012

> sd(q2)

[1] 1.169283

You can see from the bar plots in Fig. 12.1 that our first variable follows auniform distribution (left), and our second one follows a normal distribution(right). Do not worry about how we created the plot; we will cover that inChap. 15, “Traditional Graphics.”

We could have done the latter example more precisely by generating 1000samples from a normal distribution and then chopping it into 5 equally spacedgroups. We will cover generating samples from continuous samples in the nextsection.

> boxplot( table(q1) )

> boxplot( table(q2) )

If you would like to generate two Likert-scale variables that have a meandifference, you can do so by providing them with different sets of values fromwhich to sample. In the example below, I nest the call to the c function, togenerate a vector of values, within the call to the sample function. Notice thatthe vector for q1 has no values greater than 3 and q2 has none less than 3.This difference will create the mean difference. Here I am only asking for asample size of 8:

> set.seed(1234)

> q1 <- sample( c(1, 2, 2, 3), size = 8, replace = TRUE)

> mean(q1)

[1] 1.75


1 2 3 4 5

050

100

150

200

1 2 3 4 5

010

020

030

0Fig. 12.1. Bar plots showing the distributions of our generated integer variables

> set.seed(1234)

> q2 <- sample( c(3, 4, 4, 5), size = 8, replace = TRUE)

> mean(q2)

[1] 3.75

12.6 Generating Continuous Measures

You can generate continuous random values from a uniform distribution usingthe runif function:

> set.seed(1234)

> x1 <- runif(n = 1000)

> mean(x1)

[1] 0.5072735

> sd(x1)

[1] 0.2912082

where the n argument is the number of values to generate. You can alsoprovide min and max arguments to set the lowest and highest possible values,respectively. So you might generate 1000 pseudo test scores that range from60 to 100 with

12.7 Generating a Data Frame 409

> set.seed(1234)

> x2 <- runif(n = 1000, min = 60, max = 100)

> mean(x2)

[1] 80.29094

> sd(x2)

[1] 11.64833

Normal distributions with a mean of 0 and standard deviation of 1 havemany uses. You can use the rnorm function to generate 1000 values from sucha distribution with

> set.seed(1234)

> x3 <- rnorm(n = 1000)

> mean(x3)

[1] -0.0265972

> sd(x3)

[1] 0.9973377

You can specify other means and standard deviations as in the followingexample:

> set.seed(1234)

> x4 <- rnorm(n = 1000, mean = 70, sd = 5)

> mean(x4)

[1] 69.86701

> sd(x4)

[1] 4.986689

We can use the hist function to see what two of these distributions looklike, see Fig. 12.2:

> hist(x2)

> hist(x4)

12.7 Generating a Data Frame

Putting all of the above ideas together, we can use the following commandsto create a data frame similar to our practice data set, with a couple of test


scores added. I am not bothering to set the random number generator seed,so each time you run this, you will get different results.

> id <- 1:8

> workshop <- gl( n=2, k=1,

+ length=8, label = c("R","SAS") )

> gender <- gl( n = 2, k = 4,

+ length=8, label=c("f", "m") )

> q1 <- sample( c(1, 2, 2, 3), 8, replace = TRUE)




> pretest <- rnorm( n = 8, mean = 70, sd = 5)

> posttest <- rnorm( n = 8, mean = 80, sd = 5)

> myGenerated <- data.frame(id, gender, workshop,

+ q1, q2, q3, q4, pretest, posttest)

> myGenerated

id gender workshop q1 q2 q3 q4 pretest posttest

1 1 f R 1 5 2 3 67.77482 77.95827

2 2 f SAS 1 4 2 5 58.28944 78.11115

3 3 f R 2 3 2 4 68.60809 86.64183

4 4 f SAS 2 5 2 4 64.09098 83.58218

5 5 m R 1 5 3 4 70.16563 78.83855

6 6 m SAS 2 3 3 3 65.81141 73.86887

60 70 80 90 100

020

4060

8012

0

50 60 70 80 90

050

150

250

350

Fig. 12.2. Histograms showing the distributions of our continuous generated data

12.8 Example Programs for Generating Data 411

7 7 m R 2 4 3 4 69.41194 85.23769

8 8 m SAS 1 4 1 4 66.29239 72.81796

12.8 Example Programs for Generating Data

12.8.1 SAS Program for Generating Data

* GenerateData.sas ;

DATA myID;

DO id=1 TO 8;

OUTPUT;

END;

PROC PRINT; RUN;

DATA myWorkshop;

Do i=1 to 4;

DO workshop= 1 to 2;

OUTPUT;

END;

END;

DROP i;

RUN;

PROC PRINT; RUN;

DATA myGender;

DO i=1 to 2;

DO j=1 to 4;

gender=i;

OUTPUT;

END;

END;

DROP i j;

RUN;

PROC PRINT; RUN;

DATA myMeasures;

DO i=1 to 8;

q1=round ( uniform(1234) * 5 );

q2=round ( ( uniform(1234) * 5 ) -1 );

q3=round (uniform(1234) * 5 );

q4=round (uniform(1234) * 5 );

test1=normal(1234)*5 + 70;


test2=normal(1234)*5 + 80;

OUTPUT;

END;

DROP i;

RUN;

PROC PRINT; RUN;

PROC FORMAT;

VALUES wLabels 1='R' 2='SAS';

VALUES gLabels 1='Female' 2='Male';

RUN;

* Merge and eliminate out of range values;

DATA myGenerated;

MERGE myID myWorkshop myGender myMeasures;

FORMAT workshop wLabels. gender gLabels. ;

ARRAY q{4} q1-q4;

DO i=1 to 4;

IF q{i} < 1 then q{i}=1;

ELSE IF q{i} > 5 then q{i}=5;

END;

RUN;

PROC PRINT; RUN;

12.8.2 SPSS Program for Generating Data

* Filename: GenerateData.sps .

input program.

numeric id(F4.0).

string gender(a1) workshop(a4).

vector q(4).

loop #i = 1 to 8.

compute id = #i.

do if #i <= 5.

+ compute gender = 'f'.

else.

+ compute gender = 'm'.

end if.

compute #ws = mod(#i, 2).

if #ws = 0 workshop = 'SAS'.


if #ws = 1 workshop = 'R'.

do repeat #j = 1 to 4.

+ compute q(#j)=trunc(rv.uniform(1,5)).

end repeat.

compute pretest = rv.normal(70, 5).

compute posttest = rv.normal(80,5).

end case.

end loop.

end file.

end input program.

list.

12.8.3 R Program for Generating Data

# Filename: GenerateData.R

# Simple sequences.

1:10

seq(from = 1,to = 10,by = 1)

seq(from = 5,to = 50,by = 5)

seq(5, 50, 5)

# Generating our ID variable

id <- 1:8

id

# gl function Generates Levels.

gl(n = 2, k = 1, length = 8)

gl(n = 2, k = 4, length = 8)

#Adding labels.

workshop <- gl(n = 2, k = 1, length = 8,

label = c("R", "SAS") )

workshop

gender <- gl(n = 2, k = 4, length = 8,

label = c("f", "m") )

gender

# Generating Repetition Patterns (Not Factors)

gender <- rep(1:2, each = 4, times = 1)

gender

workshop <- rep(1:2, each = 1, times = 4)


workshop

myZeros <- rep(0, each = 8)

myZeros

# Generating Values for Reading a Fixed-Width File

myPrefixes <- c("x", "y", "z")

mySuffixes <- rep(1:300, each = 3, times = 1)

head(mySuffixes, n = 20)

myVariableNames <- paste(myPrefixes, mySuffixes, sep = "")

head(myVars, n = 20)

myWidthSet <- c(8, 12, 10)

myVariableWidths <- rep(myWidthSet, each = 1, times = 100)

head(myRecord, n = 20)

myVariablenames <- c("id", myVariableNames)

myVariableWidths <- c( 5 , myVariableWidths)

mydata <- read.fwf(

file = myfile,

width = myVariableWidths,

col.names = myVariableNames,

row.names = "id")

# Generating uniformly distributed Likert data

myValues <- c(1, 2, 3, 4, 5)

set.seed(1234)

q1 <- sample( myValues, size = 1000, replace = TRUE)

mean(q1)

sd(q1)

# Generating normally distributed Likert data

myValues <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

set.seed(1234)

q2 <- sample( myValues , size = 1000, replace = TRUE)

mean(q2)

sd(q2)

# Plot details in Traditional Graphics chapter.

par( mar = c(2, 2, 2, 1)+0.1 )

par( mfrow = c(1, 2) )

barplot( table(q1) )


# par( mfrow = c(1, 1) ) #Sets back to 1 plot per page.

# par( mar = c(5, 4, 4, 2) + 0.1 )


# Same two barplots routed to a file.

postscript(file = "F:/r4sas/chapter12/GeneratedIntegers.eps",

width = 4.0, height = 2.0)

par( mar = c(2, 2, 2, 1) + 0.1 )




dev.off()

# Two Likert scales with mean difference

set.seed(1234)

q1 <- sample( c(1, 2, 2, 3), size = 8, replace = TRUE)

mean(q1)

set.seed(1234)

q2 <- sample( c(3, 4, 4, 5), size = 8, replace = TRUE)

mean(q2)

# Generating continuous data

# From uniform distribution.

# mean = 0.5

set.seed(1234)

x1 <- runif(n = 1000)

mean(x1)

sd(x1)

# From a uniform distribution

# between 60 and 100

set.seed(1234)

x2 <- runif(n = 1000, min = 60, max = 100)

mean(x2)

sd(x2)

# From a normal distribution.

set.seed(1234)

x3 <- rnorm(n = 1000)

mean(x3)

sd(x3)

set.seed(1234)

x4 <- rnorm(n = 1000, mean = 70, sd = 5)

mean(x4)

sd(x4)


# Plot details are in Traditional Graphics chapter.

par( mar = c(2, 2, 2, 1) + 0.1 )


hist(x2)

hist(x4)

# par( mfrow = c(1, 1) ) #Sets back to 1 plot per page.

# par( mar = c(5, 4, 4, 2) + 0.1 )

# Same pair of plots, this time sent to a file.

postscript(file = "F:/r4sas/chapter12/GeneratedContinuous.eps",

width = 4.0, height = 2.0)

par( mar = c(2, 2, 2, 1) + 0.1 )


hist(x2, main = "")

hist(x4, main = "")

dev.off()

# par( mfrow = c(1,1) ) #Sets back to 1 plot per page.

# par( mar = c(5, 4, 4, 2) + 0.1 )

# Generating a Data Frame.

id <- 1:8

workshop <- gl( n = 2, k = 1,

length = 8, label = c("R","SAS") )

gender <- gl( n = 2, k = 4,

length = 8, label = c("f","m") )

q1 <- sample( c(1, 2, 2, 3), 8, replace = TRUE)




pretest <- rnorm(n = 8, mean = 70, sd = 5)

posttest <- rnorm(n = 8, mean = 80, sd = 5)

myGenerated <- data.frame(id, gender, workshop,

q1, q2, q3, q4, pretest, posttest)

myGenerated

[statistics and computing] r for sas and spss users || generating data

Documents