
Post on 15-Mar-2020


Stat 342 - Wk 5

Random number generation.

Special variables in data steps.

Setting labels.

Do loops and data step behaviour.

Example questions for the midterm.

Stat 342 Notes. Week 3, Page 1 / 38


Random number generation

The most common distributions to generate random numbers from are the uniform and the normal.

This is done with the RAND() function inside a data step, specifying a distribution, and parameters if necessary.

RAND('UNIFORM') will provide a random value from 0 to 1.

RAND('NORMAL') will provide a random value from the standard normal distribution (mean = 0, sd = 1).


Why use random number generation?

1) Sampling. Suppose you had a large dataset of every cellphone number in Vancouver, and you wanted to get the opinion of 1000 randomly selected people. That random selection is done with random number generation.

You may want to...

...weight your sample to account for certain demographics not answering their phones.

...give the possible responses to a multiple choice question in a randomly selected order.
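The random selection described above can be sketched as follows. This is a minimal illustration, not a method from the lecture: the dataset phones, its variable id, the seed 2020, and the helper variable sort_key are all hypothetical. Each row gets a uniform random key, the rows are sorted by that key, and the first 1000 are kept.

```sas
/* Hypothetical stand-in for the full list of phone numbers */
data phones;
    do id = 1 to 5000;
        output;
    end;
run;

/* Attach a uniform random key to every row */
data phones_keyed;
    set phones;
    if _n_ = 1 then call streaminit(2020);  /* fixed seed, set once */
    sort_key = rand('uniform');
run;

/* Sorting by the random key puts the rows in random order */
proc sort data=phones_keyed;
    by sort_key;
run;

/* Keep the first 1000 rows of the shuffled data as the sample */
data phone_sample;
    set phones_keyed (obs=1000);
    drop sort_key;
run;
```

Because the seed is fixed, rerunning the program selects the same 1000 rows.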


Why use random number generation?

2) Goodness of fit testing.

If you wanted to find out how a certain set of data would behave if it followed a hypothesized distribution, you could generate values from that distribution and explore that hypothetical situation.

You could see how well that distribution fits your data by comparing hypothetical data to real data. That's one way to assess goodness of fit.


Why use random number generation?

3) Making data anonymous. (1/2)

If you are going to be sharing a dataset with other researchers or the public, you have an obligation to protect the privacy of any people whose data is recorded.

Sometimes private data like phone numbers or e-mail addresses is used to identify people in a data set. For example, in a record of sales, where one row is one sale, you might see the same phone number in multiple rows.


3) Making data anonymous. (2/2)

If that's the case, you would be destroying useful information by getting rid of the phone number variable.

What you can do, however, is scramble the phone numbers. They would need to be scrambled in such a way that the same number gets scrambled the same way every time.

That way, someone else could read the data after it has been scrambled and still see when one person has made many purchases. They cannot, however, call that person.
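One possible way to scramble consistently is sketched below. The lecture does not say which method to use; this sketch swaps in a hash (the SAS MD5 function) as an assumption, and the dataset sales, its variables, and the name customer_id are hypothetical. The same phone number always produces the same 32-character code, so repeat customers remain visible.

```sas
/* Hypothetical sales records: one row per sale */
data sales;
    length phone $12;
    input phone $ amount;
    datalines;
604-555-0101 20
604-555-0199 35
604-555-0101 15
;
run;

data sales_anon;
    set sales;
    length customer_id $32;
    /* MD5 returns 16 bytes; the $hex32. format writes them as 32 hex characters */
    customer_id = put(md5(strip(phone)), $hex32.);
    drop phone;   /* the original number is not kept in the shared data */
run;

proc print data=sales_anon;
run;
```

Rows 1 and 3 share a customer_id because they share a phone number. A plain hash of a short value like a phone number can be reversed by trying every possible number, so real anonymization would likely need something stronger, such as adding a secret value to the number before hashing.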


Random Number Generation: Seeds

Computers cannot (typically) generate true random numbers. Instead, they use a complicated formula based on a starting value that has to be provided by an outside source.

When you use a random number function like UNIFORM(x), the value x is the starting value, or seed, that is used.


In SAS, by default, the computer will use the time of its internal clock as its seed.

If you specify a positive integer like 345 in the streaminit() routine with call streaminit(345), then that value '345' will be used as the first seed. Each time a random number is generated, a new seed derived from '345' is used for the next one.


Why care about the seed?

If the clock-based seed is used, there is no way to retrieve a seed and use it again. Every time you run an analysis on a time-based seed you will get a different result.

If you want to generate random numbers, but you want the same random numbers every time you run an analysis, set a fixed seed. An analysis that includes setting a fixed seed will give the same result every time.


Here is an example program that sets a fixed seed and generates 10 random numbers from the Cauchy distribution.

data random;

call streaminit(123);

do i=1 to 10;

x1=rand('cauchy');

output;

end;

run;


The same 10 Cauchy values will be found every time.

Now try after you remove “call streaminit(123);”
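One way to check the fixed-seed claim without comparing values by eye is sketched below. The dataset names run_a and run_b are hypothetical; the two data steps are copies of the Cauchy example, and PROC COMPARE reports whether their values match.

```sas
/* First run with a fixed seed */
data run_a;
    call streaminit(123);
    do i = 1 to 10;
        x1 = rand('cauchy');
        output;
    end;
run;

/* Second run with the same seed */
data run_b;
    call streaminit(123);
    do i = 1 to 10;
        x1 = rand('cauchy');
        output;
    end;
run;

/* Compare the two datasets value for value */
proc compare base=run_a compare=run_b;
run;
```

Removing both call streaminit(123); lines and rerunning should make PROC COMPARE report differences in x1, since each step then starts from a clock-based seed.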


SAS can generate random numbers from a wide variety of distributions and parameter sets.

RAND('NORMAL', 5,3)

will give you a random normal (aka Gaussian) number from a distribution with mean 5 and standard deviation 3.

RAND('POISSON', 10)

will provide a random number from a Poisson distribution with lambda (mean, variance) of 10.

From SAS Documentation on the RAND function


Live example of random number generation.

data auto;

call streaminit(123);

do i=1 to 10;

rep78 = RAND('POISSON', 10);

mpg = RAND('NORMAL',5,3);

foreign = RAND('BERNOULLI',0.5);

output;

end;

run;


In the random number generator example (consider these practice midterm problems)

1) What happens if you put the seed initialization (call streaminit() ), inside the do-loop?

2) What happens if you get rid of the seed initialization?

3) What could you do to drop the counter i?


Special Variables

There are a few variables that are present in every SAS data step that you can use. These are typically for debugging data steps.

_n_ , which counts the iterations of the data step so far, including the current one (typically one per input row).

_error_ , which is 1 if there was an error processing a row, and 0 otherwise.
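A small sketch of both special variables follows. The dataset sv_demo and the variables x, root, iter, and err are hypothetical. Because _n_ and _error_ are not written to the output dataset, the step copies them into ordinary variables so they can be printed.

```sas
data sv_demo;
    input x;
    root = sqrt(x);   /* a negative x makes this an invalid operation */
    iter = _n_;       /* copy of the iteration counter */
    err  = _error_;   /* copy of the error flag for this iteration */
    datalines;
4
-9
16
;
run;

proc print data=sv_demo;
run;
```

The second row should show a missing root and err = 1, while the rows before and after it show err = 0, and iter runs 1, 2, 3.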


Special Variables example on paper.


Something strange happens when you run a do loop AND include a dataset. Try this:

data auto_test;

set auto;

old_i = i;

do i=1 to 5;

what = RAND('NORMAL');

output;

end;

run;


This 'data AND set' code highlights a few things about how data steps work.

- By default, a data step only runs for a single iteration.

- If there is/are dataset(s) mentioned in the set command, the data step will run one time for each row in each dataset.

- ALL the processing of a data step (other than set) is done EVERY time the data step runs. If there are multiple times 'output' is run, then you get multiple rows per data step run.
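The three points above can be seen side by side in a small sketch. The dataset names no_set, two_rows, and with_set, and the variables id, iteration, and j, are hypothetical.

```sas
/* No SET statement: the step runs once, but OUTPUT in the loop writes 3 rows */
data no_set;
    do i = 1 to 3;
        output;
    end;
run;

/* Hypothetical two-row input dataset */
data two_rows;
    input id;
    datalines;
1
2
;
run;

/* With SET: the step body runs once per input row, and each run writes 3 rows */
data with_set;
    set two_rows;
    iteration = _n_;   /* which pass through the data step produced this row */
    do j = 1 to 3;
        output;
    end;
run;

proc print data=with_set;
run;
```

Here with_set should have 6 rows: 2 input rows times 3 output statements per pass.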


What would the variable 'test' look like in this data?

data auto_midterm;

set auto;

keep i what test;

test = _n_;

do i=1 to 5;

what = RAND('NORMAL');

output;

end;

run;


What would the variables err_before and err_after look like in this data?

data auto_midterm2;

set auto; keep i z sq_z err_before err_after;

err_before = _error_;

do i=1 to 5;

z = RAND('NORMAL');

sq_z = sqrt(z);

err_after = _error_;

output;

end; run;


These two variables behave in very specific ways.

_n_ increments when there's a new cycle in the data step, NOT when a row is added to the output dataset.

_error_ is reset to zero at the beginning of each new data step iteration. It also does not increase beyond 1 even if there are multiple errors during a data step.

Not all errors cause a data step to crash. Numerical ones, like trying to divide by zero or taking the square root of a negative value, just get marked by _error_ , and the data step continues without issue.

Try changing these variables directly in the code of a data step and observe what happens.


The label command allows you to assign 'labels' to variables.

These do NOT change the names of the variables. They assign a new property to each variable. In SAS, each variable has a blank label by default.

These labels do not have the same restrictions as variable names, and can even be entire sentences.


Labels are only retained properly when datasets are saved in a SAS- or JMP-specific format. If data is exported into a format for another program, the labels are either not retained, or may not be retained in a way that makes sense.

For example, the .csv format has no space for variable labels. At best, an extra row can be inserted above the variable names for labels.


Labels show in the results from proc contents alongside the format of each variable.

Labels example.

DATA auto2;

SET auto;

LABEL rep78 ="1978 Repair Record"

mpg ="Miles Per Gallon"

foreign="Where Car Was Made";

RUN;

PROC CONTENTS DATA=auto2;

RUN;
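Labels can also be used as column headings in printed output. The sketch below is an addition to the lecture example: the dataset cars_small and its two rows are made up, and the label option on proc print asks SAS to show labels instead of variable names.

```sas
/* Tiny hypothetical dataset with labels attached in the data step */
data cars_small;
    input mpg foreign;
    label mpg     = "Miles Per Gallon"
          foreign = "Where Car Was Made";
    datalines;
22 0
17 1
;
run;

/* The LABEL option prints labels as column headings */
proc print data=cars_small label;
run;
```

The variable names mpg and foreign are unchanged; only the headings in the printed table differ.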


Possible midterm problems, continued.

4) Given <sas code>, comment each line and briefly explain what it does.

Where <sas code> could include

- data steps with set, labels, random number generation, summary variables, and retain.

- proc print, proc contents, proc import/export with dbms

- proc sql with select, where, order by, and group by.


Example 1 of code to explain.

proc import datafile="mtcars.csv"

out=mtcars dbms=csv;

delimiter=',';

getnames=yes;

run;


Example 2 of code to explain.

DATA times2 ;

SET times ;

avg = MEAN(trial1, trial2, trial3);

sd = STD(trial1, trial2, trial3);

Ntrials = N(trial1, trial2, trial3);

RUN;


5) Write simple SAS code from scratch to do a certain task.

Typical tasks will involve getting the variable names and labels (proc contents), showing the data set or the first few rows (proc print), and very simple sql.


Examples of code to write from scratch.

“Find the variable names, formats, and labels of auto2”

PROC CONTENTS DATA=auto2;

RUN;


“Show the top 10 rows of the variables 'first' and 'second' from the dataset ds.”

proc print data=ds (obs=10);

var first second;

run;


“Show every row of the variables time1, time2, time3, all the way up to time20.”

proc print data=ds;

var time1-time20;

run;


“Make a table of the largest country in terms of the variable 'area' in dataset 'world' by continent with an SQL query.”

proc sql;

select continent, country, area as biggest_area

from world

group by continent

having area = max(area);

quit;


Example code from chapters 1 and 2 of 'SAS and R' text, the SQL code given in the lecture, and the scanned pages from the 'Data Step' book are all fair game for this midterm.

You are allowed to bring a single sided A4 size aid sheet into this exam, as long as it is written by hand and does NOT include photocopies.

This should make the 'explain code' portion less of a memorization task.


Other possible problems

6) Given this SQL table, and this select statement, what will the output be?

For this, I highly recommend you look over the lab notes from weeks 3 and 4, which are SQL heavy.

An SQL table will be provided, like the Motor Trend cars (mtcars.docx) dataset.


Example 1 of SQL code to provide a table for

PROC SQL;

select model, mpg as mileage, cyl as cylinders

from mtcars

where disp < 100;


Example 2 of SQL code to provide a table for

PROC SQL;

select model, hp, am

from mtcars

where hp = 110

group by am;
