Using SAS to Analyze Data
Nadia Akseer, MSc. Candidate
Brock University
Agenda
a. Examine Individual Variable Distributions
   i. Continuous data: Proc Univariate, Proc Means (distributions, tests for normality, plots)
   ii. Categorical data: Proc Freq (distributions)
b. Examine Relationships Between Variables
   i. Continuous data: Scatter plots, Correlation (Spearman, Pearson)
   ii. Categorical data: Freq tables, probabilities, Chi-square, Fisher's Exact
   iii. Continuous and Categorical data: Proc ttest
Example
Height, Weight, BMI, sex and activity level measurements are available for a group of physically active students
**Note: 5 activity questions asked ‘1=none…4=very active’ **
Example Cont'd
- What are the mean, median, mode of the BMI?
- How dispersed is the BMI data?
- Is BMI normally distributed?
Proc Univariate
- Provides information on:
  - Measures of central tendency (mean, median, mode etc.)
  - Measures of dispersion (standard deviation, range, IQR etc.)
- Allows us to visualize data (stem-leaf, normality & box plots)
- Used for a continuous variable
Proc Univariate Syntax

    Proc univariate data=bmi plot normal;
    Var bmi;
    Histogram/normal;
    Run;
Proc Univariate Output
Is the data normally distributed? If not, which way is it skewed?
Treat the variable as normally distributed if the normality test p-value > 0.05
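The full Proc Univariate listing is long, so it can help to print only the tables that answer the two questions above. The following is a minimal sketch, assuming the same data set `bmi` and variable `bmi` used on the syntax slide. The ODS table names `Moments` and `TestsForNormality` are the standard Proc Univariate table names, and the skewness value appears in the Moments table.

```sas
/* Sketch: show only the moments (including skewness) and the normality tests. */
/* Assumes a data set named bmi with a numeric variable bmi, as in the slides. */
ods select Moments TestsForNormality;
proc univariate data=bmi normal;
   var bmi;
run;
```

A positive skewness suggests a longer right tail and a negative skewness a longer left tail.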
Example Cont'd
- How many individuals have complete data for height, weight and BMI?
- What is the range of data for all three variables?
- What are the means and standard deviations?
Proc Means
Used to obtain mean, standard deviation, min and max for multiple continuous variables

    Proc means data=bmi;
    Var bmi ht wt;
    Run;
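The default Proc Means output reports N separately for each variable, which is not the same as the number of individuals with all three measurements present. The sketch below first requests the statistics by name, then counts the jointly complete rows. It assumes the data set `bmi` and the variables `bmi`, `ht` and `wt` from the slide. The data set name `complete` is hypothetical.

```sas
/* Request specific statistics; N and NMISS are reported per variable. */
proc means data=bmi n nmiss mean std min max range;
   var bmi ht wt;
run;

/* Keep only rows where all three variables are non-missing. */
/* The data set name complete is an illustrative assumption. */
data complete;
   set bmi;
   if nmiss(bmi, ht, wt) = 0;
run;

/* N here is the number of individuals with complete data. */
proc means data=complete n;
   var bmi;
run;
```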
Example Cont'd
- What proportion of the sample are boys?
- What proportion are girls?
- What proportion of the sample are not physically active in the first activity question?
- What are the physical activity trends in all 5 activity questions?
Proc Freq
- Looks at distribution of categorical variables
- Gives information about frequency and proportions
- Can look at multiple variables at a time

    Proc Freq data=bmi;
    Table sex active1-active5;
    Run;
Correlation
- Two variables are considered to be correlated when there is a relationship between them
- ρ (rho) a.k.a. “Correlation Coefficient (r)”
  - Used to express the strength of the association between the two variables
  - Has a range of values: −1 ≤ ρ ≤ 1
    - |ρ| = 1 → perfect linear relationship
    - ρ → 0 → weak linear relationship
    - |ρ| → 1 → strong linear relationship
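For reference, the sample coefficient r that estimates ρ is usually computed with the standard Pearson formula below. The formula is not shown on the slides; here x_i and y_i denote the paired observations and n the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

The Spearman coefficient applies the same formula to the ranks of x and y rather than to the raw values.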
Correlation
- Hypotheses
  - What is our H0 in correlation? ρ = 0 → There is no linear correlation
  - What is our HA in correlation? ρ ≠ 0 → There is a linear correlation
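As a sketch of how these hypotheses are commonly tested (the slides do not show the formula), the sample coefficient r from n pairs is converted to a t statistic and compared with a t distribution on n − 2 degrees of freedom:

```latex
t = r \sqrt{\frac{n - 2}{1 - r^2}}, \qquad \text{df} = n - 2
```

A large |t|, and therefore a small p-value, leads to rejecting H0: ρ = 0.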
Correlation
- Procedure for determining if there is a correlation between two variables
  1. Run a scatter plot
  2. Check Assumptions (Normal distribution)
  3. Run either a Pearson or a Spearman
  4. Determine if you reject/fail to reject Ho
  5. If you reject, look at correlation coefficient: How strong is the relationship?
Review
5. If H0 is rejected, determine the strength of the relationship

    |ρ|          Relationship
    > 0.7        Strong
    0.4 – 0.7    Medium
    < 0.4        Weak
Example
- Both the Pearson Correlation and the Spearman Correlation will be used on the same example data to show the differences between the two methods

Table 9.1. Lengths and Weights of Male Bears
    x Length (in.)   53.0   67.5   72.0   72.0   73.5   68.5   73.0   37.0
    y Weight (lb)      80    344    416    348    262    360    332     34
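The slides do not show how Table 9.1 was read into SAS. One possible data step is sketched below. The data set name `bears` is hypothetical, while the variable names `length` and `weight` follow the plot and corr statements used on the slides.

```sas
/* Sketch: read the eight (length, weight) pairs from Table 9.1. */
/* The data set name bears is an illustrative assumption.        */
data bears;
   input length weight @@;
   datalines;
53.0 80  67.5 344  72.0 416  72.0 348
73.5 262  68.5 360  73.0 332  37.0 34
;
run;
```

A procedure called without a `data=` option, as on the slides, uses the most recently created data set, so running this step first makes those calls operate on the bear data.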
Example
1. Run a Scatter Plot

    proc plot;
    plot weight*length;   /* general form: plot y*x; */
    title '….';
    run;

Can you see an association?
Example
2. Check Assumptions
- Random sample ☑
- Points approximately on a straight line ☑
- Outliers examined ☑
- Normal distribution for both variables ☒ (p-values < 0.05)
(Normality output shown for Weight and Length)
Example
3. Decide Between a Pearson and a Spearman
- Only 3/4 assumptions were met, therefore we should proceed with a….
*Normal distribution most important*
Example
Pearson

    Proc corr;
    Var weight length;
    Run;

(Output shows: correlation coefficient (r) and p-value)
Is there a linear relationship? Strength?
Example
Spearman

    Proc corr spearman;
    Var weight length;
    Run;
Example
4. Determine the fate of H0
5. Determine the strength of the relationship
- Spearman → r=0.35929, p=0.3821 ∴ FTR (fail to reject) H0: There is no linear relationship between the weight and length of a bear
- Pearson → r=0.897, p=0.0025 ∴ Reject H0: There is a strong linear relationship between the weight and length of a bear
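Both coefficients can also be requested in a single Proc Corr call, which puts the two results side by side. This is a sketch rather than the presenter's code: the data set name `bears` is hypothetical, and `nosimple` is only there to suppress the descriptive statistics table.

```sas
/* Sketch: Pearson and Spearman correlations in one call. */
/* The data set name bears is an illustrative assumption. */
data bears;
   input length weight @@;
   datalines;
53.0 80  67.5 344  72.0 416  72.0 348
73.5 262  68.5 360  73.0 332  37.0 34
;
run;

proc corr data=bears pearson spearman nosimple;
   var weight length;
run;
```

The slides report r=0.897 (p=0.0025) for Pearson and r=0.35929 (p=0.3821) for Spearman on these data.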
Chi-square Tests
- Chi-Square testing is generally used to test claims about categorical data consisting of frequency counts for different categories
- Uses Chi-square distribution
- Many different types of tests: e.g. Independence, Homogeneity, Goodness of fit, Fisher's exact, McNemar's
Example: Test of Independence
- Let's do a test of independence between sex (M or F) and BMI group (Normal, Overweight, Obese)
- H0: Sex and BMI group are independent
- HA: Sex and BMI group are not independent
SAS Syntax

    proc freq data=mydata.newbmi;
    table sex*owt/nopercent norow nocol expected chisq;
    run;

Explanation of Syntax:
- expected = prints the expected frequency of each cell, calculated under the assumption of independence
- chisq = chi-square test
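For reference, the expected frequencies and the chi-square statistic follow the standard formulas below. They are not shown on the slides. O_ij is the observed count in row i and column j, R_i and C_j are the row and column totals, and n is the overall total.

```latex
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{df} = (r - 1)(c - 1)
```

For the sex (2 levels) by BMI group (3 levels) table, df = (2 − 1)(3 − 1) = 2.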
What is our conclusion?
P-value > 0.05 → FTR Ho
Fisher's Exact Test
- When an expected value is <5, the Chi-square test is not valid
- In this case, we use Fisher's Exact test
- Example: Association between wearing helmets and getting face injuries?

                      Helmet
    Face Injury     yes     no
    yes               2     13
    no                6     19
SAS Syntax

    Data helmet;
    Input helmet $ faceinj $ count @@;
    Datalines;
    yes yes 2 no yes 13
    yes no 6 no no 19
    ;
    run;

    proc freq order=data;
    weight count;
    table faceinj*helmet/nopercent norow nocol expected;
    exact chisq;
    run;

(Output warning) This tells us the Chi-square test is not valid, therefore use Fisher's exact p-value
Using the two-sided p-value and significance level=0.05, what is our conclusion?
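The reason the Chi-square test is flagged can be checked by hand from the helmet table, using expected count = (row total × column total) / n. The row totals are 15 (face injury) and 25 (no face injury), the column totals are 8 (helmet) and 32 (no helmet), and n = 40. The worked arithmetic below is a sketch added for illustration.

```latex
E_{\text{injury, helmet}} = \frac{15 \times 8}{40} = 3, \qquad
E_{\text{injury, no helmet}} = \frac{15 \times 32}{40} = 12, \\
E_{\text{no injury, helmet}} = \frac{25 \times 8}{40} = 5, \qquad
E_{\text{no injury, no helmet}} = \frac{25 \times 32}{40} = 20
```

One of the four cells has an expected count of 3, which is below 5, so Fisher's Exact test is the appropriate choice here.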
T-Test
- Used to compare a continuous variable between two populations or groups of a categorical variable
- Assess difference between the two means
- Assumptions:
  1. Equal variance for both populations
  2. The sample data need to be randomly sampled
  3. The two samples are independent
  4. Small sample size (<30) if it is ND (normally distributed), or
  5. Larger sample size if it is not ND
Example
- Let's examine if the systolic blood pressure is different between a normal blood pressure group (n=15) and a hypertensive group (n=10)
- Ho: µNormal − µHypertensive = 0
- Ha: µNormal − µHypertensive ≠ 0
Input Data into SAS
Normal BP (mmHg)    Hypertensive BP (mmHg)
114 117 130 155 115 115 125 138 148 132 121 100 132 121 100 115 122 156 122 162 140 151 110 156 122 162 130 158

    data bp;
    input SYM $ sbp @@;
    datalines;
    . . .
    ;
    Run;
Mean SBP For Both Groups

    Proc sort;
    By sym;
    Run;

    proc means;
    Var sbp;
    By sym;
    Run;
- Assumption: Check to see if the groups are normally distributed?

    Proc univariate normal plot;
    Var sbp;
    by sym;
    Run;

Normality Check
- Is the normal group ND?
- Is the hypertensive group ND?
- Assumption: Are variances equal?
  - Yes → use the pooled method (t)
  - No → use Satterthwaite's method (t')

    Proc ttest;
    Class sym;
    Var sbp;
    Run;
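For reference, the two methods differ in how the standard error and degrees of freedom are computed. The standard textbook forms are sketched below (they do not appear on the slides), with sample means, variances and sizes for the two groups written as x̄_k, s_k² and n_k.

```latex
\text{Pooled: } t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \quad
\text{df} = n_1 + n_2 - 2

\text{Satterthwaite: } t' = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \quad
\text{df} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}
                 {\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}
```

With n=15 and n=10 as in this example, the pooled test has 15 + 10 − 2 = 23 degrees of freedom. The equality of variances test reported by Proc ttest is the folded F, the larger sample variance divided by the smaller.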
(Output annotations)
- F value > 1 and p-value < 0.05: variances not equal
- Difference between means is significant: CI does not include 0
- Difference between means is significant (p<0.05), so REJECT NULL
Example
- Variances are not equal (p=0.0499<0.05)
- Satterthwaite p=0.0276 <0.05
  → Reject null
  → Blood pressure between the normal and hypertensive groups is significantly different
*Interpret with caution since normal distribution assumption not met*
Further Readings
- Step-By-Step Basic Statistics Using SAS: Exercises. Author: Larry Hatcher
- Data analysis using SAS for Windows: Basic. Author: Mirka Ondrack