analysis using sas

SAS INDIVIDUAL PROJECT AHMAD HILMI BIN AZMAN

2010704243

Meister meister_hilmi@yahoo.com

UNIVERSITI TEKNOLOGI MARA (UiTM)

INDIVIDUAL PROJECT

SAS PROGRAMMING

AHMAD HILMI BIN AZMAN

PREPARED FOR

MADAM WAN FAIROS BT WAN YAACOB

26 DECEMBER 2012

Table of Contents 1.0 TASK 1 ............................................................................................................................................... 3

1.1 Introduction .................................................................................................................................. 3

1.2 SAS Macro Programming .............................................................................................................. 5

a) Substituting text with %LET .............................................................................................................. 5

b) Creating Modular Code with Macros ................................................................................................ 7

c) Adding Parameters to Macros .......................................................................................................... 8

2.0 TASK 2 ............................................................................................................................................... 9

2.1 Introduction .................................................................................................................................. 9

2.1.1 Source of Dataset ...................................................................................................................... 9

2.1.2 Description of Dataset .............................................................................................................. 9

2.2 Analysis of Variance (ANOVA) ..................................................................................................... 10

2.2.1 Descriptive Analysis ................................................................................................................ 10

2.2.2 Analysis of Variance (ANOVA) ................................................................................................. 12

2.2.3 Report on Analysis of Variance (ANOVA) ................................................................................ 13

2.3 Regression Analysis ..................................................................................................................... 14

2.3.1 Model Adequacy Checking ...................................................................................................... 14

2.3.2 Regression Model ................................................................................................................... 16

2.3.3 Correlation Analysis ................................................................................................................ 18

2.3.4 Multicollinearity Test .............................................................................................................. 19

2.3.5 Stepwise Analysis .................................................................................................................... 20

2.3.6 Lack of Fit Test......................................................................................................................... 24

2.3.7 Report on Regression Analysis ................................................................................................ 26

2.4 Correlation Analysis using Macro ............................................................................................... 27

2.4.1 Report on Correlation Analysis ............................................................................................... 28

3.0 TASK 3 ............................................................................................................................................. 29

3.1 Question 1 ................................................................................................................................... 29

3.2 Question 2 ................................................................................................................................... 30

4.0 References ...................................................................................................................................... 39

1.0 TASK 1

1.1 Introduction

The data is about Sales of Mini Supermarket. This data is about the study of sales of mini

supermarket of different branch. There are all 16 observation from this data and 6 variables.

The variables are Branch, State, Date, NumWorker, SaleYear and Manager. The variable

description are shown in the table below.

Variable Name Description

Branch List branch area in Malaysia

State Name of state of the branch

Date Date of the branch been open

NumWorker Number of worker for each branch

SaleYear Number of sales per year

Manager Name of manager in charge of the branch

The SAS Command:

libname indi 'C:\Users\Meister\Documents\SAS\individual project';

data indi.task;

infile 'C:\Users\Meister\Documents\SAS\individual

project\data\task1.txt';

input Branch $ 1-16 State $ 17-32 Date 33-48 NumWorker 49-56 SaleYear

57-65 ManagerName $ 66-75;

proc print data=indi.task;

title 'Meister Supermarket Sales';

format SaleYear dollar9. Date date9.;

The data:

1.2 SAS Macro Programming

a) Substituting text with %LET

Getting sales from Penang.

data indi.task;

%let slecstate = Penang;

proc print data=indi.task noobs;

title "Meister Supermarket Sales in Penang";

where State = "&slecstate";

Getting State and Branch in charge by Mahmood.

data indi.task;

%let slecnum = Mahmood;

proc print data=indi.task noobs;

title "Meister Supermarket Sales in Penang";

where ManagerName = "&slecnum";

var managername branch state;

b) Creating Modular Code with Macros

data indi.task;

%MACRO meister ;

proc sort data=indi.task;

by descending date;

proc print data=indi.task;

var Branch State ManagerName Date;

title 'Meister Supermarket by Branch Date Open';

format SaleYear dollar9. Date year4.;

%MEND sample;

%meister

c) Adding Parameters to Macros

data indi.task;

%MACRO select(State=,sortvar=);

proc sort data = indi.task out = indi.task1;

by &sortvar;

where State="&state";

proc print data=indi.task1;

var Branch State Date;

format SaleYear dollar9. Date year4.;

TITLE1 "States by Date Order";

%MEND select;

%select (State=Penang, sortvar=date)

%select (State=Selangor, sortvar=date);

2.0 TASK 2

2.1 Introduction

2.1.1 Source of Dataset

We are taking our data from database from New York University Stern (NYU Stern). The title

of the data is Movie Buzz Data

2.1.2 Description of Dataset

Our data is all about movie. We were using movie sales data. Where this data have 62

observations which are from different movies. This dataset contain 12 variables. The data

have 6 quantitative variables and 6 qualitative variables.

The 6 quantitative variables are Budget, StarPower, Addict, ComingSoon,

Fandango and CantWait. Variable Budget mean production budget in million dollar.

StarPower stand for star power measure by index of star poser. Variable Addict is the

variable that count people view watching trailer at traileraddict.com. ComingSoon is the

variable that count the comment on message board at comingsoon.net. The variable

Fandago is variable that count people attention on the movie in fandango.com. The last

variable CantWait stand for can’t wait to see vote in fandango.com. Here are summarize of

the variable description.

Variables Descriptions

Budget Production budget ($ million)

StarPower Index of star poser

Addict Trailer views at traileraddict.com

ComingSoon Number of people comment on message board at comingsoon.net

Fandago Number of people read review at fandango.com

CantWait Percentage of Fandago votes that can’t wait to see

The 6 qualitative variables are MPRating, Sequel, Action, Comedy, Animated and Horror.

Variable MPRating mean MPAA rating code. Sequel stand for sequel movie. Variable Action

is the variable for action movie. Comedy is the variable for comedy movie. The variable

Animated is the variable for animated movie. The last variable Horror is the variable for

horror movie.

Variables Descriptions

MPRating MPAA Rating where code 1=G (general) 2=PG (parental guide) 3=PG13 (parental guide and may not appropriate under 13 years old) 4=R (Restricted)

Sequel Sequel movie where code 1=sequel 2=not sequel

Action Action movie where code 1=action film 2=not action film

Comedy Comedy movie where code 1=comedy film 2=not comedy film

Animated Animated movie where code 1=animated film 2=not animated film

Horror Horror movie where code 1=horror film 2=not horror film

2.2 Analysis of Variance (ANOVA)

The analysis of variance (ANOVA) been done towards sales and types of movie. The type of

movie consist of variable comedy animated horror and action.

2.2.1 Descriptive Analysis

The coding:

libname sas 'C:\Users\Meister\Documents\SAS\project';

data sas.anova;

set work.project;

length type $ 10;

if action=1 then type='action';

else if comedy=1 then type='comedy';

else if animated=1 then type='animated';

else if horror=1 then type='horror';

else if type=1 then delete;

keep type sales;

proc univariate data = sas.anova normal plot;

class type;

1. Normality analysis for variable horror

Based on Kolmogorov-Smirnov analysis, the p-value is 0.15 where the value is greater

than alpha 0.05. We can conclude that the variable horror is normal.

2. Normality analysis for variable comedy

than alpha 0.05. We can conclude that the variable comedy is normal.

3. Normality analysis for variable animated

than alpha 0.05. We can conclude that the variable animated is normal.

4. Normality analysis for variable action

than alpha 0.05. We can conclude that the variable action is normal.

2.2.2 Analysis of Variance (ANOVA)

The coding:

libname sas 'C:\Users\Meister\Documents\SAS\project';

data sas.anova;

set work.project;

length type $ 10;

if action=1 then type='action';

else if comedy=1 then type='comedy';

else if animated=1 then type='animated';

else if horror=1 then type='horror';

else if type=1 then delete;

keep type sales;

proc anova data=sas.anova;

class type;

model sales=type;

means type / hovtest=bf;

Based on the analysis of variance, the p-value is 0.5102 where this value is greater than

alpha 0.05. we can conclude that the sales does not have relationship with the types of

movie.

2.2.3 Report on Analysis of Variance (ANOVA)

The analysis of variance (ANOVA) been done towards sales and types of movie.

The type of movie consist of 4 variables which the variable are comedy, animated, horror

and action. All the 4 variables must first be analyze for normality. Therefore descriptive

analysis been done. The analysis been done each of the variables.

Based on Kolmogorov-Smirnov analysis for variable Horror, the p-value is 0.15

where the value is greater than alpha 0.05. Based on this result, we can conclude that the

variable horror is normal. For variable Comedy, the Kolmogorov-Smirnov analysis show that

the p-value is 0.1. Where this value is greater than alpha 0.05. This mean that we can

conclude that the variable comedy is normal. Then for variable animated, based on

Kolmogorov-Smirnov analysis, the p-value is 0.15 where the value is greater than alpha 0.05.

By this we can conclude that the variable animated is normal. Next is analysis for variable

action. Based on Kolmogorov-Smirnov analysis, the p-value is 0.0851 where the value is

greater than alpha 0.05. Therefore we can conclude that the variable action is normal.

After all variable been confirm normal, we do the analysis of variance toward the

sales of the movie. Based on the analysis of variance, the p-value is 0.5102 where this value

is greater than alpha 0.05. By this we can conclude that the sales does not have relationship

with the types of movie.

2.3 Regression Analysis

2.3.1 Model Adequacy Checking

The coding for p-p plot original data:

proc import out = indi.project

datafile = "C:\Users\Meister\Documents\SAS\individual

project\movie.xlsx"

dbms = xlsx replace;

proc capability data = indi.project normal;

var sales;

qqplot sales/ normal;

ppplot sales/ normal;

histogram / normal;

inset mean std;

The coding for p-p plot transform data:

data indi.trans;

set indi.project;

transform = ln(sales);

proc capability data = indi.trans normal;

var transform;

qqplot transform / normal;

ppplot transform / normal;

histogram / normal;

inset mean std;

P-value of Kolmogorov-Smirnov less than alpha. We can conclude that the distribution is not normal. Transformation needed

P-value of Kolmogorov-Smirnov more than alpha. We can conclude that the distribution is normal. Transformation succeed.

2.3.2 Regression Model

The coding for regression model of the transform data:

project\movie.xlsx"

data indi.trans;

set indi.project;

merge indi.trans,indi.project;

proc reg data=indi.trans ;

model transform = MPRATING BUDGET STARPOWR SEQUEL ACTION COMEDY

ANIMATED

HORROR ADDICT CMNGSOON FANDANGO CNTWAIT;

plot residual.*cases.;

The model is significant since the p-value = 0.0001 less than alpha 0.05

Regression model

Sales = 15.25374 – 0.21222MPRating + 0.00515Budget – 0.00471StarPowr + 0.39390Sequel –

0.74909Action – 0.00164Comedy – 0.82118Animated + 0.43770Horror + 0.00002216Addict –

0.00013618CmngSoon + 0.00020521Fandango + 3.29154CntWait

2.3.3 Correlation Analysis

The coding for correlation:

proc corr data=indi.trans;

var transform BUDGET STARPOWR ADDICT CMNGSOON FANDANGO CNTWAIT ;

The correlation matrix above shows the coefficient of Pearson correlation between

quantitative variables in the data set. There is 5 predictor variables that are significant correlated

with the dependent variable (Sales). That is BUDGET, ADDICT, CMGSOON, FANDAGO and

CNTWAIT with correlation 0.45708, 0.43750, 0.0133, 0.37974 and 0.65501 respectively. Some of

the predictor variables also shows a correlation exist among them. BUDGET and STARPOWR have

a significant correlated and this indicates that a movie with high STARPOWR tend to increase the

budget to make the movie.

2.3.4 Multicollinearity Test

ANIMATED

HORROR ADDICT CMNGSOON FANDANGO CNTWAIT / vif tol ;

Variables Interpretation of Collinearity Statistics

Tolerance VIF

MPAA Rating No multicollinearity since tolerance value 0.602 more than 0.2

No multicollinearity since VIF value 1.660 less than 10

Budget No multicollinearity since tolerance value 0.401 more than 0.2

StarPower No multicollinearity since tolerance value 0.547 more than 0.2

Sequel No multicollinearity since tolerance value 0.544 more than 0.2

Action No multicollinearity since tolerance value 0.498 more than 0.2

Comedy No multicollinearity since tolerance value 0.515 more than 0.2

Animated No multicollinearity since tolerance value 0.523 more than 0.2

Horor No multicollinearity since tolerance value 0.617 more than 0.2

Addict No multicollinearity since tolerance value 0.465 more than 0.2

Coming Soon No multicollinearity since tolerance value 0.376 more than 0.2

Fandago No multicollinearity since tolerance value 0.570 more than 0.2

CantWait No multicollinearity since tolerance value 0.440 more than 0.2

2.3.5 Stepwise Analysis

The coding for stepwise:

project\movie.xlsx"

data indi.trans;

set indi.project;

ANIMATED

HORROR ADDICT CMNGSOON FANDANGO CNTWAIT / selection=stepwise ;

1. Steps of stepwise

Step 1 Step 2

1. Variable CNTWAIT have the higher correlation value. Which is 0.4290.

2. Thus the variable is selected as the first variable to be enter to the model.

1. Variable ACTION have the higher correlation value. Which is 0.4856.

2. Thus the variable is selected to be enter to the model.

3. The two variable tested and no variable are deleted.

Step 3 Step 4

1. Variable ADDICT have the higher correlation value. Which is 0.5217.

3. The three variable tested and no variable are deleted.

1. Variable SEQUEL have the higher correlation value. Which is 0.5420.

3. The four variable tested and no variable are deleted.

Summary of the stepwise selection:

Based on 12 variable 4 are selected for the final model. All the selected variables are

CNTWAIT, ADDICT, SEQUEL and ACTION. Therefore the final model is

y = 14.56468 + 0.41590SEQUEL – 0.69464ACTION + 0.00002895ADDICT – 3.81397CNTWAIT

Here the test of the model

Hypothesis

H0: 𝛽1=𝛽2=𝛽3=𝛽4 =0

H1: at least one 𝛽i is not equal zero

Significant value

α=0.05

Test statistic

P value=0.0001

Decision

Since P-value=0.0001 < α=0.05 to reject H0.

Conclusion

The model is significant.

2.3.6 Lack of Fit Test

The coding for lack of fit:

Full model:

project\movie.xlsx"

data indi.trans;

set indi.project;

ANIMATED

HORROR ADDICT CMNGSOON FANDANGO CNTWAIT / lackfit ;

Full reduced model:

model transform = SEQUEL ACTION ADDICT CNTWAIT / lackfit ;

Full model Reduced model (after stepwise)

Hypothesis H0: There is no lack of fit H1: There is lack of fit Significant value α=0.05 Test statistic P value=0.0001 Decision Since P-value=0.0001 < α=0.05 to reject H0.

Conclusion The model have lack of fit.

Hypothesis H0: There is no lack of fit H1: There is lack of fit Significant value α=0.05 Test statistic P value=0.0001 Decision Since P-value=0.0001 < α=0.05 to reject H0.

Conclusion The model have lack of fit.

2.3.7 Report on Regression Analysis

To do regression model, first we plot the p-p plot to see either the data is normal

or not. Based on descriptive analysis the p-value of Kolmogorov-Smirnov less than alpha.

We can conclude that the distribution is not normal. Then transformation needed to be

done. After transformation been done using natural log transformation. The Kolmogorov-

Smirnov value is 0.15 which is more than alpha. We can conclude that the distribution is

normal. Therefore the transformation is succeed.

Then we run the data to get the regression model. The regression model is

Sales = 15.25374 – 0.21222MPRating + 0.00515Budget – 0.00471StarPowr +

0.39390Sequel – 0.74909Action – 0.00164Comedy – 0.82118Animated + 0.43770Horror

+ 0.00002216Addict – 0.00013618CmngSoon + 0.00020521Fandango + 3.29154CntWait.

For correlation matrix shows the coefficient of Pearson correlation between

quantitative variables in the data set. There is 5 predictor variables that are significant

correlated with the dependent variable (Sales). That is BUDGET, ADDICT, CMGSOON,

FANDAGO and CNTWAIT with correlation 0.45708, 0.43750, 0.0133, 0.37974 and 0.65501

respectively. Some of the predictor variables also shows a correlation exist among them.

BUDGET and STARPOWR have a significant correlated and this indicates that a movie with

high STARPOWR tend to increase the budget to make the movie.

Multicollinearity analysis been done and based on the result we can conclude that

there is multicollinearity exist. This is because since all the tolerance value more than 0.2

and VIF value is less than 10.

Then stepwise analysis been done. Based on 12 variable 4 are selected for the final

model. All the selected variables are CNTWAIT, ADDICT, SEQUEL and ACTION. Therefore

the final model is

y = 14.56468 + 0.41590SEQUEL – 0.69464ACTION + 0.00002895ADDICT –

3.81397CNTWAIT

For lack of fit analysis, we can conclude that the full model and reduced model is

not fit. This is because the p-value 0.0001 less than alpha.

2.4 Correlation Analysis using Macro

project\movie.xlsx"

data indi.trans;

set indi.project;

%MACRO correlation ;

proc corr data=indi.trans;

var transform BUDGET STARPOWR ADDICT CMNGSOON FANDANGO CNTWAIT;

%MEND sample;

%correlation

2.4.1 Report on Correlation Analysis

The correlation matrix above shows the coefficient of Pearson correlation

between quantitative variables in the data set. There is 5 predictor variables that are

significant correlated with the dependent variable (Sales). That is BUDGET, ADDICT,

CMGSOON, FANDAGO and CNTWAIT with correlation 0.45708, 0.43750, 0.0133, 0.37974

and 0.65501 respectively. Some of the predictor variables also shows a correlation exist

among them. BUDGET and STARPOWR have a significant correlated and this indicates that

a movie with high STARPOWR tend to increase the budget to make the movie.

3.0 TASK 3

3.1 Question 1

Coding

project\data\country.txt"

dbms = tab replace;

proc print data = indi.project;

Partial output

3.2 Question 2

a) Total population of the world

SAS Command:

proc sql;

select sum(population)

as TotalPopulation

from indi.project;

Result:

b) List all region

SAS Command:

proc sql;

title 'List of All Region';

select distinct(region)

from indi.project

group by region;

Result:

c) Total GDP of Cambodia

SAS Command:

proc sql;

title 'Total GDP of Cambodia';

select name , sum(gdp) as total

from indi.project

where name='Cambodia';

Result:

d) Number of countries with area greater than 1000000

SAS Command:

proc sql;

title 'Number of Country with Area more than 1000000';

select count(name) as NumberOfCountry

from indi.project

where area ge 1000000;

Result:

e) Total population of France, Greece and Spain ????

SAS Command:

proc sql;

title Total Population France Greece and Spain;

select sum(population) as total label='Total Population'

from indi.project

where name in ('France', 'Greece', 'Spain');

Result:

f) Number of countries by region

SAS Command:

proc sql;

title 'Number of Countries by Region';

select region, count(name) as no label='Number of Countries'

from indi.project

group by region;

Result:

g) Number of countries by region with population at least 15 million

SAS Command:

proc sql;

title 'Number of Countries by Region with Population more than 15

million';

select region, count(name) as no label='Number of Country'

from indi.project

where population ge 15000000

group by region;

Result:

h) List of region with population at least 150 million

SAS Command:

proc sql;

title 'List of Region with Population more than 150 million';

select distinct(region)

from indi.project

where population ge 150000000

group by region;

Result:

i) Total population and GDP of South Asia

SAS Command:

proc sql;

title 'Total Population and GDP of South Asia';

select distinct(region),sum(population) as Pop label='Total

Population', sum(gdp) as ttl label='Total GDP'

from indi.project

where region='South Asia';

Result:

j) List of countries per capita with population more than 200 million

SAS Command:

proc sql;

title 'List of Countries per Capita';

title2 'Population more than 200 million';

select name, (gdp/population) as capita

from indi.project

where population ge 200000000;

Result:

k) List of Country for region middle east in million

SAS Command:

proc sql;

title 'List of Country for Middle East in million';

select name, (population/1000000) as pop label='Population'

from indi.project

where region='Middle East';

Result:

l) List of Country for region middle east in million

SAS Command:

proc sql;

title Total Population France Greece and Italy;

select name, population

from indi.project

where name in ('France', 'Greece', 'Italy');

Result:

m) List of country name with United

SAS Command:

proc sql;

title 'Countries Name with "UNITED"';

select name

from indi.project

having name ? 'United';

Result:

n) List of countries that population larger than Russia

SAS Command:

proc sql;

title 'List of Countries that Population larger than Russia';

select name

from indi.project

where population gt (select population from indi.project where

name='Russia') ;

Result:

o) List of country name with United

SAS Command:

proc sql;

title 'List of Countries that region of India and Iran';

select name, region

from indi.project

where region in (select region from indi.project where name in

('India','Iran')) ;

Result:

p) List of country name with United

SAS Command:

proc sql;

title 'List of Countries';

title2 'region more than Canada less than Algeria';

select name

from indi.project

where population gt (select population from indi.project where

name in ('Canada'))

and population lt (select population from indi.project where name

in ('Algeria'));

Result:

4.0 References

Books:

Douglas C. Montgomery, Elizabeth A. Peck and G. Geoffrey Vining [2006], “Introduction to

Linear Regression Analysis”

Michael H. Kutner, Christopher J. Nachtsheim, John Neter and William Li, “Applied Linear

Statistical Models”

Website:

http://sites.stat.psu.edu/~xzhan/stat597c/sp04/Chapter7.htm

http://sastipsbyhal.blogspot.com/2012/01/sas-date-calculator-now-available.html

analysis using sas

Documents

new developments in survival analysis using sas...

multivariate techniques: an overview using sas analysis...

clinical graphs using sas - lex jansen · phuse 2016 1...

using sas for the design analysis and visualization of...

survival analysis using sas proc · pdf filesurvival...

analysis of clinical trials using sas　勉強用 isseing333

data analysis using sas enterprise guide.pdf

analysis of survey data using the sas® survey procedures...

survival analysis using sas · pdf filesurvival analysis...

estimating and modelling relative survival using sas · 6...

regression analysis using sas and stata · regression...

analysis of clinical trials using sas- a. dmitrienko, et...

competing risk survival analysis using phreg in sas 9 group...

competing risks in survival analysis using · pdf...

survival analysis eberg - sas · survival analysis using...

statistical power analysis using sas and r

a sas® macro for complex sample data analysis using ... ·...

using sas at sas: the mining of sas technical support1 using...

regression analysis using sas and stata€¦ · regression...

relative survival analysis using sas