b409 w11 sas collaborative stats guide v4.2
TRANSCRIPT
Table of Contents

Numerical Summaries
Variation Within The Data
Confidence Intervals
Simple Regression
Correlation Coefficient
Test of Significance
Limits (Confidence / Prediction)
Appendix
Numerical Summaries
Team 1
Baljeet Kaur
Trystan McDonald
Jaswant Seahra
Mriseal Sinha
Surbhi Surbhi
Theo Wolski
Chapter 01
Introduction
Collecting, processing and transforming data are skills that are widely sought after in today's business world. In order to make effective business decisions you must possess the skills necessary to analyse,
manipulate and present findings derived from the mining of raw data.
Data can be produced in numerical and non-numerical forms. When deducing the significance of data, it
is advantageous to provide context to the process; knowing where (location) and how your data fits
(dispersion) into your query can provide valuable insight into your department’s current and future
campaigns.
Numerical summaries that present data by location include the data's mean, mode, and median.
Data presented by dispersion is described by its range and standard deviation.
www.palgrave.com/business/taylor/taylor1/lecturers/
Numerical Summaries
Definition: A set of numeric data summarized and described by two kinds of parameters: a measure of centrality and a measure of spread.
Measure of centrality: data described by its mean, median and mode.
Measure of spread: data described by its range, interquartile range and standard deviation.
Mean: The arithmetic average of all data
Median: The middle value of ordered data. Data must be ordinal or interval.
Mode: The most commonly occurring value in a data set.
Terms and Concepts
Mean: The arithmetic average of all data points.
Mean = Σx / n, where Σx is the sum of all data points and n is the number of data points.
Example - 3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
The sum of these numbers is 330
There are fifteen numbers.
Mean = 330 / 15 = 22
Median: The centre value of ordinal or interval data, arranged in ascending order.
3,5,7,12,13,14,20,23,23,23,23,29,39,40,56
There are 15 values, an odd number, so the median is the single middle value: the 8th of the 15 ordered values.
Median = 23
Mode: The most commonly occurring value in a data set.
3,5,7,12,13,14,20,23,23,23,23,29,39,40,56
23 is the Mode because it is repeated 4 times.
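As a quick cross-check of the three measures above (the guide itself works in SAS; this is just an illustrative sketch in Python):

```python
from statistics import mean, median, mode

# The data set used throughout this section
data = [3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]

print(mean(data))    # arithmetic average: 330 / 15 = 22
print(median(data))  # middle (8th) value of the 15 sorted values: 23
print(mode(data))    # most frequent value: 23 (it appears four times)
```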
Range: Largest value - smallest value.
Example: 2, 6, 2, 4, 1, 4, 3, 1, 1
6 - 1 = 5
Quartile Range: The spread of the ordered data around its centre, measured between the lower and upper quartiles.
Example:
1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57
Lower quartile = the value at the centre of the first half of the data, or Quartile 1 (Q1)
The median of 1, 11, 15, 19, 20, 24
(third + fourth observations) ÷ 2
(15 + 19) ÷ 2=17
Upper quartile = the value at the centre of the second half of the data, or Quartile 3 (Q3)
The median of 28, 34, 37, 47, 50, 57
(third + fourth observations) ÷ 2
(37 + 47) ÷ 2 = 42
The interquartile range is Q3 - Q1
42 - 17 = 25
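The quartile calculation above (the median of each half of the ordered data) can be sketched the same way. Note that several quartile conventions exist; this sketch follows the method used in the example:

```python
from statistics import median

data = [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57]  # already ordered

half = len(data) // 2
q1 = median(data[:half])   # median of the lower half: (15 + 19) / 2
q3 = median(data[half:])   # median of the upper half: (37 + 47) / 2
iqr = q3 - q1

print(q1, q3, iqr)  # 17.0 42.0 25.0
```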
Standard Deviation: Also known as 'root mean square deviation', it is calculated by squaring the deviations from the mean, averaging the squared deviations, and then taking the square root of that average.
s = √( Σ(x − x̄)² / (n − 1) ), that is, the square root of the variance (using n − 1 for a sample).
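The standard deviation of the data set from the mean example can be checked the same way (using the n − 1 sample form; an illustrative sketch, not SAS):

```python
from math import sqrt
from statistics import stdev

data = [3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]

m = sum(data) / len(data)             # mean = 22
ss = sum((x - m) ** 2 for x in data)  # sum of squared deviations from the mean
s = sqrt(ss / (len(data) - 1))        # sample standard deviation

print(round(s, 3))            # 14.506
print(round(stdev(data), 3))  # the library routine agrees: 14.506
```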
http://www.mathsisfun.com/median.html
www.palgrave.com/business/taylor
Example: http://hubpages.com/hub/Musical-Terms
Numeric summaries are to a mathematician what sheet music is to a musician. As we know, numerical
summaries include equations of mean, median, range, quartile range and standard deviation; each of
these equations allows for the input of data for the purpose of analysis. Without the numeric summaries
and equations for these terms one would not be able to determine the desired findings, much like
without sheet music a musician would not be able to play his or her instrument.
Just as sheet music is the language that musicians speak, numerical summaries are the language of statisticians.
A bar line is used for separating musical notes into areas that are manageable, allowing the musician to
read where the tempo and notes are going within the song. In statistics, for example, if someone asks
you to measure 46 points of data within a particular data set without identifying the centre point (mean
or median) you cannot effectively measure the data. Like in the use of sheet music, following the
building blocks of a process is an essential first step in determining the outcome of the data or song
being played.
[Diagram: a five-line musical staff, with bar lines and a double bar line; the span from one bar line to the next is called a measure. In the analogy, the mean corresponds to the staff (the reference point for its four spaces), and the standard deviation corresponds to the distance from one bar line to another.]
Implementing with SAS
In this tutorial we'll walk through the Heights data set. It's a rather simple data set, but effective for showing numerical summaries.
We have three columns, shown above: Family, Gender, and Height. Although you can do many things with numerical summaries in SAS by relating variables to one another, we will just be working with the Height variable. Click Tasks > Describe > Summary Statistics (shown below).
You'll notice a window pop up. This is the main interface you will use to run numerical summaries. You can do a lot of neat things with this SAS task, but for this exercise we will just be using the variable Height. Click it and drag it over to the right-hand side under Analysis variables.
On the left-hand side of the window you will see a tab labelled "Statistics"; click on it. The window will change to list all the different numerical summaries the data can be run through. Click the ones you need for your research. In this example we used: Mean, Standard Deviation, Minimum, Maximum, and Number of Observations.
If you wish for a more visual element to your research, click plots and pick a graph design.
When you’re done click RUN at the bottom of the window.
After clicking Run, SAS will process the data and show the results you requested: at the top are the mean, standard deviation, variance, minimum, maximum, range, and the total number of values used to produce this information. In this tutorial we can see that the average (mean) of the Height values equals 66.83, with a standard deviation of 2.72, a variance of 7.4, and a range of 9.
Conclusion
With this application of SAS you can make statistical observations and support decisions, depending on the marketing questions about numerical summaries you need to answer in your career.
Chapter 02
Variation Within The Data
Blueprint
Christopher Atkinson
Fredric Ayih
Gauvtam Bajaaj
Danusha Fernando
Paramjeet Kaur
Introduction
Variation is seen in every part of our day to day lives, from our home to the workplace to anything in
which we can observe a difference. On a daily basis, you see cars of different brands, models, colors and
sizes. The very differences in these observations illustrate variance. When looking at a dataset of all
Toyota cars for example, one can observe that they come in different prices, sizes and features such as
engine size, horsepower and number of cylinders. These differences within a dataset illustrate the
concept of variation within data.
What Is Variation?
Data variation measures the spread of data around the mean. It shows the differences in the variables
which may be quantitative as well as qualitative. We may have two sets of data with varying input
values but similar means. Here variations may be observed in terms of the number of variable inputs,
range of data, dispersion of the data etc. In order to measure the amount of variability between the
data sets we use statistical tools such as variance and standard deviation. Variance measures the
difference between each variable and the mean, squared to remove the sign effects. The standard
deviation is the square root of the variance which brings the measure back to scale. Together with
mean, standard deviation gives a first level indication of the characteristics of any set of numbers.
Standard Deviation indicates the degree to which the values are clustered around the mean. A large
amount of dispersion explains how far results are from the expected level of mean. Thus, the variations
within the data are measured in a quantitative manner. Pictorial representation of variations within data
can be shown using bars and charts.
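To make the idea concrete before moving to the CARS data, here is a small sketch with invented numbers (hypothetical, not from any data set in this guide): two data sets share the same mean but differ sharply in spread, which is exactly what the standard deviation captures.

```python
from statistics import mean, pstdev

# Two hypothetical data sets with identical means but different variation
steady = [48, 49, 50, 51, 52]
volatile = [20, 35, 50, 65, 80]

print(mean(steady), mean(volatile))  # both 50: identical centres
print(pstdev(steady))                # small: values cluster tightly around the mean
print(pstdev(volatile))              # large: values spread far from the mean
```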
What causes variation within the data?
It becomes necessary to find out whether the variation within the data is a regular event or a random event, so that the results attained do not come as a surprise to us. There are common causes, such as process inputs and conditions, that contribute to regular everyday variation. For example, a known 3% rate of errors in the data still allows statisticians to forecast the temperature within a desired range. On the other hand, there may be special causes, such as the random occurrence of a temporary event, which create variation within the existing data and make it difficult to work with. For example, a sudden north-east wind may cause a sudden drop in temperatures, making the temperature difficult to predict.
Process Flow for Implementation in SAS
To further understand the concept of variance, we will be exploring and analyzing the CARS dataset,
which contains a variety of variables such as origin, type, horsepower, number of cylinders and retail
price on vehicles sold by dealer. We will start by opening the dataset and familiarizing ourselves with the
data and the variables. Following this, we will create several reports to describe the data, identify trends
and explain variance within the dataset by using both numerical and categorical variables. You will also
be given an opportunity to filter the data in order to focus on a smaller set of variables to run reports
from.
Creating a Simple Bar Chart: Open the Cars data table by selecting Servers > Libraries > SASHELP from the Server List. Navigate to the CARS data set and select it. Click Open. On the menu bar click Tasks and then select Graph to open Bar Chart. The Bar Chart window has five pages: Bar Chart, Data, Appearance, Titles and Properties. In the Bar Chart page, click the Simple Vertical Bar (Figure 2.1).
Figure 2.1
To produce a report to identify the frequency in each category of variable Type, click the variable Type
and drag it to the Column to chart role (Figure 2.2).
Figure 2.2
Click Run to run the task and produce report. To make changes to the title, click Modify Task and give an
appropriate name to the Title of graph (Figure 2.3).
Figure 2.3
Rerun the task by clicking the Run button.
Figure 2.4
The resulting graph (Figure 2.4) shows the number of cars in the database by type. There are more sedans than any other type of car, but there are also some SUVs, sports cars and trucks. Note that the number of
cars in each type changes as you look at a different type. This illustrates the concept of variance, when it
comes to frequency.
Creating a Scatter Plot:
To generate a scatter plot, return to the Cars data set and click Tasks, then select Graph to open Scatter Plot. Select the simple two-dimensional scatter plot in the Scatter Plot page (Figure 2.5).
Figure 2.5
Click Data in the selection pane to assign columns. Drag Horsepower to the Horizontal task role, followed by MSRP on the Vertical task role (Figure 2.6). Rename the titles and click Run.
Figure 2.6
Figure 2.7
This scatter plot (Figure 2.7) displays horsepower against the manufacturer suggested retail price (MSRP). Most cars have horsepower between 100 and 300 and are priced below $50,000. Due to the variance in the data, you can observe that certain cars have horsepower values as high as 500 and some cars are priced closer to $200,000. The scatter plot lets you visualize variation by assigning a point to every observation, based on two measurable variables.
Creating a Tile Chart:
Click Tasks and then Graph to open the Tile Chart. For this report, click the variable Type and drag it to
Classification variable under column roles, drag variable Invoice to the Color analysis and drag variable
Horsepower to Size Analysis variable (Figure 2.8).
Figure 2.8
Click Titles in the list of options in the selection pane and click Graph. From the drop down arrow under
Tile Layout click Flow layout. In the Title page of the Tile window give an appropriate name to the chart.
Click Run.
Figure 2.9
In this chart (Figure 2.9) variance in the data set is expressed through numerical and categorical variables (Type, Invoice and Horsepower). The cars in the database are arranged into boxes based on their type,
and the sizes of the boxes are determined by the total horsepower in each type. Note that sedans do
not have the highest horsepower per car, but because the database contains a lot more sedans than any
other type of car (see frequency by vehicle chart), the total horsepower of sedans is higher than any
other type of car. This is why the sedan box is the largest, and the hybrid box is the smallest. Lastly, the
variance in Total invoice is illustrated by the color of the box. Note that Sedan is in a darker green not
because they are more expensive, but because there are more sedans than any other car type; hence
the Total invoice for sedans is much higher.
Filtering Data:
To filter the Cars data table, refer back to the process flow and click the Tasks tab on the menu bar and
select Data to open Filter and Sort. Click and drag all the variables in the selected pane. To filter the
data, click the Filter tab. The filter page contains four empty boxes. Click the down-arrow on the first
box and select Type as variable; in the second box select the criteria as Equal to from the drop-down list,
in the third box click the ellipsis button and select the value as Sports and click OK.
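The Filter and Sort task above simply keeps the rows whose Type equals Sports. As a language-agnostic sketch of the same operation (the miniature table below is hypothetical, not the real CARS data):

```python
# A hypothetical miniature of the CARS table (the real data lives in SASHELP.CARS)
cars = [
    {"Model": "A", "Type": "Sedan",  "MSRP": 30000},
    {"Model": "B", "Type": "Sports", "MSRP": 90000},
    {"Model": "C", "Type": "SUV",    "MSRP": 45000},
    {"Model": "D", "Type": "Sports", "MSRP": 120000},
]

# Equivalent of the Filter tab: keep rows where Type is equal to "Sports"
sports = [row for row in cars if row["Type"] == "Sports"]

print([row["Model"] for row in sports])  # ['B', 'D']
```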
Creating a Stacked Vertical Bar chart from the Filtered Data:
To generate a Stacked Vertical Bar, click the Tasks tab on the menu bar and select Graph to open the
Bar Chart window. In the Bar Chart page click the Stacked Vertical Bar. In the Data page drag the
variable MSRP to Column to Chart and Origin as Stack. Give an appropriate name to the graph and click
Run.
Figure 2.10
Figure 2.10 displays variance within the data on three levels: The manufacturer suggested retail price,
the number of cars or frequency, and the origin of the car. Note that Europe is the only location where
the number of cars at the $90,000 price point is higher than other price points. The bulk of cars
manufactured in Asia are at the $30,000 price point and a little more than half of USA manufactured
cars are at the same retail price. The fact that Europe produces the majority of cars above $90,000 can
indicate their focus on higher end vehicles.
To generate and view a stacked vertical bar with a different variable, click the Tasks tab on the menu bar and select Graph to open the Bar Chart window. In the Bar Chart page click the Stacked Vertical Bar. In the
the Data page drag the variable MSRP to the column to chart and variable Cylinder to stack and Run the
report.
Figure 2.11
The above chart (Figure 2.11) displays variance within the data on three levels: The manufacturer
suggested retail price, the number of cars or frequency, and the number of cylinders. Note that the
origin variable has been replaced by the cylinder variable. The heights of the bars have not changed, and the majority of cars are priced at $30,000. As price increases there are fewer cars with six and eight cylinders available. Cars of four cylinders or less are only available at prices below $30,000, and ten or
twelve cylinder cars are only available above the $90,000 price point. Note that this picture of variance
allows you to identify an outlier: the only car with a price of $180,000 has six cylinders.
Similarly, to generate a chart comparing the variables Engine Size and Cylinders, drag Engine Size to
column to chart and Cylinders to stack to produce a report of two other variables. Give an appropriate
name to the graph and RUN.
Figure 2.12
Figure 2.12 displays variance on three levels: the Engine size (L), the frequency and the cylinder sizes
within each engine size.
Based on what you have learnt thus far, read the following statements and indicate if they are (T) TRUE
or (F) FALSE.
1. The most common engine size is 3.0 [ ]
2. The most common cylinder size is 6 [ ]
3. There are more 8 cylinder cars with 4.2 engine sizes than there are at 5.4 [ ]
4. There are as many 12 cylinder cars as there are 10 cylinder cars [ ]
5. Across all cylinder sizes, the least common engine size is 7.8 [ ]
6. As you increase engine size, the number of cars with four cylinders increases [ ]
Conclusion
As demonstrated, SAS can sort out the variation within data against a specific set of objectives, whether from the perspective of a single department, such as marketing, or of the company as a whole. This allows management to project future strategies from historically available data and draw conclusions that support an overall analysis of the company in the long run.
From engine sizes and miles per gallon (city or highway) to manufacturer names, types and origins of vehicles, SAS provides a relatively easy way to calculate and visually verify the variation of data within different samples. Through charts and graphs, one can arrive at decisions that support strategies (e.g., increasing sales, cutting production of slow-selling vehicles, addressing under-achieving fuel economy) in a simple and comprehensive manner.
Answers for the exercise based on figure 2.12 : 1 – True 2 – True 3 – True 4 – True 5 – True 6 – False
Chapter 03
Confidence Intervals
Spice Girls
Alexandra Gonchar
Ellen Guimaraes
Ksenia Knyazeva
Ekaterina Loskutova
What is a Confidence Interval?
In statistics, a confidence interval is a kind of interval estimate of a population parameter. It is an observed interval, which differs from sample to sample, that typically contains the parameter of interest (such as the population mean) and quantifies how similar the results are likely to be if the experiment is repeated. The confidence level, or confidence coefficient, describes how frequently intervals constructed this way contain the parameter of interest. Because the confidence interval is calculated from a sample and contains the value of a data parameter with a specified probability, the end-points of the interval are called the confidence limits, and the specified probability is called the confidence level.
What is the purpose of a Confidence Interval?
In order to estimate the mean, the standard deviation, and the variance of a population, a random sample is taken from the larger population and a statistic is calculated. It is usually very important to assess the level of reliability of the results provided by the sample. This is where the Confidence Interval comes in.
The Confidence Interval provides a range in which one can be relatively certain that their specific data
mean is located. Therefore, as the name states, a Confidence Interval is used to calculate the
confidence that one can have in the result of a sample.
When are Confidence Intervals most commonly used?
A confidence interval does not say that the true value of the parameter of interest has a particular chance of lying in the specific interval computed from the data actually obtained; rather, the confidence level describes how often intervals constructed this way capture the true value. The Confidence Interval lets us estimate the true mean of a certain data set using the results of previous measurements (sample size, standard deviation, and confidence level). It is used to indicate the reliability of an estimate.
Examples where Confidence Intervals can be used:
– Governments looking to reliably predict population trends
– The likelihood of certain candidates being elected
– Reactions to certain new products
– Survey response rate reliability
– Predicting results based on previous research
Using a Confidence Interval
An example of how one can arrive at a Confidence Interval is the following:
Getting statistics from an entire population may be impossible, information may be correct but
outdated, and response rates on surveys may be very low. Because of this, researchers simplify the
statistical process by picking a sample of the population of interest, finding answers to their research
questions, and trying to estimate the reliability and precision of the results. This reliability estimate is
where using the Confidence Interval comes in.
For example, let's answer the following question: with 95% confidence, what is the average number of languages spoken by each student at George Brown?
We could ask every student at George Brown but that would be time consuming and some students may
not answer truthfully. Therefore, a convenient way to answer our question is by picking and analyzing a
sample that we can work with. This will help us to calculate the Confidence Interval which will be the
answer to our question. In this case, we will pick a reasonably large proportion of the students in the school, so that the results will be representative of the larger population (we will be using a representative class). Once we have chosen the sample, we need to estimate the reliability that the mean of the entire population will be contained in a certain range (the Confidence Interval).
Results:
Mean=2.6 languages per student
Standard deviation=1.836
(Intervals are calculated from the mean, standard deviation and the size of the sample)
By doing the Confidence Interval calculations we arrive at a conclusion: with 95% confidence, the mean number of languages is between 1.945 and 3.255.
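An interval like this follows from the usual large-sample formula x̄ ± z·s/√n. The chapter does not state the class size, so the sketch below assumes n = 30, a hypothetical value chosen because it reproduces the stated interval to within rounding:

```python
from math import sqrt

mean = 2.6   # sample mean (languages per student)
sd = 1.836   # sample standard deviation
n = 30       # assumed sample size (not stated in the chapter)
z = 1.96     # z-value for a 95% confidence level

margin = z * sd / sqrt(n)
lower, upper = mean - margin, mean + margin

print(round(lower, 3), round(upper, 3))  # close to the 1.945-3.255 reported above
```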
Applying Confidence Intervals to SAS
The Distribution Analysis task produces statistics describing the distribution of a single variable. The next example explores the distribution of the variable Height in the Volcanoes data set. In the Process Flow, click the Volcanoes data icon to make it active. Then select Tasks > Describe > Distribution Analysis.
In the Data tab choose the Height variable for analysis. Then in the Distributions tab click Normal.
In the Tables tab you can choose all the statistics you would like to explore. We are particularly
interested in Basic Confidence Intervals and Basic Measures (Mean, Standard Deviation, and Variance).
To measure confidence intervals we have to specify the confidence level in the drop-down box at the top right. You can choose among 90%, 95%, and 99%. After selecting, click Run.
The Resulting Report starts with basic statistic measures about the distribution of the variable: mean,
median, standard deviation, variance, and range. Another section of the report contains confidence
limits assuming normality. This table shows confidence intervals for main parameters (mean, standard
deviation, and variance) with 95% confidence level.
We can also build a plot to better evaluate the normality of the variable's distribution. Click Modify Task and, in the window that opens, click the Plots page. You can choose among different appearances; choose Histogram Plot.
Click Insert Page and choose the statistics you would like to include in the plot (for this example we took sample size, sample mean, and standard deviation). Choose the location of this information on the graph and click Run.
From the example we can see that the sample size is 32. The graph shows that the data is approximately normally distributed and that the mean of the Volcanoes' Height is 3113.563. With 95% confidence, the average volcano height (the mean) lies between 2481.3 and 3745.9.
Chapter 04
Simple Regression
Sukhoi
Amit Bansal
Sheleena Jaria
Kalpesh Patel
Ishan Sangrai
Pranay Sankhe
Introduction to Regression Analysis
In statistical terms, regression is the study of the relationship between variables, so that one may predict the unknown value of one variable from a known value of another variable.
According to the Oxford English Dictionary, the word 'regression' means "stepping back" or "returning to an average value". The term was first used in the 19th century by Sir Francis Galton, who found an interesting result by studying the heights of about 1,000 fathers and sons. His findings were that (i) sons of tall fathers tend to be tall and sons of short fathers tend to be short, but (ii) the mean height of the tall fathers was greater than the mean height of their sons, whereas the mean height of the short fathers' sons was greater than the mean height of the short fathers. Galton termed this tendency of all mankind to turn back toward the average height 'Regression towards Mediocrity', and the line that shows the trend was named the 'Regression Line'.
In the words of M.M. Blair, 'Regression is the measure of the average relationship between two or more variables.'
Regression analysis is used to:
– Predict the value of a dependent variable based on the value of at least one independent
variable.
– Explain the impact of changes in an independent variable on the dependent variable.
Dependent variable: the variable we wish to predict or explain.
Independent variable: the variable used to explain the dependent variable.
Regression Formula
To calculate the relation between X and Y we need an equation:
Regression Equation: Y = a + bX
where X and Y are the variables, b is the slope of the regression line, and a is the intercept of the regression line.
Slope (b) = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²)
Intercept (a) = (ΣY - b(ΣX)) / n
Figure 4.1 shows Simple Regression
As per Figure 4.1, the regression line shows the average relationship between two variables; it is also known as the Line of Best Fit. On the basis of the regression line, we can predict the value of the dependent variable for a given value of the independent variable. The regression line of Y on X gives the best estimate of Y for any given value of X.
Steps In Linear Regression
1. State the hypothesis.
2. State the null hypothesis
3. Gather the data.
4. Compute the regression equation.
5. Examine tests of statistical significance and measures of association.
6. Relate statistical findings to the hypothesis. Accept or reject the null hypothesis.
7. Reject, accept or revise the original hypothesis. Make suggestions for research design and
management aspects of the problem
Regression Example
To illustrate simple regression, let's take a simple example where X is Cattle and Y is Cost. The example shows the relationship between the two. First we need a data set.
Cattle (X) Cost(Y)
3.437 27.698
12.801 57.634
6.136 47.172
11.685 49.295
5.733 24.115
3.021 33.612
1.689 9.512
2.339 14.755
1.025 10.57
2.936 15.394
5.049 27.843
1.693 17.717
1.187 20.253
9.73 37.465
14.325 101.334
7.737 47.427
7.538 35.944
10.211 45.945
8.697 46.89
To find the regression equation, we first find the slope and the intercept, then use them to form the equation.
Step 1: Count the number of values: n = 19.
Step 2: For each pair, find XY, X², and Y².
Step 3: Find ΣX, ΣY, ΣXY, ΣX², and ΣY²:
ΣX = 116.969; ΣY = 670.575; ΣXY = 5570.426; ΣX² = 1036.087; ΣY² = 32134.66
Step 4: Substitute the values into the slope formula:
Slope (b) = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²)
= (19 × 5570.426 - 116.969 × 670.575) / (19 × 1036.087 - 116.969²)
≈ 4.564
Step 5: Now substitute the values into the intercept formula:
Intercept (a) = (ΣY - b(ΣX)) / n
= (670.575 - 4.564 × 116.969) / 19
≈ 7.196
Step 6: Then substitute these values into the regression equation:
Regression Equation: Y = a + bX
= 7.196 + 4.564X
Suppose we want to know the approximate Y value for X = 3.437. We can substitute the value into the equation:
Y = 7.196 + 4.564 × 3.437
= 7.196 + 15.687 = 22.883
The above example shows how to find the relationship between two variables by calculating the regression using the steps described.
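The arithmetic in Steps 1 to 6 can be cross-checked with a short sketch (illustrative only; the chapter's own tool is SAS):

```python
# Cattle (X) and Cost (Y) from the table above
x = [3.437, 12.801, 6.136, 11.685, 5.733, 3.021, 1.689, 2.339, 1.025, 2.936,
     5.049, 1.693, 1.187, 9.73, 14.325, 7.737, 7.538, 10.211, 8.697]
y = [27.698, 57.634, 47.172, 49.295, 24.115, 33.612, 9.512, 14.755, 10.57, 15.394,
     27.843, 17.717, 20.253, 37.465, 101.334, 47.427, 35.944, 45.945, 46.89]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi ** 2 for xi in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
a = (sy - b * sx) / n                          # intercept

print(round(b, 3), round(a, 3))  # ~4.564 and ~7.196
print(a + b * 3.437)             # predicted Cost at Cattle = 3.437 (about 22.88)
```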
Assumptions Of Simple Regression
In theory, there are several important assumptions that must be satisfied if linear regression is to be used. These are:
1. Both the independent (X) and the dependent (Y) variables are measured at the interval or ratio level.
2. The relationship between the independent (X) and the dependent (Y) variables is linear.
3. Errors in prediction of the value of Y are distributed in a way that approaches the normal curve.
4. Errors in prediction of the value of Y are all independent of one another.
5. The distribution of the errors in prediction of the value of Y is constant regardless of the value of
X.
Implementing within SAS
Now, to do the same task in SAS, we need a data set on which to calculate the relationship between the variables.
To begin:
Open SAS, then select File > Open > Data.
Figure 4.2 shows how to access Data in SAS
Now select the data file on your computer that you want to analyze. After selecting the data, a window pops up as shown below:
Figure 4.3
After selecting the data, go to Graph > Scatter Plot.
Figure 4.4
Click 2D Scatter chart
Figure 4.5
Figure 4.6 shows Columns to assign different Task Roles
Drag Cattle into Horizontal and Cost into Vertical, then click Run.
Figure 4.7 shows after selection of variables in their Task roles
Figure 4.8 shows Scatter Plot Graph
Now we need to find the relationship between X and Y using SAS. Select the Process Flow and then double-click the Market database.
Figure 4.9
Select Analyze > Regression > Linear Regression
Figure 4.10
Then insert Cost into the Dependent variable role and Cattle into the Explanatory variables role
Figure 4.11
Click Run. The output will contain several graphs, but we focus only on the one shown below.
Figure 4.12 shows relationship between Cattle and Cost.
Figure 4.13 shows the window after clicking Process Flow
In SAS, we can modify the output. Right-click on Linear Regression and select Modify Linear Regression
Figure 4.14
The Linear Regression window will pop up; here we want a name in the footer, so click Titles > Footnote
Figure 4.15
Click Default text, then type your name in place of "The SAS System" and click Run
Figure 4.16
Conclusion
After doing the analysis, first manually and then with the SAS software, we see that the output remains the same but the effort involved is vastly different. Using SAS, it is easy to get output that would otherwise take tedious hours of calculation. A further advantage of SAS is that you can make changes at any point in seconds, whereas by hand you would need to redo the complete calculation. In a nutshell: simple regression gives us the relationship between two variables, letting us predict one value when the other is known, and with SAS we get that output quickly and error-free.
Chapter 05
Correlation Coefficient
Fusion
Gaurav Anand
Maninder Kaur
Anil Khurana
Rizwan Maknojia
Bikramjit Singh
Page | 48
Definition
Correlation is one of the most common and most useful statistics. A correlation is a single number
that describes the degree of relationship between two variables: it attaches a mathematical value to
whether or not two numeric variables are related. It ranges from -1 to +1.
“+1” correlation indicates a perfect positive correlation, meaning that both variables move in the same direction together.
“-1” correlation indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down.
A “0” correlation indicates that there is no relationship between the variables.
In mathematical terms, the correlation is referred to as “r”. The strength of the relationship between
variables can be judged from the r value as shown in Table 5.1.
Value of r Strength of relationship
-1.0 to -0.5 OR 0.5 to 1.0 Strong
-0.5 to -0.3 OR 0.3 to 0.5 Moderate
-0.3 to -0.1 OR 0.1 to 0.3 Weak
-0.1 to 0.1 None or very weak
Table 5.1 – r value table
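For readers who want to see the formula behind r in action, here is a small Python sketch that computes r from its definition and labels it using the bands from Table 5.1. The helper names are our own, and the sample numbers are purely illustrative:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label an r value using the bands from Table 5.1."""
    a = abs(r)
    if a >= 0.5:
        return "Strong"
    if a >= 0.3:
        return "Moderate"
    if a >= 0.1:
        return "Weak"
    return "None or very weak"

# A perfectly linear pair of variables gives r = 1.0.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r, strength(r))
```

Doubling every x value exactly doubles y here, so r comes out as 1.0 and the relationship is labelled "Strong".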
Page | 49
Correlation Example
Let’s assume that we want to look at the relationship between two variables: the age of a student and
their marks. Perhaps we have a hypothesis that a student’s age affects their marks. We have
sample data for 10 students and their marks out of 50.
Age Marks
25 35
30 48
26 36
24 36
28 45
25 40
31 46
31 40
26 36
25 31
Table 5.2
Based on the data in Table 5.2, the calculated correlation value is r ≈ 0.76. This indicates a fairly
strong positive relationship between the age of the student and their mark: in this sample, older
students tended to receive higher marks. Keep in mind, however, that correlation describes association
only; it does not by itself show that being older causes higher marks.
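The value of r can be checked by recomputing it directly from the Table 5.2 data, for example in Python:

```python
import math

# Ages and marks exactly as printed in Table 5.2.
age = [25, 30, 26, 24, 28, 25, 31, 31, 26, 25]
marks = [35, 48, 36, 36, 45, 40, 46, 40, 36, 31]

n = len(age)
mx, my = sum(age) / n, sum(marks) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(age, marks))
sxx = sum((a - mx) ** 2 for a in age)
syy = sum((b - my) ** 2 for b in marks)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.763
```

Running the calculation on the printed data gives r ≈ 0.763, which falls in the "Strong" band of Table 5.1.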
Page | 50
Implementing Within SAS
Let’s understand how correlation can be used in SAS Enterprise Guide.
Open SAS Enterprise Guide 4.2. Open the Class data set from Library → SASHelp. The Class data set has
each student's Name, Sex, Age, Height and Weight. We will check whether there is any relationship between
the height of the students and their weight. On the menu bar at the top, click
Tasks → Multivariate → Correlations as shown in figure 5.1
Figure 5.1 – Path to open Correlation
Correlation window will pop up.
Page | 51
Figure 5.2 – Correlation Window
Select and drag Height under Analysis variables and Weight under Correlate with, then click Run
Figure 5.3 – Assigning variables for correlation
Page | 52
Figure 5.4 – Correlation output window
As you can see in the output in figure 5.4, “Correlation Analysis” at the top displays the variables
whose degree of relationship you want to check. Below that, under “Simple Statistics”,
it shows the Mean, Standard Deviation, Minimum and Maximum for both Height and Weight,
where N is the number of students in the class. These statistics are used to calculate the correlation
between the two variables. In the output displayed above, we can see that the correlation coefficient
is .877. Thus, we can say that there is a strong positive relationship between the height of a student and
their weight: as height increases, weight tends to increase as well.
Page | 53
Modifying Output
In SAS Enterprise Guide, we can modify the output in different ways. For example, suppose we want to
check the correlation between height and weight separately for males and females, and we also want a
scatter plot in the output.
In the process flow, right-click on Correlations → Modify Correlations
Figure 5.5 – Modify correlation path
Page | 54
The Correlations window will pop up; drag Sex under Group analysis by as shown in figure 5.6
Figure 5.6 – Assigning variables for group analysis
Click on Results and check the option Create a scatter plot for each correlation pair
Figure 5.7 – Result screen of correlation window
Page | 55
Click on Titles and edit the Analysis Titles and Footnote by un-checking “Use default text”. Click Run and
click Yes to override the results from the previous run.
Figure 5.8 – Titles & footnotes
The output shown in Figures 5.9 and 5.10 contains two separate results, one for males and one for
females, each with its scatter plot. The correlation values are
Males, r=.85
Females, r=.88
Thus, both males and females show a strong positive relationship between height and weight.
Page | 56
Figure 5.9 – Correlation output window
Page | 57
Figure 5.10 – Correlation output window
Page | 58
Multiple Correlations
We can also run multiple correlations at the same time. For example, we will now check the relationships
between "Height & Weight" and "Age & Height" from the Class data set. Height is the common
variable here, and we want to correlate Weight and Age with it. So we put Height under Analysis variables
and we put Age and Weight under Correlate with, because each variable in the Correlate with role is
correlated with each variable in the Analysis variables role.
To do multiple correlations
Right-click on Correlations → Modify
Correlations in the process flow and drag Age into the Correlate with field alongside Weight. Click Run and
click Yes to override the results from the previous run
Figure 5.11 – Correlation window
Page | 59
The output in Figure 5.12 displays the correlation between “Weight & Height” & “Age & Height” for
males & females separately.
Figure 5.12 – Correlation output
Page | 60
Note: The calculations we have done so far are based on simple correlation. SAS offers more
options for calculating correlation in different ways. Right-click on Correlations → Modify
Correlations in the process flow and click Options to see the different choices, as shown in Figure 5.13.
Figure 5.13 – Correlation options window
You can try different options to see what results they produce.
Page | 61
Chapter 06
Test of Significance
Gotcha
Luz Alvarez
Hasan Can
Michell Escutia
LiLi Xu
Sharon Yang
Page | 62
Tests of significance are statistical tests used to make claims or inferences about the population from which a sample has been drawn. To begin, a null hypothesis H0 and a confidence level must be determined for the given scenario. H0 represents the assumption being tested, either because it is believed to be true or because it is to be used as a basis for argument, but it has not been proven. The confidence level determines the confidence interval, the estimated range calculated from a given set of sample data. The common choices are 0.90, 0.95, and 0.99; these percentages correspond to the area of the normal curve being covered. The outcome of the test is either “reject H0” or “do not reject H0.”
There are different tests available, but we are going to look at the most common ones: t-Test, One-Way
ANOVA, Nonparametric One-Way ANOVA, Linear Models, and Mixed Models.
6.1 t-Test
Within the t-Test task there are three different types: Two Sample, Paired, and One Sample. We will walk
through each of them based on a given scenario. To implement the t-Test using SAS
Enterprise Guide 4.2, open the dataset named marathons.sas7bdat via File → Open → Data. Once the
dataset is open, we can access the t-Test task by clicking Analyze → ANOVA → t Test. (Figure
6.1)
Figure 6.1: Open a Task
Page | 63
We can also access this menu through Tasks → ANOVA → t Test. (Figure 6.2)
Figure 6.2: Open a Task
t-Test Two Sample
This test is used to evaluate whether or not two independent samples are representative of
the same population. It is assumed that each sample is normally distributed and that the variances are
equal. For instance, suppose you want to compare marathon running times in New
York and Boston. A random sample of 50 observations from the Boston marathon and 100 observations
from the New York marathon has been recorded and saved. The variables in the dataset are City
and Time (in hours).
In the new window, click t-Test types; you will find three different types of t-Test. Select Two Sample. (Figure
6.3)
Page | 64
Figure 6.3: Select t-Test type
Then click Data to assign the variables: one to classify the rows into groups and one to analyze.
Click the variable City and drag it to
Classification variables, then click the variable Time and drag it to Analysis variables. (Figure 6.4)
Figure 6.4: Select variables
Page | 65
Click Analysis on the left menu. Specify the test value for the null hypothesis H0 and the confidence level.
Set H0 to 0 because the null hypothesis is that the difference between the two population means is 0.
Then set the confidence level to 95%. (Figure 6.5)
Figure 6.5: Set Null Hypothesis and Confidence Level
Click Plots and select the type of plots you need to display in the report. (Figure 6.6)
Figure 6.6: Select plots
Page | 66
After customizing the titles, click Run. (Figure 6.7)
Figure 6.7: Customize titles
The t-Test results are shown below. To decide whether or not to reject the null hypothesis, we can
use either the Pooled method (for equal variances) or the Satterthwaite method (for unequal variances).
The column labeled t Value contains the t-test statistic, the column labeled DF contains the
degrees of freedom, and the column labeled Pr > |t| contains the P-value to be
interpreted. Since we assumed the two samples have equal variances, we use the Pooled P-value
as the indicator, which is < 0.0001. With the 95% confidence level we chose, the significance level
is 1 – 0.95 = 0.05. The P-value of < 0.0001 is smaller than 0.05, so
we can reject the null hypothesis. (Figure 6.8)
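The same two-sample comparison can be sketched outside SAS. Here is a rough Python equivalent using scipy; the running times below are illustrative stand-ins, not the marathons dataset:

```python
from scipy import stats

# Illustrative running times (hours); the marathon dataset itself
# is not reproduced here.
boston = [3.9, 4.1, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]
new_york = [4.6, 4.8, 4.7, 4.9, 4.5, 4.7, 4.8, 4.6]

# Pooled test, assuming equal variances (the "Pooled" row in SAS output).
t_pooled, p_pooled = stats.ttest_ind(boston, new_york, equal_var=True)

# Satterthwaite-style test for unequal variances.
t_satt, p_satt = stats.ttest_ind(boston, new_york, equal_var=False)

alpha = 0.05  # 1 - 0.95 confidence level
print(p_pooled < alpha)  # True here: reject H0, the mean times differ
```

With these clearly separated samples both P-values fall far below 0.05, so the null hypothesis of equal means is rejected, mirroring the decision rule described above.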
Page | 67
Figure 6.8: t-Test Two Sample Results
t-Test Paired
This test checks whether or not two matched samples are representative of the same population.
Open the dataset named bloodpressure.sas7bdat in order to examine the effectiveness of a medication
in reducing blood pressure. A random sample of individuals with high blood pressure is taken and their
diastolic pressure is recorded. The individuals are then placed on medication and one month later their
diastolic blood pressure is once again recorded. The dataset contains the following variables: subject,
age, baseline blood pressure, and new blood pressure.
In the t-Test window, select Paired. (Figure 6.9)
Page | 68
Figure 6.9: Select t-Test type
Click Data, and then assign the variables of Baseline BP and New BP to Paired Variables. (Figure 6.10)
Figure 6.10: Select variables
Page | 69
After customizing the titles, click Run. (Figure 6.11)
Figure 6.11: t-Test Paired Results
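A rough Python equivalent of the paired test is sketched below; the before and after blood pressures are hypothetical, not the values from the bloodpressure dataset:

```python
from scipy import stats

# Hypothetical diastolic pressures before and one month after medication.
baseline_bp = [150, 142, 138, 160, 155, 147, 152, 158]
new_bp = [140, 135, 130, 150, 148, 139, 143, 149]

# Paired t-test: H0 is that the mean difference is zero.
t_stat, p_value = stats.ttest_rel(baseline_bp, new_bp)
print(p_value < 0.05)  # True here: the medication changed blood pressure
```

Each subject's pressure drops consistently in this made-up sample, so the P-value is far below 0.05 and H0 is rejected.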
t-Test One Sample
This test determines whether a sample is representative of a population with a specified mean. Let’s
use the same dataset, bloodpressure.sas7bdat, as in the paired example. Under t-Test type, select One
Sample. (Figure 6.12)
Figure 6.12: Choose t-Test type
Page | 70
Under Data, click Age and drag it to Analysis Variables. (Figure 6.13)
Figure 6.13: Select variables
After customizing the titles, click Run. See the results below. (Figure 6.14)
Figure 6.14: Results
Page | 71
6.2 One-Way ANOVA
The One-Way ANOVA (analysis of variance) test is another way to test hypotheses. It is a procedure for
performing an analysis of variance by testing whether or not the means of two or more samples are equal.
Like the two-sample t-Test, it assumes all the samples are drawn from normally distributed populations
with equal variance. It is based on the fact that two independent estimates of the population
variance can be obtained from the sample data.
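As a cross-check outside SAS, a one-way ANOVA can be run in Python with scipy. The three groups below are hypothetical weights for three displacement classes, not the actual dataset:

```python
from scipy import stats

# Hypothetical weights for three displacement groups.
group_small = [2100, 2250, 2180, 2300, 2220]
group_medium = [2900, 3050, 2980, 3100, 3020]
group_large = [3800, 3950, 3880, 4000, 3920]

# H0: all three group means are equal.
f_stat, p_value = stats.f_oneway(group_small, group_medium, group_large)
print(p_value < 0.05)  # True here: the group means are not all equal
```

The between-group variation dwarfs the within-group variation in this sketch, so the F statistic is large and H0 is rejected.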
Select Analyze → ANOVA → One-Way ANOVA. (Figure 6.15)
Figure 6.15: Open a Task
Click Data and select the dependent and independent variables. In this case, Weight is the Dependent
variable and Displacement is the Independent variable. (Figure 6.16)
Page | 72
Figure 6.16: Select variables
Click Tests and select the tests for equal variance. (Figure 6.17)
Figure 6.17: Tests
Page | 73
Click Means → Comparison, and then select the comparison method and confidence level you want to use. We
stick with the 95% confidence level. (Figure 6.18)
Figure 6.18: Comparison
Click Breakdown and select the statistics for qualitative variables that you want in the report. (Figure
6.19)
Figure 6.19: Breakdown
Page | 74
Click Plots and select between the two types (Box and Whisker or Means) that you want to display in
your result. (Figure 6.20)
Figure 6.20: Plots
Customize your titles and click Run. See the results below. (Figure 6.21)
Figure 6.21: Results
Page | 75
6.3 Nonparametric One-Way ANOVA
This type of test allows you to implement nonparametric tests for location and scale when you have a
continuous dependent variable and a single independent variable.
In statistical inference, or hypothesis testing, the traditional tests are called parametric tests
because they depend on the specification of a probability distribution except for a set of free
parameters. Parametric tests are said to depend on distributional assumptions; nonparametric tests
do not require them.
Nonparametric methods are often almost as powerful as parametric methods, even when the data are
normally distributed.
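SAS's nonparametric one-way task includes the Kruskal-Wallis test, which compares groups using ranks rather than raw values. A minimal Python sketch, with the same hypothetical groups as before:

```python
from scipy import stats

# Hypothetical data for three groups (no distributional assumption needed).
group_a = [2100, 2250, 2180, 2300, 2220]
group_b = [2900, 3050, 2980, 3100, 3020]
group_c = [3800, 3950, 3880, 4000, 3920]

# Kruskal-Wallis H test: H0 is that all groups share the same distribution.
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(p_value < 0.05)  # True here: the groups differ
```

Because the test uses only ranks, it reaches the same reject decision as the parametric ANOVA on this data without assuming normality.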
Select Analyze → ANOVA → Nonparametric One-Way ANOVA. (Figure 6.22)
Figure 6.22: Open a Task
Click Data and select the dependent and independent variables. (Figure 6.23)
Page | 76
Figure 6.23: Select variables
Click Analysis and select the test scores you want in your results. (Figure 6.24)
Figure 6.24: Analysis Tests
Page | 77
Then click on Extract p-values. (Figure 6.25)
Figure 6.25: Extract p-values
Customize your titles and click Run. See the results. (Figure 6.26)
Figure 6.26: Results
Page | 78
6.4 Linear models
The Linear Models task is used to perform an analysis of variance when you have a continuous
dependent variable with classification variables, quantitative variables, or both.
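To see the kind of model this task fits, here is a minimal least-squares sketch in Python with one dummy-coded classification variable and one quantitative variable. The data is synthetic and noise-free, so the coefficients are recovered exactly; this illustrates the model form, not the SAS procedure itself:

```python
import numpy as np

# Hypothetical data: a continuous response, one classification variable
# (group A/B, dummy-coded as 0/1) and one quantitative variable x.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
# Response built noise-free as 5 + 3*x + 10*group, so the fit is exact.
y = 5 + 3 * x + 10 * group

# Design matrix: intercept, quantitative effect, group effect.
X = np.column_stack([np.ones_like(x), x, group])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [5, 3, 10]
```

The fitted coefficients separate the intercept, the slope on the quantitative variable, and the shift attributable to the classification variable, which is exactly the decomposition a linear-models analysis reports.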
Select Analyze → ANOVA → Linear Models. (Figure 6.27)
Figure 6.27: Open a Task
Click Data and select the dependent variable. (Figure 6.28)
Figure 6.28: Select variables
Page | 79
Click Model Options and select the hypothesis test options that you want in your result. (Figure 6.29)
Figure 6.29: Select model options
Customize your titles and click Run. See the results. (Figure 6.30)
Figure 6.30: Linear Model Results
Page | 80
6.5 Mixed models
The Mixed Models task is used to provide facilities for fitting a number of basic mixed models. These
models enable you to handle both fixed effects and random effects in a linear model for a continuous
response. Numerous experimental contrives produce data for which coalesced models are appropriate.
Select Analyze → ANOVA → Mixed Models. (Figure 6.31)
Figure 6.31: Open a Task
Click Data and select the dependent variable and the quantitative variables you want to analyze. (Figure
6.32)
Page | 81
Figure 6.32: Select variables
Customize your titles and click Run. See the results as shown below. (Figure 6.33)
Figure 6.33: Mixed Model Results
Page | 82
Chapter 07
Limits (Confidence / Prediction)
Dean Squad
Eric Plaskacz
Christina Mofid
Edison Nguyen
Marissa Shaver
Alexandra Wackett
Page | 83
Confidence Limits
Definition
A confidence interval is the likely range of the true value; since there is only one true value, the
confidence interval defines a range where it is likely to lie. Most often, confidence intervals are
constructed at the 95% level and are called 95% confidence intervals. This means that, on average,
95% of such ranges will capture the true population mean, while 5% of them, on average, will not.
Confidence intervals are used because it might not be possible to measure everyone in a given
population, simply because of a lack of resources. By using confidence intervals, however, it is possible
to use a sample of the population to calculate a range within which the population mean is likely to fall.
Confidence Limits – Confidence limits are the upper and lower boundaries of the interval.
Width of Confidence Intervals
Confidence intervals give us a range with upper and lower boundaries. If the interval is narrow, meaning
a small difference between the upper and lower boundaries, then the study was likely large and the
estimate of the true value is precise. If the confidence interval is wide, then the study was most likely
small and the estimate of the true value is imprecise.
Prediction Intervals
Definition
A prediction interval is a range that tells you where you can expect to see future observations. These
intervals are useful for determining what future values should be, based on present or past data. They
are useful because they predict future data points before the information is even collected,
as opposed to having to wait to collect it. Since there is extra uncertainty about what any single future
observation will be, the prediction interval is always wider than the confidence interval.
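The two intervals can be compared with a short Python sketch; the sample values are hypothetical, not the beer data:

```python
import math
from scipy import stats

# Hypothetical sample (e.g. monthly sales figures).
sample = [170, 182, 195, 188, 176, 190, 184, 179, 192, 186]
n = len(sample)
mean = sum(sample) / n
sd = stats.tstd(sample)                  # sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)    # 95% two-sided critical value

# 95% confidence interval half-width for the population mean.
ci_half = t_crit * sd / math.sqrt(n)
# 95% prediction interval half-width for a single future observation.
pi_half = t_crit * sd * math.sqrt(1 + 1 / n)

print(f"CI: {mean - ci_half:.1f} to {mean + ci_half:.1f}")
print(f"PI: {mean - pi_half:.1f} to {mean + pi_half:.1f}")
```

Because of the extra sqrt(1 + 1/n) factor, the prediction interval always comes out wider than the confidence interval, no matter what sample is used.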
Page | 84
Example in SAS EG
Beer Sales Data
The data set shows monthly sales of beer in hectoliters. The average high and low temperatures within the region are also recorded over five years.
The data shows the trend of beer sales and its relationship with temperature.
This chapter will focus on computing confidence and prediction intervals as well as interpreting the associated output.
How to Complete in SAS EG
Open the dataset: beer_sales.sas,
As you can see from the raw data, an increase in temperature is strongly and positively correlated with beer sales.
If we make a simple line plot before we start computing confidence intervals, it will give us a better sense of the information we are looking at.
Click Task → Graph → Line Plot.
After selecting the first line plot, add High Temp to the horizontal axis (independent variable) and Sales to the vertical axis (dependent variable).
Make sure to change the appropriate titles and footnotes in the Properties tab. Click Run.
Figure 7.1
Page | 85
The results:
Therefore, as temperature increases, beer sales increase as well.
Confidence Limits
Confidence limits represent the high and low values of the range.
To compute confidence limits, click Task → Describe → Distribution Analysis.
To compute confidence limits on sales, drag Sales to the task role pane under Analysis variables.
Click Run.
Figure 7.2
Figure 7.3
Page | 86
Output Analysis (below)
From the output generated, the confidence limits provide a range around the sample mean of the data
(186.9): assuming a normal distribution, there is 95% confidence that the limits of 177.84 and 195.97
capture the population mean.
The probability of the interval missing the true mean by chance alone is 5%.
If more data on beer sales is collected, the width of the confidence interval is expected to decrease.
Limits on other variables, including month and temperature, can be computed by changing the
analysis variable accordingly.
Figure 7.4
Page | 87
Appendix
Team Contributions
A simple breakdown by each team, showing how the work was
distributed among themselves:
Team 1 – Numerical Summaries
Definition – Jaswant Seahra, Mriseal Sinha
Example – Baljeet Kaur, Jaswant Seahra, Theo Wolski
Implementing with SAS – Trystan Macdonald, Surbhi Surbhi
Documenting design process – Trystan Macdonald, Theo Wolski
Defining SAS results – Trystan Macdonald
Conclusion – Trystan Macdonald, Theo Wolski, Jaswant Seahra
Compilation – Theo Wolski, Surbhi Surbhi
Page | 88
Blueprint – Variation Within The Data
Definition – Paramjeet Kaur, Christopher Atkinson
Example – Fredric Ayih, Danusha Fernando
Designing within SAS – Danusha Fernando , Christopher Atkinson
Documenting design process – Gauvtam Bajaaj, Frederic Ayih
Defining SAS results – Fredric Ayih , Paramjeet Kaur, Gauvtam Bajaaj
Conclusion – Christopher Atkinson , Gauvtam Bajaaj
Compilation – Danusha Fernando , Paramjeet Kaur
Spice Girls – Confidence Intervals
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone
Page | 89
Sukhoi – Simple Regression
Definition – Kalpesh Patel, Ishan Sangrai, Sheleena Jaria
Example – Kalpesh Patel, Ishan Sangrai, Sheleena Jaria
Implementing within SAS – Amit Bansal, Pranay Sankhe
Documenting design process – Amit Bansal, Pranay Sankhe
Defining SAS results – Amit Bansal, Pranay Sankhe
Conclusion – Amit Bansal, Ishan Sangrai
Compilation – Amit Bansal, Ishan Sangrai
Fusion – Correlation Coefficient (r)
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone
Page | 90
Gotcha – Test Of Significance
Definition – Hasan Can, Michell Escutia, Lily Xu
Example – Luz Alvarez
Implementing within SAS – Luz Alvarez
Documenting design process – Luz Alvarez, Hasan Can, Michell Escutia
Defining SAS results – Luz Alvarez, Lily Xu, Sharon Yang
Conclusion – Lily Xu, Sharon Yang
Compilation – Luz Alvarez, Lily Xu, Sharon Yang, Hasan Can, Michell Escutia
Dean Squad – Limits (Confidence / Prediction)
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone