b409 w11 sas collaborative stats guide v4.2
TRANSCRIPT
Table of Contents

Numerical Summaries
Variation Within The Data
Confidence Intervals
Simple Regression
Correlation Coefficient
Test of Significance
Limits (Confidence / Prediction)
Appendix
Numerical Summaries
Team 1
Baljeet Kaur
Trystan McDonald
Jaswant Seahra
Mriseal Sinha
Surbhi Surbhi
Theo Wolski
Chapter 01
Introduction
Collecting, processing and transforming data are skills that are widely sought after in today's business world. In order to make effective business decisions you must possess the skills necessary to analyse,
manipulate and present findings derived from the mining of raw data.
Data can be produced in numerical and non-numerical forms. When deducing the significance of data, it
is advantageous to provide context to the process; knowing where (location) and how your data fits
(dispersion) into your query can provide valuable insight into your department’s current and future
campaigns.
Numerical summaries that present data by location include the data's mean, mode, and median.
Data presented by dispersion is described by its range and standard deviation.
www.palgrave.com/business/taylor/taylor1/lecturers/
Numerical Summaries
Definition: A set of numeric data summarized and described by two kinds of parameters: a measure of centrality and a measure of spread.
Measure of centrality: data described by its mean, median and mode.
Measure of spread: data described by its range, interquartile range and standard deviation.
Mean: The arithmetic average of all data
Median: The middle value of ordered data. Data must be ordinal or interval.
Mode: The most commonly occurring value in a data set.
Terms and Concepts
Mean: The arithmetic average of all data points.
Mean = Σx / n, where Σx is the sum of all data points and n is the number of data points.
Example - 3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
The sum of these numbers is 330
There are fifteen numbers.
Mean = 330 / 15 = 22
Median: The centre value of ordinal or interval data, arranged in ascending order.
3,5,7,12,13,14,20,23,23,23,23,29,39,40,56
There are 15 values, an odd number, so the median is the single middle value: the 8th of the 15 ordered values.
Median = 23
Mode: The most commonly occurring value in a data set.
3,5,7,12,13,14,20,23,23,23,23,29,39,40,56
23 is the Mode because it is repeated 4 times.
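As a quick cross-check of the three measures above (the guide itself works in SAS; this is just an illustrative sketch in Python):

```python
from statistics import mean, median, mode

# The data set used throughout this section
data = [3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]

print(mean(data))    # arithmetic average: 330 / 15 = 22
print(median(data))  # middle (8th) value of the 15 sorted values: 23
print(mode(data))    # most frequent value: 23 (it appears four times)
```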
Range: Largest value - smallest value.
Example: 2, 6, 2, 4, 1, 4, 3, 1, 1
6 - 1 = 5
Quartile Range: The spread of the ordered data around its centre, measured between the lower and upper quartiles.
Example:
1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57
Lower quartile = the value at the centre of the first half of the data, or Quartile 1 (Q1)
The median of 1, 11, 15, 19, 20, 24
(third + fourth observations) ÷ 2
(15 + 19) ÷ 2=17
Upper quartile = the value at the centre of the second half of the data, or Quartile 3 (Q3)
The median of 28, 34, 37, 47, 50, 57
(third + fourth observations) ÷ 2
(37 + 47) ÷ 2 = 42
The interquartile range is Q3 - Q1
42 - 17 = 25
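The quartile calculation above (the median of each half of the ordered data) can be sketched the same way. Note that several quartile conventions exist; this sketch follows the method used in the example:

```python
from statistics import median

data = [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57]  # already ordered

half = len(data) // 2
q1 = median(data[:half])   # median of the lower half: (15 + 19) / 2
q3 = median(data[half:])   # median of the upper half: (37 + 47) / 2
iqr = q3 - q1

print(q1, q3, iqr)  # 17.0 42.0 25.0
```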
Standard Deviation: Also known as 'root mean square deviation', it is calculated by squaring the deviations from the mean, averaging the squared deviations, and then taking the square root of that average.
s = √( Σ(x − x̄)² / (n − 1) ), that is, the square root of the variance (using n − 1 for a sample).
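The standard deviation of the data set from the mean example can be checked the same way (using the n − 1 sample form; an illustrative sketch, not SAS):

```python
from math import sqrt
from statistics import stdev

data = [3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]

m = sum(data) / len(data)             # mean = 22
ss = sum((x - m) ** 2 for x in data)  # sum of squared deviations from the mean
s = sqrt(ss / (len(data) - 1))        # sample standard deviation

print(round(s, 3))            # 14.506
print(round(stdev(data), 3))  # the library routine agrees: 14.506
```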
http://www.mathsisfun.com/median.html
www.palgrave.com/business/taylor
Example: http://hubpages.com/hub/Musical-Terms
Numeric summaries are to a mathematician what sheet music is to a musician. As we know, numerical
summaries include equations of mean, median, range, quartile range and standard deviation; each of
these equations allows for the input of data for the purpose of analysis. Without the numeric summaries
and equations for these terms one would not be able to determine the desired findings, much like
without sheet music a musician would not be able to play his or her instrument.
Just as sheet music is the language that musicians speak, numerical summaries are the language of statisticians.
A bar line is used for separating musical notes into areas that are manageable, allowing the musician to
read where the tempo and notes are going within the song. In statistics, for example, if someone asks
you to measure 46 points of data within a particular data set without identifying the centre point (mean
or median) you cannot effectively measure the data. Like in the use of sheet music, following the
building blocks of a process is an essential first step in determining the outcome of the data or song
being played.
[Diagram: a five-line musical staff, with bar lines and a double bar line; the span from one bar line to the next is called a measure. In the analogy, the mean corresponds to the staff (the reference point for its four spaces), and the standard deviation corresponds to the distance from one bar line to another.]
Implementing with SAS
In this tutorial we'll walk through the Heights data set. It's a rather simple data set, but effective for showing numerical summaries.
We have three columns, shown above: Family, Gender, and Height. Although you can do many things with numerical summaries in SAS by relating variables to one another, we will just be working with the Height variable. Click Tasks > Describe > Summary Statistics (shown below).
You'll notice a window pop up. This is the main interface you will use to run numerical summaries. You can do a lot of neat things with this SAS task, but for this exercise we will just be using the variable Height. Click it and drag it over to the right-hand side under Analysis variables.
On the left-hand side of the window you will see a tab labelled "Statistics"; click on it. The window will change to list all the different numerical summaries the data can be run through. Click the ones you need for your research. In this example we used: Mean, Standard Deviation, Minimum, Maximum, and Number of Observations.
If you wish for a more visual element to your research, click plots and pick a graph design.
When you’re done click RUN at the bottom of the window.
After clicking Run, SAS will process the data and show the results you requested: at the top are the mean, standard deviation, variance, minimum, maximum, range, and the total number of values used to produce this information. In this tutorial we can see that the average (mean) of the Height values equals 66.83, with a standard deviation of 2.72, a variance of 7.4, and a range of 9.
Conclusion
With this application of SAS you can make statistical observations and support decisions, depending on the marketing questions about numerical summaries you need to answer in your career.
Chapter 02
Variation Within The Data
Blueprint
Christopher Atkinson
Fredric Ayih
Gauvtam Bajaaj
Danusha Fernando
Paramjeet Kaur
Introduction
Variation is seen in every part of our day to day lives, from our home to the workplace to anything in
which we can observe a difference. On a daily basis, you see cars of different brands, models, colors and
sizes. The very differences in these observations illustrate variance. When looking at a dataset of all
Toyota cars for example, one can observe that they come in different prices, sizes and features such as
engine size, horsepower and number of cylinders. These differences within a dataset illustrate the
concept of variation within data.
What Is Variation?
Data variation measures the spread of data around the mean. It shows the differences in the variables
which may be quantitative as well as qualitative. We may have two sets of data with varying input
values but similar means. Here variations may be observed in terms of the number of variable inputs,
range of data, dispersion of the data etc. In order to measure the amount of variability between the
data sets we use statistical tools such as variance and standard deviation. Variance measures the
difference between each variable and the mean, squared to remove the sign effects. The standard
deviation is the square root of the variance which brings the measure back to scale. Together with
mean, standard deviation gives a first level indication of the characteristics of any set of numbers.
Standard Deviation indicates the degree to which the values are clustered around the mean. A large
amount of dispersion explains how far results are from the expected level of mean. Thus, the variations
within the data are measured in a quantitative manner. Pictorial representation of variations within data
can be shown using bars and charts.
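To make the idea concrete before moving to the CARS data, here is a small sketch with invented numbers (hypothetical, not from any data set in this guide): two data sets share the same mean but differ sharply in spread, which is exactly what the standard deviation captures.

```python
from statistics import mean, pstdev

# Two hypothetical data sets with identical means but different variation
steady = [48, 49, 50, 51, 52]
volatile = [20, 35, 50, 65, 80]

print(mean(steady), mean(volatile))  # both 50: identical centres
print(pstdev(steady))                # small: values cluster tightly around the mean
print(pstdev(volatile))              # large: values spread far from the mean
```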
What causes variation within the data?
It becomes necessary to find out whether the variation within the data is a regular event or a random event, so that the results attained do not come as a surprise to us. There are common causes, such as process inputs and conditions, that contribute to regular everyday variation. For example, a known 3% rate of errors in the data still allows statisticians to forecast the temperature within a desired range. On the other hand, there may be special causes, such as the random occurrence of a temporary event, which create variation within the existing data and make it difficult to work with. For example, a sudden north-east wind may cause a sudden drop in temperatures, making the temperature difficult to predict.
Process Flow for Implementation in SAS
To further understand the concept of variance, we will be exploring and analyzing the CARS dataset,
which contains a variety of variables such as origin, type, horsepower, number of cylinders and retail
price on vehicles sold by dealer. We will start by opening the dataset and familiarizing ourselves with the
data and the variables. Following this, we will create several reports to describe the data, identify trends
and explain variance within the dataset by using both numerical and categorical variables. You will also
be given an opportunity to filter the data in order to focus on a smaller set of variables to run reports
from.
Creating a Simple Bar Chart: Open the Cars data table by selecting Servers > Libraries > SASHELP from the Server List. Navigate to the CARS data set and select it. Click Open. On the menu bar click Tasks and then select Graph to open Bar Chart. The Bar Chart window has five pages: Bar Chart, Data, Appearance, Titles and Properties. In the Bar Chart page, click the Simple Vertical Bar (Figure 2.1).
Figure 2.1
To produce a report to identify the frequency in each category of variable Type, click the variable Type
and drag it to the Column to chart role (Figure 2.2).
Figure 2.2
Click Run to run the task and produce report. To make changes to the title, click Modify Task and give an
appropriate name to the Title of graph (Figure 2.3).
Figure 2.3
Rerun the task by clicking the Run button.
Figure 2.4
The resulting graph (Figure 2.4) shows the number of cars in the database by type. There are more sedans than any other type of car, but there are also some SUVs, sports cars and trucks. Note that the number of
cars in each type changes as you look at a different type. This illustrates the concept of variance, when it
comes to frequency.
Creating a Scatter Plot:
To generate a scatter plot, return to the Cars data set and click Tasks, then select Graph to open Scatter Plot. Select the simple two-dimensional scatter plot in the Scatter Plot page (Figure 2.5).
Figure 2.5
Click Data in the selection pane to assign columns. Drag Horsepower to the Horizontal task role, followed by MSRP on the Vertical task role (Figure 2.6). Rename the titles and click Run.
Figure 2.6
Figure 2.7
This scatter plot (Figure 2.7) displays horsepower against the manufacturer suggested retail price (MSRP). Most cars have horsepower between 100 and 300 and are priced below $50,000. Due to the variance in the data, you can observe that certain cars have horsepower values as high as 500 and some cars are priced closer to $200,000. The scatter plot lets you visualize variation by assigning a point to every observation, based on two measurable variables.
Creating a Tile Chart:
Click Tasks and then Graph to open the Tile Chart. For this report, click the variable Type and drag it to
Classification variable under column roles, drag variable Invoice to the Color analysis and drag variable
Horsepower to Size Analysis variable (Figure 2.8).
Figure 2.8
Click Titles in the list of options in the selection pane and click Graph. From the drop down arrow under
Tile Layout click Flow layout. In the Title page of the Tile window give an appropriate name to the chart.
Click Run.
Figure 2.9
In this chart (Figure 2.9) variance in the data set is expressed through numerical and categorical variables (Type, Invoice and Horsepower). The cars in the database are arranged into boxes based on their type,
and the sizes of the boxes are determined by the total horsepower in each type. Note that sedans do
not have the highest horsepower per car, but because the database contains a lot more sedans than any
other type of car (see frequency by vehicle chart), the total horsepower of sedans is higher than any
other type of car. This is why the sedan box is the largest, and the hybrid box is the smallest. Lastly, the
variance in Total invoice is illustrated by the color of the box. Note that Sedan is in a darker green not
because they are more expensive, but because there are more sedans than any other car type; hence
the Total invoice for sedans is much higher.
Filtering Data:
To filter the Cars data table, refer back to the process flow and click the Tasks tab on the menu bar and
select Data to open Filter and Sort. Click and drag all the variables in the selected pane. To filter the
data, click the Filter tab. The filter page contains four empty boxes. Click the down-arrow on the first
box and select Type as variable; in the second box select the criteria as Equal to from the drop-down list,
in the third box click the ellipsis button and select the value as Sports and click OK.
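The Filter and Sort task above simply keeps the rows whose Type equals Sports. As a language-agnostic sketch of the same operation (the miniature table below is hypothetical, not the real CARS data):

```python
# A hypothetical miniature of the CARS table (the real data lives in SASHELP.CARS)
cars = [
    {"Model": "A", "Type": "Sedan",  "MSRP": 30000},
    {"Model": "B", "Type": "Sports", "MSRP": 90000},
    {"Model": "C", "Type": "SUV",    "MSRP": 45000},
    {"Model": "D", "Type": "Sports", "MSRP": 120000},
]

# Equivalent of the Filter tab: keep rows where Type is equal to "Sports"
sports = [row for row in cars if row["Type"] == "Sports"]

print([row["Model"] for row in sports])  # ['B', 'D']
```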
Creating a Stacked Vertical Bar chart from the Filtered Data:
To generate a Stacked Vertical Bar, click the Tasks tab on the menu bar and select Graph to open the
Bar Chart window. In the Bar Chart page click the Stacked Vertical Bar. In the Data page drag the
variable MSRP to Column to Chart and Origin as Stack. Give an appropriate name to the graph and click
Run.
Figure 2.10
Figure 2.10 displays variance within the data on three levels: The manufacturer suggested retail price,
the number of cars or frequency, and the origin of the car. Note that Europe is the only location where
the number of cars at the $90,000 price point is higher than other price points. The bulk of cars
manufactured in Asia are at the $30,000 price point and a little more than half of USA manufactured
cars are at the same retail price. The fact that Europe produces the majority of cars above $90,000 can
indicate their focus on higher end vehicles.
To generate and view a stacked vertical bar with a different variable, click the Tasks tab on the menu bar and select Graph to open the Bar Chart window. In the Bar Chart page click the Stacked Vertical Bar. In the
the Data page drag the variable MSRP to the column to chart and variable Cylinder to stack and Run the
report.
Figure 2.11
The above chart (Figure 2.11) displays variance within the data on three levels: The manufacturer
suggested retail price, the number of cars or frequency, and the number of cylinders. Note that the
origin variable has been replaced by the cylinder variable. The heights of the bars have not changed, and the majority of cars are priced at $30,000. As price increases there are fewer cars with six and eight cylinders available. Cars of four cylinders or less are only available at prices below $30,000, and ten or
twelve cylinder cars are only available above the $90,000 price point. Note that this picture of variance
allows you to identify an outlier: the only car with a price of $180,000 has six cylinders.
Similarly, to generate a chart comparing the variables Engine Size and Cylinders, drag Engine Size to
column to chart and Cylinders to stack to produce a report of two other variables. Give an appropriate
name to the graph and RUN.
Figure 2.12
Figure 2.12 displays variance on three levels: the Engine size (L), the frequency and the cylinder sizes
within each engine size.
Based on what you have learnt thus far, read the following statements and indicate if they are (T) TRUE
or (F) FALSE.
1. The most common engine size is 3.0 [ ]
2. The most common cylinder size is 6 [ ]
3. There are more 8 cylinder cars with 4.2 engine sizes than there are at 5.4 [ ]
4. There are as many 12 cylinder cars as there are 10 cylinder cars [ ]
5. Across all cylinder sizes, the least common engine size is 7.8 [ ]
6. As you increase engine size, the number of cars with four cylinders increases [ ]
Conclusion
As demonstrated, SAS can sort out the variation within data against a specific set of objectives, whether from the perspective of a single department, such as marketing, or of the company as a whole. This allows management to project future strategies from historically available data and draw conclusions that support an overall analysis of the company in the long run.
From engine sizes and miles per gallon (city or highway) to manufacturer names, types and origins of vehicles, SAS provides a relatively easy way to calculate and visually verify the variation of data within different samples. Through charts and graphs, one can arrive at decisions that support strategies (e.g., increasing sales, cutting production of slow-selling vehicles, addressing under-achieving fuel economy) in a simple and comprehensive manner.
Answers for the exercise based on figure 2.12 : 1 – True 2 – True 3 – True 4 – True 5 – True 6 – False
Chapter 03
Confidence Intervals
Spice Girls
Alexandra Gonchar
Ellen Guimaraes
Ksenia Knyazeva
Ekaterina Loskutova
What is a Confidence Interval?
In statistics, a confidence interval is a kind of interval estimate of a population parameter. It is an observed interval, which differs from sample to sample, that typically contains the parameter of interest (such as the population mean) and quantifies how similar the results are likely to be if the experiment is repeated. The confidence level, or confidence coefficient, describes how frequently intervals constructed this way contain the parameter of interest. Because the confidence interval is calculated from a sample and contains the value of a data parameter with a specified probability, the end-points of the interval are called the confidence limits, and the specified probability is called the confidence level.
What is the purpose of a Confidence Interval?
In order to estimate the mean, the standard deviation, and the variance of a population, a random sample is taken from the larger population and a statistic is calculated. It is usually very important to assess the level of reliability of the results provided by the sample. This is where the Confidence Interval comes in.
The Confidence Interval provides a range in which one can be relatively certain that their specific data
mean is located. Therefore, as the name states, a Confidence Interval is used to calculate the
confidence that one can have in the result of a sample.
When are Confidence Intervals most commonly used?
A confidence interval does not say that the true value of the parameter of interest has a particular chance of lying in the specific interval computed from the data actually obtained; rather, the confidence level describes how often intervals constructed this way capture the true value. The Confidence Interval lets us estimate the true mean of a certain data set using the results of previous measurements (sample size, standard deviation, and confidence level). It is used to indicate the reliability of an estimate.
Examples where Confidence Intervals can be used:
– Governments looking to reliably predict population trends
– The likelihood of certain candidates being elected
– Reactions to certain new products
– Survey response rate reliability
– Predicting results based on previous research
Using a Confidence Interval
An example of how one can arrive at a Confidence Interval is the following:
Getting statistics from an entire population may be impossible, information may be correct but
outdated, and response rates on surveys may be very low. Because of this, researchers simplify the
statistical process by picking a sample of the population of interest, finding answers to their research
questions, and trying to estimate the reliability and precision of the results. This reliability estimate is
where using the Confidence Interval comes in.
For example, let's answer the following question: with 95% confidence, what is the average number of languages spoken by each student at George Brown?
We could ask every student at George Brown but that would be time consuming and some students may
not answer truthfully. Therefore, a convenient way to answer our question is by picking and analyzing a
sample that we can work with. This will help us to calculate the Confidence Interval which will be the
answer to our question. In this case, we will pick a reasonably large proportion of the students in the school, so that the results will be representative of the larger population (we will be using a representative class). Once we have chosen the sample, we need to estimate the reliability that the mean of the entire population will be contained in a certain range (the Confidence Interval).
Results:
Mean=2.6 languages per student
Standard deviation=1.836
(Intervals are calculated from the mean, standard deviation and the size of the sample)
By doing the Confidence Interval calculations we arrive at a conclusion: with 95% confidence, the mean number of languages is between 1.945 and 3.255.
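An interval like this follows from the usual large-sample formula x̄ ± z·s/√n. The chapter does not state the class size, so the sketch below assumes n = 30, a hypothetical value chosen because it reproduces the stated interval to within rounding:

```python
from math import sqrt

mean = 2.6   # sample mean (languages per student)
sd = 1.836   # sample standard deviation
n = 30       # assumed sample size (not stated in the chapter)
z = 1.96     # z-value for a 95% confidence level

margin = z * sd / sqrt(n)
lower, upper = mean - margin, mean + margin

print(round(lower, 3), round(upper, 3))  # close to the 1.945-3.255 reported above
```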
Applying Confidence Intervals to SAS
The Distribution Analysis task produces statistics describing the distribution of a single variable. The next example explores the distribution of the variable Height in the Volcanoes data set. In the Process Flow, click the Volcanoes data icon to make it active. Then select Tasks > Describe > Distribution Analysis.
In the Data tab choose the Height variable for analysis. Then in the Distributions tab click Normal.
In the Tables tab you can choose all the statistics you would like to explore. We are particularly
interested in Basic Confidence Intervals and Basic Measures (Mean, Standard Deviation, and Variance).
To measure confidence intervals we have to specify the confidence level in the drop-down box at the top right. You can choose among 90%, 95%, and 99%. After selecting, click Run.
The Resulting Report starts with basic statistic measures about the distribution of the variable: mean,
median, standard deviation, variance, and range. Another section of the report contains confidence
limits assuming normality. This table shows confidence intervals for main parameters (mean, standard
deviation, and variance) with 95% confidence level.
We can also build a plot to better evaluate the normality of the variable's distribution. Click Modify Task and, in the window that opens, click the Plots page. You can choose among different appearances; choose Histogram Plot.
Click Insert Page and choose the statistics you would like to include in the plot (for this example we took sample size, sample mean, and standard deviation). Choose the location of this information on the graph and click Run.
From the example we can see that the sample size is 32. The graph shows that the data is approximately normally distributed and that the mean of the Volcanoes' Height is 3113.563. With 95% confidence, the average volcano height (the mean) lies between 2481.3 and 3745.9.
Chapter 04
Simple Regression
Sukhoi
Amit Bansal
Sheleena Jaria
Kalpesh Patel
Ishan Sangrai
Pranay Sankhe
Introduction to Regression Analysis
In statistical terms, regression is the study of the relationship between variables, so that one may predict the unknown value of one variable from a known value of another variable.
According to the Oxford English Dictionary, the word 'regression' means "stepping back" or "returning to an average value". The term was first used in the 19th century by Sir Francis Galton, who found an interesting result by studying the heights of about 1,000 fathers and sons. His findings were that (i) sons of tall fathers tend to be tall and sons of short fathers tend to be short, but (ii) the mean height of the tall fathers was greater than the mean height of their sons, whereas the mean height of the short fathers' sons was greater than the mean height of the short fathers. Galton termed this tendency of all mankind to turn back toward the average height 'Regression towards Mediocrity', and the line that shows the trend was named the 'Regression Line'.
In the words of M.M. Blair, 'Regression is the measure of the average relationship between two or more variables.'
Regression analysis is used to:
– Predict the value of a dependent variable based on the value of at least one independent
variable.
– Explain the impact of changes in an independent variable on the dependent variable.
Dependent variable: the variable we wish to predict or explain.
Independent variable: the variable used to explain the dependent variable.
Regression Formula
To calculate the relation between X and Y we need an equation:
Regression Equation: Y = a + bX
where X and Y are the variables, b is the slope of the regression line, and a is the intercept of the regression line.
Slope (b) = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²)
Intercept (a) = (ΣY - b(ΣX)) / n
Figure 4.1 shows Simple Regression
As per Figure 4.1, the regression line shows the average relationship between two variables; it is also known as the Line of Best Fit. On the basis of the regression line, we can predict the value of the dependent variable for a given value of the independent variable. The regression line of Y on X gives the best estimate of Y for any given value of X.
Steps In Linear Regression
1. State the hypothesis.
2. State the null hypothesis
3. Gather the data.
4. Compute the regression equation.
5. Examine tests of statistical significance and measures of association.
6. Relate statistical findings to the hypothesis. Accept or reject the null hypothesis.
7. Reject, accept or revise the original hypothesis. Make suggestions for research design and
management aspects of the problem
Regression Example
To illustrate simple regression, let's take a simple example where X is Cattle and Y is Cost. The example shows the relationship between the two. First we need a data set.
Cattle (X) Cost(Y)
3.437 27.698
12.801 57.634
6.136 47.172
11.685 49.295
5.733 24.115
3.021 33.612
1.689 9.512
2.339 14.755
1.025 10.57
2.936 15.394
5.049 27.843
1.693 17.717
1.187 20.253
9.73 37.465
14.325 101.334
7.737 47.427
7.538 35.944
10.211 45.945
8.697 46.89
To find the regression equation, we first find the slope and the intercept, then use them to form the equation.
Step 1: Count the number of values: n = 19.
Step 2: For each pair, find XY, X², and Y².
Step 3: Find ΣX, ΣY, ΣXY, ΣX², and ΣY²:
ΣX = 116.969; ΣY = 670.575; ΣXY = 5570.426; ΣX² = 1036.087; ΣY² = 32134.66
Step 4: Substitute the values into the slope formula:
Slope (b) = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²)
= (19 × 5570.426 - 116.969 × 670.575) / (19 × 1036.087 - 116.969²)
≈ 4.564
Step 5: Now substitute the values into the intercept formula:
Intercept (a) = (ΣY - b(ΣX)) / n
= (670.575 - 4.564 × 116.969) / 19
≈ 7.196
Step 6: Then substitute these values into the regression equation:
Regression Equation: Y = a + bX
= 7.196 + 4.564X
Suppose we want to know the approximate Y value for X = 3.437. We can substitute the value into the equation:
Y = 7.196 + 4.564 × 3.437
= 7.196 + 15.687 = 22.883
The above example shows how to find the relationship between two variables by calculating the regression using the steps described.
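The arithmetic in Steps 1 to 6 can be cross-checked with a short sketch (illustrative only; the chapter's own tool is SAS):

```python
# Cattle (X) and Cost (Y) from the table above
x = [3.437, 12.801, 6.136, 11.685, 5.733, 3.021, 1.689, 2.339, 1.025, 2.936,
     5.049, 1.693, 1.187, 9.73, 14.325, 7.737, 7.538, 10.211, 8.697]
y = [27.698, 57.634, 47.172, 49.295, 24.115, 33.612, 9.512, 14.755, 10.57, 15.394,
     27.843, 17.717, 20.253, 37.465, 101.334, 47.427, 35.944, 45.945, 46.89]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi ** 2 for xi in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
a = (sy - b * sx) / n                          # intercept

print(round(b, 3), round(a, 3))  # ~4.564 and ~7.196
print(a + b * 3.437)             # predicted Cost at Cattle = 3.437 (about 22.88)
```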
Assumptions Of Simple Regression
In theory, there are several important assumptions that must be satisfied if linear regression is to be used. These are:
1. Both the independent (X) and the dependent (Y) variables are measured at the interval or ratio level.
2. The relationship between the independent (X) and the dependent (Y) variables is linear.
3. Errors in prediction of the value of Y are distributed in a way that approaches the normal curve.
4. Errors in prediction of the value of Y are all independent of one another.
5. The distribution of the errors in prediction of the value of Y is constant regardless of the value of
X.
Implementing within SAS
Now, to do the same task in SAS, we need a data set on which to calculate the relationship between the variables.
To begin:
Open SAS, then select File > Open > Data.
Figure 4.2 shows how to access Data in SAS
Now select the data file on your computer that you want to analyze. After selecting the data, a window pops up as shown below:
Figure 4.3
After selecting the data, go to Graph > Scatter Plot.
Figure 4.4
Click 2D Scatter chart
Figure 4.5
Figure 4.6 shows Columns to assign different Task Roles
Drag Cattle into Horizontal and Cost into Vertical, then click Run.
Figure 4.7 shows after selection of variables in their Task roles
Figure 4.8 shows Scatter Plot Graph
Now we need to find the relationship between X and Y using SAS. Select the Process Flow and then double-click the Market database.
Figure 4.9
Select Analyze > Regression > Linear Regression
Figure 4.10
Then insert Cost into the Dependent variable role and Cattle into the Explanatory variables role
Figure 4.11
Click Run. The output will contain several graphs, but we focus only on the one shown below.
Figure 4.12 shows relationship between Cattle and Cost.
Figure 4.13 shows the window after clicking Process Flow
In SAS, we can modify the output. Right-click on Linear Regression and select Modify Linear Regression
Figure 4.14
The Linear Regression window will pop up; here we want a name in the footer, so click Titles > Footnote
Figure 4.15
Click Default text, then type your name in place of "The SAS System" and click Run
Figure 4.16
Conclusion
After doing the analysis, first manually and then with the SAS software, we see that the output remains the same but the effort involved is vastly different. Using SAS, it is easy to get output that would otherwise take tedious hours of calculation. A further advantage of SAS is that you can make changes at any point in seconds, whereas by hand you would need to redo the complete calculation. In a nutshell: simple regression gives us the relationship between two variables, letting us predict one value when the other is known, and with SAS we get that output quickly and error-free.
Chapter 05
Correlation Coefficient
Fusion
Gaurav Anand
Maninder Kaur
Anil Khurana
Rizwan Maknojia
Bikramjit Singh
Page | 48
Definition
Correlation is one of the most common and most useful statistics. A correlation is a single number
that describes the degree of relationship between two variables: it attaches a mathematical value to
whether or not two numeric variables are related. It ranges from -1 to +1.
“+1” correlation indicates a perfect positive correlation, meaning that both variables move in the same direction together.
“-1” correlation indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down.
A “0” correlation indicates that there is no relationship between the variables.
In mathematical terms, the correlation is referred to as “r”. The strength of the relationship between
variables can be judged from the r value as shown in Table 5.1.
Value of r Strength of relationship
-1.0 to -0.5 OR 0.5 to 1.0 Strong
-0.5 to -0.3 OR 0.3 to 0.5 Moderate
-0.3 to -0.1 OR 0.1 to 0.3 Weak
-0.1 to 0.1 None or very weak
Table 5.1 – r value table
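For readers who want to see the formula behind r in action, here is a small Python sketch that computes r from its definition and labels it using the bands from Table 5.1. The helper names are our own, and the sample numbers are purely illustrative:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label an r value using the bands from Table 5.1."""
    a = abs(r)
    if a >= 0.5:
        return "Strong"
    if a >= 0.3:
        return "Moderate"
    if a >= 0.1:
        return "Weak"
    return "None or very weak"

# A perfectly linear pair of variables gives r = 1.0.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r, strength(r))
```

Doubling every x value exactly doubles y here, so r comes out as 1.0 and the relationship is labelled "Strong".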
Page | 49
Correlation Example
Let’s assume that we want to look at the relationship between two variables: the age of a student and
their marks. Perhaps we have a hypothesis that a student’s age affects their marks. We have
sample data for 10 students and their marks out of 50.
Age Marks
25 35
30 48
26 36
24 36
28 45
25 40
31 46
31 40
26 36
25 31
Table 5.2
Based on the data in Table 5.2, the calculated correlation value is r ≈ 0.76. This indicates a fairly
strong positive relationship between the age of the student and their mark: in this sample, older
students tended to receive higher marks. Keep in mind, however, that correlation describes association
only; it does not by itself show that being older causes higher marks.
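The value of r can be checked by recomputing it directly from the Table 5.2 data, for example in Python:

```python
import math

# Ages and marks exactly as printed in Table 5.2.
age = [25, 30, 26, 24, 28, 25, 31, 31, 26, 25]
marks = [35, 48, 36, 36, 45, 40, 46, 40, 36, 31]

n = len(age)
mx, my = sum(age) / n, sum(marks) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(age, marks))
sxx = sum((a - mx) ** 2 for a in age)
syy = sum((b - my) ** 2 for b in marks)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.763
```

Running the calculation on the printed data gives r ≈ 0.763, which falls in the "Strong" band of Table 5.1.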
Page | 50
Implementing Within SAS
Let’s understand how correlation can be used in SAS Enterprise Guide.
Open SAS Enterprise Guide 4.2. Open the Class data set from Library → SASHelp. The Class data set has
each student's Name, Sex, Age, Height and Weight. We will check whether there is any relationship between
the height of the students and their weight. On the menu bar at the top, click
Tasks → Multivariate → Correlations as shown in figure 5.1
Figure 5.1 – Path to open Correlation
Correlation window will pop up.
Page | 51
Figure 5.2 – Correlation Window
Select and drag Height under Analysis variables and Weight under Correlate with, then click Run
Figure 5.3 – Assigning variables for correlation
Page | 52
Figure 5.4 – Correlation output window
As you can see in the output in figure 5.4, “Correlation Analysis” at the top displays the variables
whose degree of relationship you want to check. Below that, under “Simple Statistics”,
it shows the Mean, Standard Deviation, Minimum and Maximum for both Height and Weight,
where N is the number of students in the class. These statistics are used to calculate the correlation
between the two variables. In the output displayed above, we can see that the correlation coefficient
is .877. Thus, we can say that there is a strong positive relationship between the height of a student and
their weight: as height increases, weight tends to increase as well.
Page | 53
Modifying Output
In SAS Enterprise Guide, we can modify the output in different ways. For example, suppose we want to
check the correlation between height and weight separately for males and females, and we also want a
scatter plot in the output.
In the process flow, right-click on Correlations → Modify Correlations
Figure 5.5 – Modify correlation path
Page | 54
The Correlations window will pop up; drag Sex under Group analysis by as shown in figure 5.6
Figure 5.6 – Assigning variables for group analysis
Click on Results and check the option Create a scatter plot for each correlation pair
Figure 5.7 – Result screen of correlation window
Page | 55
Click on Titles and edit the Analysis Titles and Footnote by un-checking “Use default text”. Click Run and
click Yes to override the results from the previous run.
Figure 5.8 – Titles & footnotes
The output shown in Figures 5.9 and 5.10 contains two separate results, one for males and one for
females, each with its scatter plot. The correlation values are
Males, r=.85
Females, r=.88
Thus, both males and females show a strong positive relationship between height and weight.
Page | 56
Figure 5.9 – Correlation output window
Page | 57
Figure 5.10 – Correlation output window
Page | 58
Multiple Correlations
We can also run multiple correlations at the same time. For example, we will now check the relationships
between "Height & Weight" and "Age & Height" from the Class data set. Height is the common
variable here, and we want to correlate Weight and Age with it. So we put Height under Analysis variables
and we put Age and Weight under Correlate with, because each variable in the Correlate with role is
correlated with each variable in the Analysis variables role.
To do multiple correlations
Right-click on Correlations → Modify
Correlations in the process flow and drag Age into the Correlate with field alongside Weight. Click Run and
click Yes to override the results from the previous run
Figure 5.11 – Correlation window
Page | 59
The output in Figure 5.12 displays the correlation between “Weight & Height” & “Age & Height” for
males & females separately.
Figure 5.12 – Correlation output
Page | 60
Note: The calculations we have done so far are based on simple correlation. SAS offers more
options for calculating correlation in different ways. Right-click on Correlations → Modify
Correlations in the process flow and click Options to see the different choices, as shown in Figure 5.13.
Figure 5.13 – Correlation options window
You can try different options to see what results they produce.
Page | 61
Chapter 06
Test of Significance
Gotcha
Luz Alvarez
Hasan Can
Michell Escutia
LiLi Xu
Sharon Yang
Page | 62
Tests of significance are statistical tests used to make claims or inferences about the population from which a sample has been drawn. To begin, a null hypothesis H0 and a confidence level must be determined for the given scenario. H0 represents the assumption being tested, either because it is believed to be true or because it is to be used as a basis for argument, but it has not been proven. The confidence level determines the confidence interval, the estimated range calculated from a given set of sample data. The common choices are 0.90, 0.95, and 0.99; these percentages correspond to the area of the normal curve being covered. The outcome of the test is either “reject H0” or “do not reject H0.”
There are different tests available, but we are going to look at the most common ones: t-Test, One-Way
ANOVA, Nonparametric One-Way ANOVA, Linear Models, and Mixed Models.
6.1 t-Test
Within the t-Test task there are three different types: Two Sample, Paired, and One Sample. We will walk
through each of them based on a given scenario. To implement the t-Test using SAS
Enterprise Guide 4.2, open the dataset named marathons.sas7bdat via File → Open → Data. Once the
dataset is open, we can access the t-Test task by clicking Analyze → ANOVA → t Test. (Figure
6.1)
Figure 6.1: Open a Task
Page | 63
We can also access this menu through Tasks → ANOVA → t Test. (Figure 6.2)
Figure 6.2: Open a Task
t-Test Two Sample
This test is used to evaluate whether or not two independent samples are representative of
the same population. It is assumed that each sample is normally distributed and that the variances are
equal. For instance, suppose you want to compare marathon running times in New
York and Boston. A random sample of 50 observations from the Boston marathon and 100 observations
from the New York marathon has been recorded and saved. The variables in the dataset are City
and Time (in hours).
In the new window, click t-Test types; you will find three different types of t-Test. Select Two Sample. (Figure
6.3)
Page | 64
Figure 6.3: Select t-Test type
Then click Data to assign the variables: one to classify the rows into groups and one to analyze.
Click the variable City and drag it to
Classification variables, then click the variable Time and drag it to Analysis variables. (Figure 6.4)
Figure 6.4: Select variables
Page | 65
Click Analysis on the left menu. Specify the test value for the null hypothesis H0 and the confidence level.
Set H0 to 0 because the null hypothesis is that the difference between the two population means is 0.
Then set the confidence level to 95%. (Figure 6.5)
Figure 6.5: Set Null Hypothesis and Confidence Level
Click Plots and select the type of plots you need to display in the report. (Figure 6.6)
Figure 6.6: Select plots
Page | 66
After customizing the titles, click Run. (Figure 6.7)
Figure 6.7: Customize titles
The t-Test results are shown below. To decide whether or not to reject the null hypothesis, we can
use either the Pooled method (for equal variances) or the Satterthwaite method (for unequal variances).
The column labeled t Value contains the t-test statistic, the column labeled DF contains the
degrees of freedom, and the column labeled Pr > |t| contains the P-value to be
interpreted. Since we assumed the two samples have equal variances, we use the Pooled P-value
as the indicator, which is < 0.0001. With the 95% confidence level we chose, the significance level
is 1 – 0.95 = 0.05. The P-value of < 0.0001 is smaller than 0.05, so
we can reject the null hypothesis. (Figure 6.8)
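The same two-sample comparison can be sketched outside SAS. Here is a rough Python equivalent using scipy; the running times below are illustrative stand-ins, not the marathons dataset:

```python
from scipy import stats

# Illustrative running times (hours); the marathon dataset itself
# is not reproduced here.
boston = [3.9, 4.1, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]
new_york = [4.6, 4.8, 4.7, 4.9, 4.5, 4.7, 4.8, 4.6]

# Pooled test, assuming equal variances (the "Pooled" row in SAS output).
t_pooled, p_pooled = stats.ttest_ind(boston, new_york, equal_var=True)

# Satterthwaite-style test for unequal variances.
t_satt, p_satt = stats.ttest_ind(boston, new_york, equal_var=False)

alpha = 0.05  # 1 - 0.95 confidence level
print(p_pooled < alpha)  # True here: reject H0, the mean times differ
```

With these clearly separated samples both P-values fall far below 0.05, so the null hypothesis of equal means is rejected, mirroring the decision rule described above.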
Page | 67
Figure 6.8: t-Test Two Sample Results
t-Test Paired
This test checks whether or not two matched samples are representative of the same population.
Open the dataset named bloodpressure.sas7bdat in order to examine the effectiveness of a medication
in reducing blood pressure. A random sample of individuals with high blood pressure is taken and their
diastolic pressure is recorded. The individuals are then placed on medication and one month later their
diastolic blood pressure is once again recorded. The dataset contains the following variables: subject,
age, baseline blood pressure, and new blood pressure.
In the t-Test window, select Paired. (Figure 6.9)
Page | 68
Figure 6.9: Select t-Test type
Click Data, and then assign the variables of Baseline BP and New BP to Paired Variables. (Figure 6.10)
Figure 6.10: Select variables
Page | 69
After customizing the titles, click Run. (Figure 6.11)
Figure 6.11: t-Test Paired Results
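A rough Python equivalent of the paired test is sketched below; the before and after blood pressures are hypothetical, not the values from the bloodpressure dataset:

```python
from scipy import stats

# Hypothetical diastolic pressures before and one month after medication.
baseline_bp = [150, 142, 138, 160, 155, 147, 152, 158]
new_bp = [140, 135, 130, 150, 148, 139, 143, 149]

# Paired t-test: H0 is that the mean difference is zero.
t_stat, p_value = stats.ttest_rel(baseline_bp, new_bp)
print(p_value < 0.05)  # True here: the medication changed blood pressure
```

Each subject's pressure drops consistently in this made-up sample, so the P-value is far below 0.05 and H0 is rejected.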
t-Test One Sample
This test determines whether a sample is representative of a population with a specified mean. Let’s
use the same dataset, bloodpressure.sas7bdat, as in the paired example. Under t-Test type, select One
Sample. (Figure 6.12)
Figure 6.12: Choose t-Test type
Page | 70
Under Data, click Age and drag it to Analysis Variables. (Figure 6.13)
Figure 6.13: Select variables
After customizing the titles, click Run. See the results below. (Figure 6.14)
Figure 6.14: Results
Page | 71
6.2 One-Way ANOVA
The One-Way ANOVA (analysis of variance) test is another way to test hypotheses. It is a procedure for
performing an analysis of variance by testing whether or not the means of two or more samples are equal.
Like the two-sample t-Test, it assumes all the samples are drawn from normally distributed populations
with equal variance. It is based on the fact that two independent estimates of the population
variance can be obtained from the sample data.
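As a cross-check outside SAS, a one-way ANOVA can be run in Python with scipy. The three groups below are hypothetical weights for three displacement classes, not the actual dataset:

```python
from scipy import stats

# Hypothetical weights for three displacement groups.
group_small = [2100, 2250, 2180, 2300, 2220]
group_medium = [2900, 3050, 2980, 3100, 3020]
group_large = [3800, 3950, 3880, 4000, 3920]

# H0: all three group means are equal.
f_stat, p_value = stats.f_oneway(group_small, group_medium, group_large)
print(p_value < 0.05)  # True here: the group means are not all equal
```

The between-group variation dwarfs the within-group variation in this sketch, so the F statistic is large and H0 is rejected.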
Select Analyze → ANOVA → One-Way ANOVA. (Figure 6.15)
Figure 6.15: Open a Task
Click Data and select the dependent and independent variables. In this case, Weight is the Dependent
variable and Displacement is the Independent variable. (Figure 6.16)
Page | 72
Figure 6.16: Select variables
Click Tests and select the tests for equal variance. (Figure 6.17)
Figure 6.17: Tests
Page | 73
Click Means → Comparison, and then select the comparison method and confidence level you want to use. We
stick with the 95% confidence level. (Figure 6.18)
Figure 6.18: Comparison
Click Breakdown and select the statistics for qualitative variables that you want in the report. (Figure
6.19)
Figure 6.19: Breakdown
Page | 74
Click Plots and select between the two types (Box and Whisker or Means) that you want to display in
your result. (Figure 6.20)
Figure 6.20: Plots
Customize your titles and click Run. See the results below. (Figure 6.21)
Figure 6.21: Results
Page | 75
6.3 Nonparametric One-Way ANOVA
This type of test allows you to implement nonparametric tests for location and scale when you have a
continuous dependent variable and a single independent variable.
In statistical inference, or hypothesis testing, the traditional tests are called parametric tests
because they depend on the specification of a probability distribution except for a set of free
parameters. Parametric tests are said to depend on distributional assumptions; nonparametric tests
do not require them.
Nonparametric methods are often almost as powerful as parametric methods, even when the data are
normally distributed.
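SAS's nonparametric one-way task includes the Kruskal-Wallis test, which compares groups using ranks rather than raw values. A minimal Python sketch, with the same hypothetical groups as before:

```python
from scipy import stats

# Hypothetical data for three groups (no distributional assumption needed).
group_a = [2100, 2250, 2180, 2300, 2220]
group_b = [2900, 3050, 2980, 3100, 3020]
group_c = [3800, 3950, 3880, 4000, 3920]

# Kruskal-Wallis H test: H0 is that all groups share the same distribution.
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(p_value < 0.05)  # True here: the groups differ
```

Because the test uses only ranks, it reaches the same reject decision as the parametric ANOVA on this data without assuming normality.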
Select Analyze → ANOVA → Nonparametric One-Way ANOVA. (Figure 6.22)
Figure 6.22: Open a Task
Click Data and select the dependent and independent variables. (Figure 6.23)
Page | 76
Figure 6.23: Select variables
Click Analysis and select the test scores you want in your results. (Figure 6.24)
Figure 6.24: Analysis Tests
Page | 77
Then click on Extract p-values. (Figure 6.25)
Figure 6.25: Extract p-values
Customize your titles and click Run. See the results. (Figure 6.26)
Figure 6.26: Results
Page | 78
6.4 Linear models
The Linear Models task is used to perform an analysis of variance when you have a continuous
dependent variable with classification variables, quantitative variables, or both.
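To see the kind of model this task fits, here is a minimal least-squares sketch in Python with one dummy-coded classification variable and one quantitative variable. The data is synthetic and noise-free, so the coefficients are recovered exactly; this illustrates the model form, not the SAS procedure itself:

```python
import numpy as np

# Hypothetical data: a continuous response, one classification variable
# (group A/B, dummy-coded as 0/1) and one quantitative variable x.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
# Response built noise-free as 5 + 3*x + 10*group, so the fit is exact.
y = 5 + 3 * x + 10 * group

# Design matrix: intercept, quantitative effect, group effect.
X = np.column_stack([np.ones_like(x), x, group])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [5, 3, 10]
```

The fitted coefficients separate the intercept, the slope on the quantitative variable, and the shift attributable to the classification variable, which is exactly the decomposition a linear-models analysis reports.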
Select Analyze → ANOVA → Linear Models. (Figure 6.27)
Figure 6.27: Open a Task
Click Data and select the dependent variable. (Figure 6.28)
Figure 6.28: Select variables
Page | 79
Click Model Options and select the hypothesis test options that you want in your result. (Figure 6.29)
Figure 6.29: Select model options
Customize your titles and click Run. See the results. (Figure 6.30)
Figure 6.30: Linear Model Results
Page | 80
6.5 Mixed models
The Mixed Models task is used to provide facilities for fitting a number of basic mixed models. These
models enable you to handle both fixed effects and random effects in a linear model for a continuous
response. Numerous experimental contrives produce data for which coalesced models are appropriate.
Select Analyze → ANOVA → Mixed Models. (Figure 6.31)
Figure 6.31: Open a Task
Click Data and select the dependent variable and the quantitative variables you want to analyze. (Figure
6.32)
Page | 81
Figure 6.32: Select variables
Customize your titles and click Run. See the results as shown below. (Figure 6.33)
Figure 6.33: Mixed Model Results
Page | 82
Chapter 07
Limits (Confidence / Prediction)
Dean Squad
Eric Plaskacz
Christina Mofid
Edison Nguyen
Marissa Shaver
Alexandra Wackett
Page | 83
Confidence Limits
Definition
A confidence interval is the likely range of the true value; since there is only one true value, the
confidence interval defines a range where it is likely to lie. Most often, confidence intervals are
constructed at the 95% level and are called 95% confidence intervals. This means that, on average,
95% of such ranges will capture the true population mean, while 5% of them, on average, will not.
Confidence intervals are used because it might not be possible to measure everyone in a given
population, simply because of a lack of resources. By using confidence intervals, however, it is possible
to use a sample of the population to calculate a range within which the population mean is likely to fall.
Confidence Limits – Confidence limits are the upper and lower boundaries of the interval.
Width of Confidence Intervals
Confidence intervals give us a range with upper and lower boundaries. If the interval is narrow, meaning
a small difference between the upper and lower boundaries, then the study was likely large and the
estimate of the true value is precise. If the confidence interval is wide, then the study was most likely
small and the estimate of the true value is imprecise.
Prediction Intervals
Definition
A prediction interval is a range that tells you where you can expect to see future observations. These
intervals are useful for determining what future values should be, based on present or past data. They
are useful because they predict future data points before the information is even collected,
as opposed to having to wait to collect it. Since there is extra uncertainty about what any single future
observation will be, the prediction interval is always wider than the confidence interval.
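The two intervals can be compared with a short Python sketch; the sample values are hypothetical, not the beer data:

```python
import math
from scipy import stats

# Hypothetical sample (e.g. monthly sales figures).
sample = [170, 182, 195, 188, 176, 190, 184, 179, 192, 186]
n = len(sample)
mean = sum(sample) / n
sd = stats.tstd(sample)                  # sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)    # 95% two-sided critical value

# 95% confidence interval half-width for the population mean.
ci_half = t_crit * sd / math.sqrt(n)
# 95% prediction interval half-width for a single future observation.
pi_half = t_crit * sd * math.sqrt(1 + 1 / n)

print(f"CI: {mean - ci_half:.1f} to {mean + ci_half:.1f}")
print(f"PI: {mean - pi_half:.1f} to {mean + pi_half:.1f}")
```

Because of the extra sqrt(1 + 1/n) factor, the prediction interval always comes out wider than the confidence interval, no matter what sample is used.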
Page | 84
Example in SAS EG
Beer Sales Data
The data set shows monthly sales of beer in hectoliters. The average high and low temperatures within the region are also recorded over five years.
The data shows the trend of beer sales and its relationship with temperature.
This chapter will focus on computing confidence and prediction intervals as well as interpreting the associated output.
How to Complete in SAS EG
Open the dataset: beer_sales.sas,
As you can see from the raw data, an increase in temperature is strongly and positively correlated with beer sales.
If we make a simple line plot before we start computing confidence intervals, it will give us a better sense of the information we are looking at.
Click Task → Graph → Line Plot.
After selecting the first line plot, add High Temp to the horizontal axis (independent variable) and Sales to the vertical axis (dependent variable).
Make sure to change the appropriate titles and footnotes in the Properties tab. Click Run.
Figure 7.1
Page | 85
The results:
Therefore, as temperature increases, beer sales increase as well.
Confidence Limits
Confidence limits represent the high and low values of the range.
To compute confidence limits, click Task → Describe → Distribution Analysis.
To compute confidence limits on sales, drag Sales to the task role pane under Analysis variables.
Click Run.
Figure 7.2
Figure 7.3
Page | 86
Output Analysis (below)
From the output generated, the confidence limits provide a range around the sample mean of the data
(186.9): assuming a normal distribution, there is 95% confidence that the limits of 177.84 and 195.97
capture the population mean.
The probability of the interval missing the true mean by chance alone is 5%.
If more data on beer sales is collected, the width of the confidence interval is expected to decrease.
Limits on other variables, including month and temperature, can be computed by changing the
analysis variable accordingly.
Figure 7.4
Page | 87
Appendix
Team Contributions
A simple breakdown by each team, showing how the work was
distributed among themselves:
Team 1 – Numerical Summaries
Definition – Jaswant Seahra, Mriseal Sinha
Example – Baljeet Kaur, Jaswant Seahra, Theo Wolski
Implementing with SAS – Trystan Macdonald, Surbhi Surbhi
Documenting design process – Trystan Macdonald, Theo Wolski
Defining SAS results – Trystan Macdonald
Conclusion – Trystan Macdonald, Theo Wolski, Jaswant Seahra
Compilation – Theo Wolski, Surbhi Surbhi
Page | 88
Blueprint – Variation Within The Data
Definition – Paramjeet Kaur, Christopher Atkinson
Example – Fredric Ayih, Danusha Fernando
Designing within SAS – Danusha Fernando , Christopher Atkinson
Documenting design process – Gauvtam Bajaaj, Frederic Ayih
Defining SAS results – Fredric Ayih , Paramjeet Kaur, Gauvtam Bajaaj
Conclusion – Christopher Atkinson , Gauvtam Bajaaj
Compilation – Danusha Fernando , Paramjeet Kaur
Spice Girls – Confidence Intervals
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone
Page | 89
Sukhoi – Simple Regression
Definition – Kalpesh Patel, Ishan Sangrai, Sheleena Jaria
Example – Kalpesh Patel, Ishan Sangrai, Sheleena Jaria
Implementing within SAS – Amit Bansal, Pranay Sankhe
Documenting design process – Amit Bansal, Pranay Sankhe
Defining SAS results – Amit Bansal, Pranay Sankhe
Conclusion – Amit Bansal, Ishan Sangrai
Compilation – Amit Bansal, Ishan Sangrai
Fusion – Correlation Coefficient (r)
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone
Page | 90
Gotcha – Test Of Significance
Definition – Hasan Can, Michell Escutia, Lily Xu
Example – Luz Alvarez
Implementing within SAS – Luz Alvarez
Documenting design process – Luz Alvarez, Hasan Can, Michell Escutia
Defining SAS results – Luz Alvarez, Lily Xu, Sharon Yang
Conclusion – Lily Xu, Sharon Yang
Compilation – Luz Alvarez, Lily Xu, Sharon Yang, Hasan Can, Michell Escutia
Dean Squad – Limits (Confidence / Prediction)
Definition – Everyone
Example – Everyone
Implementing within SAS – Everyone
Documenting design process – Everyone
Defining SAS results – Everyone
Conclusion – Everyone
Compilation – Everyone