advanced data analysis binder 2015
TRANSCRIPT
7/25/2019 Advanced Data Analysis Binder 2015
School of Business Management, NMIMS University
Advanced Data Analysis Using SPSS
By Prof. Shailaja Rego
Year 2015
Trimester IV
Contents

Sr. No.  Topic                                            Page No.
1        Teaching Plan                                    3-4
2        Charts                                           5-7
3        Nonparametric Tests                              8-28
4        Parametric & Nonparametric Tests Using SPSS      29-42
5        Multivariate Analysis: Introduction              43-48
6        Multiple Regression Analysis                     48-74
7        Discriminant Analysis                            75-89
8        Logistic Regression                              89-97
9        MANOVA                                           97-100
10       Factor Analysis                                  100-116
11       Canonical Correlation                            116-120
12       Cluster Analysis                                 120-142
13       Conjoint Analysis                                142-149
14       Multidimensional Scaling (MDS)                   149-151
15       Classification & Regression Trees (CART)         152-165
Advanced Methods of Data Analysis
School of Business Management, NMIMS University
MBA (Full Time) Course: 2013-2014, Trim: IV
Instructor: Ms Shailaja Rego
Email Id: [email protected], [email protected]
Extn no. 5873
Objectives:
This course aims at developing an understanding of multivariate techniques by the manual method and by using SPSS or another computer-assisted tool for carrying out the analysis, and at interpreting the results to assist business decisions.

Prerequisite: Statistical Analysis for Business Decisions

Evaluation Criteria (in %):
Quiz          : 20%
Project       : 20%
Assignment    : 20%
End Term Exam : 40%
Session Details:

Session  Topic
1        Introduction to multivariate data analysis, dependence & interdependence techniques; introduction to SPSS; two-sample tests, one-way ANOVA, n-way ANOVA using SPSS and interpretation of the SPSS output; discussion on assumptions of these tests.
         (Chapter 11 & Chapter 12, Appendix III - IBM SPSS of Business Research Methodology by Srivastava & Rego)
2        Introduction to SPSS and R (open-source statistical software)
         (Appendix II of Business Research Methodology by Srivastava & Rego, and the R manual)
3 & 4    Introduction to Econometrics; multiple regression analysis; concepts of Ordinary Least Square Estimate (OLSE) & Best Linear Unbiased Estimate (BLUE), with practical examples in business.
         (Read: H & A Chapter 4, pg. 193-288)
5        Multiple regression assumptions - heteroscedasticity, autocorrelation & multicollinearity.
         (Read: H & A Chapter 4, pg. 193-288)
6        Multiple regression case problems, dummy variables, outlier analysis.
         (Read: H & A Chapter 4, pg. 193-288)
7 & 8    Factor Analysis: introduction to factor analysis, objectives of factor analysis, designing a factor analysis, assumptions in a factor analysis, factors & assessing overall fit, interpretation of the factors.
         (Read: H & A Chapter 3, pg. 125-165)
9        Case problem based on Factor Analysis
10 & 11  Introduction to Cluster Analysis: objective of cluster analysis, research design of cluster analysis, assumptions in a cluster analysis; employing hierarchical and non-hierarchical methods; interpretation of the clusters formed.
         (Read: H & A Chapter 8, pg. 579-622)
mailto:[email protected],mailto:[email protected],mailto:[email protected],mailto:[email protected], -
7/25/2019 Advanced Data Analysis Binder 2015
4/165
4
12       Case problem based on Cluster Analysis
         (Read: H & A Chapter 8, pg. 579-622)
13       Introduction to Discriminant Analysis: objectives of discriminant analysis, research design for discriminant analysis, assumptions of discriminant analysis; estimation of the discriminant model & assessing the overall fit; applications, interpretation & hypothetical illustrations.
         (Read: H & A Chapter 5, pg. 297-336)
14       Introduction to the Logistic Regression model & assessing the overall fit; interpretation & hypothetical illustrations.
         (Read: H & A Chapter 5, pg. 293-336)
15       Applications, interpretation & hypothetical illustrations; case problems on Logistic Regression.
         (Read: H & A Chapter 5, pg. 293-336)
16 & 17  Classification & Regression Trees (CART) models, Random Forests models; applications, interpretation & case problems.
18       Introduction to Conjoint Analysis: objectives of conjoint analysis, designing a conjoint analysis, assumptions in conjoint analysis, interpreting the results.
         (Read: H & A Chapter 7, pg. 483-547)
19 & 20  Guest sessions from industry professionals in Business Analytics

The different multivariate techniques will also be interpreted using SPSS and R.

Text Book:
1. Multivariate Data Analysis by Hair & Anderson (H & A)

Reference Books:
1. Business Research Methodology by Srivastava & Rego
2. Statistics for Management by Srivastava & Rego
3. Analyzing Multivariate Data by Johnson & Wichern
4. Multivariate Analysis by Subhash Sharma
5. Market Research by Naresh Malhotra
6. Statistics for Business & Economics by Douglas & Lind
7. Market Research by Rajendra Nargundkar
8. Statistics for Management by Aczel & Sounderpandian
Hypothesis Testing: Univariate Techniques

Parametric:
- One Sample: t Test, Z test
- Two Sample:
  - Independent Samples: t Test, Z test
  - Dependent Samples: Paired t Test

Nonparametric:
- One Sample: Chi-Square, K-S, Runs, Binomial
- Two Sample:
  - Independent Samples: Chi-Square, Mann-Whitney, Median, K-S
  - Dependent Samples: Sign, Wilcoxon, Chi-Square, McNemar
Multivariate Techniques

Dependence Techniques:
- One Dependent Variable - Techniques: Cross Tabulation, ANOVA, ANOCOVA, Multiple Regression, Two-Group Discriminant Analysis, Logit Analysis, Conjoint Analysis
- More than One Dependent Variable - Techniques: MANOVA, MANOCOVA, Canonical Correlation, Multiple Discriminant Analysis

Interdependence Techniques:
- Variable Interdependence - Techniques: Factor Analysis
- Inter-object Similarity - Techniques: Cluster Analysis, MDS
Metric Dependent Variable

- One Independent Variable:
  - Binary: t Test
- More than One Independent Variable:
  - Categorical: Factorial - ANOVA
    - One Factor: One-way ANOVA
    - More than One Factor: N-way ANOVA
  - Categorical & Interval: ANOCOVA
  - Interval: Regression
NON-PARAMETRIC TESTS
Contents
1. Relevance - Advantages and Disadvantages
2. Tests for
   a. Randomness of a Series of Observations - Run Test
   b. Specified Mean or Median of a Population - Signed Rank Test
   c. Goodness of Fit of a Distribution - Kolmogorov-Smirnov Test
   d. Comparing Two Populations - Kolmogorov-Smirnov Test
   e. Equality of Two Means - Mann-Whitney (U) Test
   f. Equality of Several Means - Wilcoxon-Wilcox Test; Kruskal-Wallis Rank Sum (H) Test; Friedman's (F) Test (Two-Way ANOVA)
   g. Rank Correlation - Spearman's
   h. Testing Equality of Several Rank Correlations - Kendall's Rank Correlation Coefficient
   i. Sign Test
1. Relevance and Introduction
All the tests of significance discussed in Chapters X and XI are based on certain assumptions about the variables and their statistical distributions. The most common assumption is that the samples are drawn from a normally distributed population. This assumption is more critical when the sample size is small. When this assumption or other assumptions for the various tests described in those chapters are not valid or are doubtful, or when the available data is of ordinal (rank) type, we take the help of non-parametric tests. For example, in Student's t test for testing the equality of means of two populations based on samples from the two populations, it is assumed that the samples are from normal distributions with equal variance. If we are not too sure of the validity of this assumption, it is better to apply the test given in this chapter.
While the parametric tests refer to some parameters like the mean, standard deviation, correlation coefficient, etc., the non-parametric tests, also called distribution-free tests, are used for testing other features as well, like randomness, independence, association, rank correlation, etc.
In general, we resort to non-parametric tests when:
- The assumption of a normal distribution for the variable under consideration, or some other assumption for a parametric test, is not valid or is doubtful.
- The hypothesis to be tested does not relate to a parameter of a population.
- The numerical accuracy of the collected data is not fully assured.
- Results are required rather quickly through simple calculations.
However, the non-parametric tests have the following limitations or disadvantages:
- They ignore a certain amount of information.
- They are often not as efficient or reliable as parametric tests.
The above advantages and disadvantages are consistent with the general premise in statistics that a method that is easier to calculate does not utilize the full information contained in a sample and is less reliable. The use of non-parametric tests thus involves a tradeoff: while efficiency or reliability is lost to some extent, the ability to use less information and to calculate faster is gained.
There are a number of tests in the statistical literature. However, we have discussed only the following tests.

Types and Names of Tests for:
- Randomness of a Series of Observations - Run Test
- Specified Mean or Median of a Population - Signed Rank Test
- Goodness of Fit of a Distribution - Kolmogorov-Smirnov Test
- Comparing Two Populations - Kolmogorov-Smirnov Test
- Equality of Two Means - Mann-Whitney (U) Test
- Equality of Several Means - Wilcoxon-Wilcox Test; Kruskal-Wallis Rank Sum (H) Test; Friedman's (F) Test
- Rank Correlation - Spearman's

We now discuss certain popular and most widely used non-parametric tests in the subsequent sections.
2. Test for Randomness in a Series of Observations: The Run Test
This test has been evolved for testing whether the observations in a sample occur in a certain order or occur in a random order. The hypotheses are:
Ho: The sequence of observations is random
H1: The sequence of observations is not random
The only condition for validity of the test is that the observations in the sample be obtained under similar conditions.
The procedure for carrying out this test is as follows.
First, all the observations are arranged in the order they are collected. Then the median is calculated. All the observations in the sample larger than the median value are given a + sign and those below the median are given a - sign. If there is an odd number of observations, the median observation is ignored. This ensures that the number of + signs is equal to the number of - signs.
A succession of values with the same sign is called a "run", and the number of runs, say R, gives an idea of the randomness of the observations. This is the test statistic. If the value of R is low, it indicates a certain trend in the observations, and if the value of R is high, it indicates the presence of some factor causing regular fluctuations in the observations.
The test procedure is numerically illustrated below.
Illustration 1
The Director of an institute wanted to browse the answer books of 20 students out of the answer books of 250 students in the paper of Statistical Methods at the management institute. The Registrar selected 20 answer books one by one from the entire lot of 250 answer books. The books had the marks obtained by the students indicated on the first page. The Director wanted to ascertain whether the Registrar had selected the answer books in a random order. He used the above test for this purpose.
He found out that the marks, given below, have a median of 76. The observations less than the median are marked with the sign -, and the observations more than the median are marked with the sign +.
[Sequence of 10 + signs and 10 - signs in the order of selection, grouped into runs numbered (1) to (10)]
It may be noted that the number of (+) observations is equal to the number of (-) observations, both being equal to 10. As defined above, a succession of values with the same sign is called a run. Thus the first run comprises observations 58 and 61, with the - sign; the second run comprises only one observation, i.e. 78, with the + sign; the third run comprises three observations, 72, 69 and 65, with the negative sign; and so on.
Total number of runs R = 10.
This value of R lies inside the acceptance interval, found from the appropriate table as 7 to 15 at the 5% level of significance. Hence the hypothesis that the sample is drawn in a random order is accepted.
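The run count R described above can be sketched in a few lines of Python (a hedged illustration, not from the binder; since the full marks data of Illustration 1 is not reproduced here, a hypothetical sequence is used):

```python
def count_runs(values):
    """Run test statistic: count runs of consecutive + / - signs,
    where + means 'above the sample median' and - means 'below'.
    Values equal to the median are ignored, as in the text."""
    s = sorted(values)
    n = len(s)
    # sample median (average of the middle two values for even n)
    median = (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]
    signs = ['+' if v > median else '-' for v in values if v != median]
    runs = 1
    for prev, cur in zip(signs, signs[1:]):
        if cur != prev:          # a sign change starts a new run
            runs += 1
    return runs

# Hypothetical alternating sequence: every sign change starts a run, so R is high
print(count_runs([10, 90, 12, 88, 15, 85, 11, 91]))  # -> 8
```

A strongly trending sequence such as [1, 2, 3, 4, 10, 11, 12, 13] gives only R = 2, illustrating how a low R signals a trend.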
Applications:
(i) Testing Randomness of Stock Rates of Return
The number-of-runs test can be applied to a series of a stock's rates of return, for each of the trading days, to see whether the rates of return are random or exhibit a pattern that could be exploited for earning profit.
(ii) Testing the Randomness of the Pattern Exhibited by Quality Control Data Over Time
If a production process is in control, the sample values should be randomly distributed above and below the center line of a control chart. Please refer to Chapter XVII on Industrial Statistics. We can use this test for testing whether the pattern of, say, 10 sample observations, taken over time, is random.
3. Test for Specified Mean or Median of a Population - The Signed Rank Test
This test has been evolved to investigate the significance of the difference between a population mean or median and a specified value of the mean or median, say m0.
The hypotheses are as follows:
Null hypothesis Ho: m = m0
Alternative hypothesis H1: m ≠ m0
The test procedure is numerically illustrated below.
Illustration 2
In the above example relating to answer books of MBA students, the Director further desired to have an idea of the average marks of students. When he enquired with the concerned Professor, he was informed that the Professor had not calculated the average but felt that the mean would be 70 and the median would be 65. The Director wanted to test this, and asked for a sample of 10 randomly selected answer books. The marks on those books are tabulated below as xi's.
Null hypothesis Ho: m = 70
Alternative hypothesis H1: m ≠ 70
Sample values are as follows. Any sample values equal to m0 are to be discarded from the sample.
xi (Marks)                       55   58   63   78   72   69   64   79   75   80
xi - m0                         -15  -12   -7   +8   +2   -1   -6   +9   +5  +10
|xi - m0|*                       15   12    7    8    2    1    6    9    5   10
Ranks of |xi - m0|                1    2    6    5    9   10    7    4    8    3
Ranks with signs of (xi - m0)    -1   -2   -6   +5   +9  -10   -7   +4   +8   +3

* absolute value or modulus of (xi - m0)
Now, Sum of (+) ranks = 29; Sum of (-) ranks = 26.
Here, the statistic T is defined as the minimum of the sum of positive ranks and the sum of negative ranks.
Thus, T = minimum of 29 and 26 = 26
The critical value of T at the 5% level of significance is 8 (for n = number of values ranked = 10).
Since the calculated value 26 is more than the critical value, the null hypothesis is not rejected. Thus the Director does not have sufficient evidence to contradict the Professor's guess about the mean marks being 70.
It may be noted that the rejection criterion in rank methods is the reverse of that in parametric tests: here the null hypothesis is rejected if the calculated value is smaller than the critical value.
In the above example, we have tested a specified value of the mean. If a specified value of the median is to be tested, the test is exactly similar; only the word mean is substituted by median. For example, if the Director wanted to test whether the median of the marks is 70, the test would have resulted in the same values and the same conclusions.
It may be verified that the Director does not have sufficient evidence to contradict the Professor's guess about the median marks being 65.
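The statistic T of Illustration 2 can be reproduced with a short sketch (my code, not the binder's; it uses the conventional ascending ranking of |xi - m0|, whereas the table above ranks in descending order - for this data T comes out the same either way):

```python
def signed_rank_T(xs, m0):
    """Wilcoxon signed-rank statistic T = min(sum of + ranks, sum of - ranks).
    Differences equal to zero are discarded, as in the text."""
    diffs = [x - m0 for x in xs if x != m0]
    ranked = sorted(diffs, key=abs)          # smallest |difference| first
    # rank i+1 for the i-th smallest |difference| (no ties in this sample)
    pos = sum(i + 1 for i, d in enumerate(ranked) if d > 0)
    neg = sum(i + 1 for i, d in enumerate(ranked) if d < 0)
    return min(pos, neg)

marks = [55, 58, 63, 78, 72, 69, 64, 79, 75, 80]   # Illustration 2
print(signed_rank_T(marks, 70))  # -> 26; 26 > critical value 8, so Ho is not rejected
```

Note that with ties among the |xi - m0| values, average ranks would have to be assigned, as in the later sections.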
Example 1
The following table gives the real annual income that senior managers actually take home in certain countries, including India. These figures have been arrived at by the US-based Hay Group, in 2006, after adjusting for the cost of living, rental expenses and purchasing power parity.

Overall Rank   Country        Amount (in Euros)
5              Brazil         76,449
26             China          42,288
8              Germany        75,701
2              India          77,665
9              Japan          69,634
3              Russia         77,355
4              Switzerland    76,913
1              Turkey         79,021
23             UK             46,809
13             USA            61,960

Test whether 1) the mean income is equal to 70,000, and 2) the median value is 70,000.
It may be verified that both the hypotheses are rejected.

4. Test for Goodness of Fit of a Distribution (One Sample) - Kolmogorov-Smirnov Test
In Chapter X, we have discussed the χ2 test as the test for judging the goodness of fit of a distribution. However, it is assumed there that the observations come from a normal distribution.
The K-S test is used to investigate the significance of the difference between the observed and expected cumulative distribution functions for a variable with a specified theoretical distribution, which could be Binomial, Poisson, Normal or Exponential. It tests whether the observations could reasonably have come from the specified distribution. Here,
Null hypothesis Ho: The sample comes from the specified population
Alternative hypothesis H1: The sample does not come from the specified population
The testing procedure envisages calculation of the observed and expected cumulative distribution functions, denoted by Fo(x) and Fe(x) respectively, derived from the sample. The comparison of the two distributions over the various values of the variable is measured by the test statistic

D = max | Fo(x) - Fe(x) |     (1)
If the value of D is small, the null hypothesis is likely to be accepted. But if the difference is large, it is likely to be rejected. The procedure is explained below for testing the fit of a uniform distribution (vide Section 7.6.1 of Chapter VII on Statistical Distributions).
Illustration 3
Suppose we want to test whether availing of an educational loan by the students of 5 management institutes is independent of the institute in which they study.
The following data gives the number of such students from each of the five institutes, viz. A, B, C, D and E. These students were out of 60 students selected randomly at each institute.
The relevant data and calculations for the test are given in the following table.

                                      Institutes
                                A      B      C      D      E
Out of groups of 60 students    5      9      11     16     19
Observed cumulative
distribution function Fo(x)     5/60   14/60  25/60  41/60  60/60
Expected cumulative
distribution function Fe(x)     12/60  24/60  36/60  48/60  60/60
| Fo(x) - Fe(x) |               7/60   10/60  11/60  7/60   0

From the last row of the above table, it is observed that
Maximum value of D, i.e. Max D = 11/60 = 0.183
The tabulated value of D at the 5% level of significance for n = 60 is 0.175.
Since the calculated value is more than the tabulated value, the null hypothesis is rejected. Thus, availing of a loan by the students of the 5 management institutes is not independent of the institute in which they study.
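The calculation of Illustration 3 can be sketched as follows (my code, not the binder's; exact fractions are used so that Max D prints as 11/60):

```python
from fractions import Fraction

def ks_one_sample(observed, expected):
    """One-sample K-S statistic D = max |Fo(x) - Fe(x)| from grouped counts,
    following Illustration 3 (uniform expected counts)."""
    n = sum(observed)
    cum_o, cum_e, diffs = 0, 0, []
    for o, e in zip(observed, expected):
        cum_o += o                 # running observed count
        cum_e += e                 # running expected count
        diffs.append(abs(Fraction(cum_o, n) - Fraction(cum_e, n)))
    return max(diffs)

# Loan counts at institutes A-E, against a uniform expectation of 12 each
D = ks_one_sample([5, 9, 11, 16, 19], [12, 12, 12, 12, 12])
print(D, float(D) > 0.175)  # -> 11/60 True (Ho rejected at the 5% level)
```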
Comparison of the χ2 and K-S Tests
The Chi-square test is the most popular test of goodness of fit. On comparing the two tests, we note that the K-S test is easier to apply. While the χ2 test is specially meant for categorical data, the K-S test is applicable for random samples from continuous populations. Further, the K-S statistic utilises each of the n observations. Hence, the K-S test makes better use of the available information than the Chi-square statistic.
5. Comparing Two Populations - Kolmogorov-Smirnov Test
This test is used for testing whether two samples come from identical population distributions. The hypotheses are:
H0: F1(x) = F2(x), i.e. the two populations are the same.
H1: F1(x) ≠ F2(x), i.e. the two populations are not the same.
There are no assumptions to be made about the populations. However, for reliable results, the samples should be sufficiently large, say 15 or more.
The procedure for carrying out this test is as follows.
Given samples of size n1 and n2 from the two populations, the cumulative distribution functions F1(x) and F2(x)
can be determined and plotted. The maximum value of the difference between the plotted values can then be found and compared with a critical value obtained from the concerned table. If the observed value exceeds the critical value, the null hypothesis that the two population distributions are identical is rejected.
The test is explained through the illustration given below.
Illustration 4
At one of the management institutes, a sample of 30 Second Year MBA students (15 from a commerce background and 15 from an engineering background) was selected, and data was collected on their background and their CGPA scores at the end of the first year. The data is given as follows.

Sr No   CGPA Commerce   CGPA Engineering
1       3.24            2.97
2       3.14            2.92
3       3.72            3.03
4       3.06            2.79
5       3.14            2.77
6       3.14            3.11
7       3.06            3.33
8       3.17            2.65
9       2.97            3.14
10      3.14            2.97
11      3.69            3.39
12      2.85            3.08
13      2.92            3.30
14      2.79            3.25
15      3.22            3.14

Here, we wish to test whether the CGPAs of students with commerce and engineering backgrounds follow the same distribution.
The value of the statistic D is calculated by preparing the following table.
Sr   CGPA   Cum. score   Fi(C) =           Sr   CGPA   Cum. score   Fi(E) =           Difference
No   Com    of CGPAs     Col.(iii)/47.25   No   Eng    of CGPAs     Col.(vii)/45.84   Di = Fi(C) - Fi(E)
(i)  (ii)   (iii)        (iv)              (v)  (vi)   (vii)        (viii)            (ix)
14   2.79   2.79         0.0590            8    2.65   2.65         0.0578            0.0012
12   2.85   5.64         0.1194            5    2.77   5.42         0.1182            0.0012
13   2.92   8.56         0.1812            4    2.79   8.21         0.1791            0.0021
9    2.97   11.53        0.2440            2    2.92   11.13        0.2428            0.0012
4    3.06   14.59        0.3088            1    2.97   14.10        0.3076            0.0012
7    3.06   17.65        0.3735            10   2.97   17.07        0.3724            0.0011
2    3.14   20.79        0.4400            3    3.03   20.10        0.4385            0.0015
5    3.14   23.93        0.5065            12   3.08   23.18        0.5057            0.0008
6    3.14   27.07        0.5729            6    3.11   26.29        0.5735            -0.0006
10   3.14   30.21        0.6394            9    3.14   29.43        0.6420            -0.0026
8    3.17   33.38        0.7065            15   3.14   32.57        0.7105            -0.0040
15   3.22   36.60        0.7746            14   3.25   35.82        0.7814            -0.0068
1    3.24   39.84        0.8432            13   3.30   39.12        0.8534            -0.0102
11   3.69   43.53        0.9213            7    3.33   42.45        0.9260            -0.0047
3    3.72   47.25        1                 11   3.39   45.84        1                 0
It may be noted that the observations (scores) have been arranged in ascending order for both the groups of students.
The calculation of the values in the different columns is explained below.
Col. (iii): The first cumulative score is the same as the score in Col. (ii). The second cumulative score is obtained by adding the first score to the second score, and so on. The last, i.e. 15th, cumulative score is obtained by adding the fourteen previous scores to the fifteenth score.
Col. (iv): The cumulative distribution function for any observation in the second column is obtained by dividing the cumulative score for that observation by the cumulative score of the last, i.e. 15th, observation.
Col. (vii) and Col. (viii) values are obtained just like the values in Col. (iii) and Col. (iv).
The test statistic D is calculated from the last column of the above table as
D = maximum of |Di| = maximum of |Fi(C) - Fi(E)| = 0.0102
Since the calculated value of the statistic D, 0.0102, is less than its tabulated value 0.457 at the 5% level (for n = 15), the null hypothesis that both samples come from the same population is not rejected.
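For comparison, here is a minimal sketch (my code, not the binder's) of the standard ECDF-based two-sample K-S statistic on the CGPA data. Note that Illustration 4 compares cumulative-score curves instead, so the numbers differ, but the conclusion at the 5% critical value 0.457 is the same:

```python
def ks_two_sample(xs, ys):
    """Standard two-sample K-S statistic: the maximum gap between the two
    empirical CDFs, evaluated at every observed value. This is the form
    used by most software (e.g. SPSS, R's ks.test)."""
    points = sorted(set(xs) | set(ys))
    d = 0.0
    for t in points:
        f1 = sum(x <= t for x in xs) / len(xs)   # ECDF of sample 1 at t
        f2 = sum(y <= t for y in ys) / len(ys)   # ECDF of sample 2 at t
        d = max(d, abs(f1 - f2))
    return d

com = [3.24, 3.14, 3.72, 3.06, 3.14, 3.14, 3.06, 3.17, 2.97, 3.14,
       3.69, 2.85, 2.92, 2.79, 3.22]
eng = [2.97, 2.92, 3.03, 2.79, 2.77, 3.11, 3.33, 2.65, 3.14, 2.97,
       3.39, 3.08, 3.30, 3.25, 3.14]
print(round(ks_two_sample(com, eng), 4))  # -> 0.2, still below 0.457
```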
6. Equality of Two Means - Mann-Whitney U Test
This test is used with two independent samples. It is an alternative to the t test, without the latter's limiting assumption that the samples come from normal distributions with equal variance.
For using the U test, all observations are combined and ranked as one group of data, from smallest to largest. The largest negative score receives the lowest rank. In case of ties, the average rank is assigned. After the ranking, the rank values for each sample are totaled. The U statistic is calculated as follows:
U1 = n1 n2 + n1(n1 + 1)/2 - R1     (2)
or,
U2 = n1 n2 + n2(n2 + 1)/2 - R2     (3)
where,
n1 = number of observations in sample 1; n2 = number of observations in sample 2
R1 = sum of ranks in sample 1; R2 = sum of ranks in sample 2.
For testing purposes, the smaller of the above two U values is used.
The test is explained through a numerical example given below.
Example 2
Two equally competent groups of 10 salespersons were imparted training by two different methods, A and B. The following data gives the sales of a brand of paint, in 5 kg tins, per week per salesperson after one month of receiving training. Test whether both the methods of imparting training are equally effective.

Salesman    Training Method A   Salesman    Training Method B
Sr. No.     Sales               Sr. No.     Sales
1           1,500               1           1,340
2           1,540               2           1,300
3           1,860               3           1,620
4           1,230               4           1,070
5           1,370               5           1,210
6           1,550               6           1,170
7           1,840               7           950
8           1,250               8           1,380
9           1,300               9           1,460
10          1,710               10          1,030
Solution: Here, the hypothesis to be tested is that both the training methods are equally effective, i.e.
H0: m1 = m2
H1: m1 ≠ m2
where m1 is the mean sales of salespersons trained by method A, and m2 is the mean sales of salespersons trained by method B.
The following table, giving the sales values for both the groups as also the combined rank of sales for each of the salesmen, is prepared to carry out the test.
Salesman    Training Method A              Salesman    Training Method B
Sr. No.     Sales        Combined          Sr. No.     Sales        Combined
            (5 kg tins)  Rank                          (5 kg tins)  Rank
1           1,500        14                1           1,340        10
2           1,540        15                2           1,300        8.5
3           1,860        20                3           1,620        17
4           1,230        6                 4           1,070        3
5           1,370        11                5           1,210        5
6           1,550        16                6           1,170        4
7           1,840        19                7           950          1
8           1,250        7                 8           1,380        12
9           1,300        8.5               9           1,460        13
10          1,710        18                10          1,030        2
Average     1,515                          Average     1,253
Sum of ranks R1 = 134.5                    Sum of ranks R2 = 75.5
U value     20.5                           U value     79.5
The U statistics for both the training methods are calculated as follows:
U1 = 10 x 10 + 10(10 + 1)/2 - 134.5 = 20.5
U2 = 10 x 10 + 10(10 + 1)/2 - 75.5 = 79.5
The tabulated or critical value of U with n1 = n2 = 10, for α = 0.05, is 23, for a two-tailed test.
It may be noted that in this test too, like the signed rank test in Section 3, the calculated value must be smaller than the critical value to reject the null hypothesis.
Since the calculated value 20.5 is smaller than the critical value 23, the null hypothesis that the two training methods are equally effective is rejected. It implies that Method A is superior to Method B.
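Equations (2) and (3) applied to Example 2 can be sketched as follows (pure Python; the function name is mine, and ties receive average ranks as described above):

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U via equations (2) and (3): rank the combined data
    (average ranks for ties), then U = n1*n2 + n(n+1)/2 - R."""
    combined = sorted(a + b)

    def rank(v):
        # average of the 1-based positions of v in the combined ordering
        idxs = [i + 1 for i, x in enumerate(combined) if x == v]
        return sum(idxs) / len(idxs)

    r1 = sum(rank(v) for v in a)
    r2 = sum(rank(v) for v in b)
    n1, n2 = len(a), len(b)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2

method_a = [1500, 1540, 1860, 1230, 1370, 1550, 1840, 1250, 1300, 1710]
method_b = [1340, 1300, 1620, 1070, 1210, 1170, 950, 1380, 1460, 1030]
print(mann_whitney_u(method_a, method_b))  # -> (20.5, 79.5)
```

The smaller value, 20.5, is the one compared against the critical value 23.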
7. Equality of Several Means - The Wilcoxon-Wilcox Test for Comparison of Multiple Treatments
This test is analogous to ANOVA, and is used to test the significance of the differences among the means of several groups recorded only in terms of ranks of observations within a group. However, if the original data is recorded in absolute values, it can be converted into ranks. The hypotheses, as in ANOVA, are:
H0: m1 = m2 = m3 = ...
H1: All means are not equal
However, there is one important difference between ANOVA and this test. While ANOVA tests only the equality of all means, this test goes beyond that to compare even the equality of all pairs of means.
The procedure for carrying out this test is illustrated with an example, given below.
Example 4
A prominent company is a dealer of a brand of LCD TV, and has showrooms in five different cities,
viz. Delhi (D), Mumbai (M), Kolkata (K), Ahmedabad (A) and Bangalore (B). Based on the data about sales in the showrooms for 5 consecutive weeks, the ranks of LCD TV sales in the five successive weeks are recorded as follows:

Week       D (Delhi)   M (Mumbai)   K (Kolkata)   A (Ahmedabad)   B (Bangalore)
1          1           5            3             4               2
2          2           4            3             5               1
3          1           3            4             5               2
4          1           4            3             5               2
5          2           3            5             4               1
Rank Sum*  7           19           18            23              8

* Rank Sum is calculated by adding the ranks for all the five weeks for each city.
From the values of Rank Sum, we calculate the net difference in Rank Sum for every pair of cities, and tabulate as follows:

Difference
in Rank Sums   D      M      K      A      B
D              0      12     11     16*    1
M              12     0      1      4      11
K              11     1      0      5      10
A              16*    4      5      0      15*
B              1      11     10     15*    0

The critical value for the difference in Rank Sums for number of cities = 5, number of observations for each city = 5, and the 5% level of significance is 13.6.
Comparing the calculated differences in rank sums with this value 13.6, we note that the difference in rank sums between A (Ahmedabad) and D (Delhi), as also the difference in rank sums between A (Ahmedabad) and B (Bangalore), is significant.
Note:
In the above case, if the data had been available in terms of actual values rather than ranks alone, ANOVA would just have led to the conclusion that the means of D, M, K, A and B are not all equal, but would not have gone beyond that. However, the above test concludes that the mean of Ahmedabad is not equal to the mean of Delhi, as also to the mean of Bangalore. Thus, it gives a comparison of all pairs of means.
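The rank-sum differences of Example 4 can be sketched as follows (pure Python; function and variable names are mine):

```python
def rank_sum_differences(rank_table):
    """Wilcoxon-Wilcox sketch: sum the weekly ranks per city, then form the
    absolute difference in rank sums for every pair of cities."""
    sums = {city: sum(ranks) for city, ranks in rank_table.items()}
    pairs = {}
    cities = list(rank_table)
    for i, c1 in enumerate(cities):
        for c2 in cities[i + 1:]:
            pairs[(c1, c2)] = abs(sums[c1] - sums[c2])
    return sums, pairs

ranks = {
    "D": [1, 2, 1, 1, 2], "M": [5, 4, 3, 4, 3], "K": [3, 3, 4, 3, 5],
    "A": [4, 5, 5, 5, 4], "B": [2, 1, 2, 2, 1],
}
sums, pairs = rank_sum_differences(ranks)
critical = 13.6   # from the table for 5 treatments, 5 blocks, 5% level
significant = [p for p, d in pairs.items() if d > critical]
print(sums)         # -> {'D': 7, 'M': 19, 'K': 18, 'A': 23, 'B': 8}
print(significant)  # -> [('D', 'A'), ('A', 'B')]
```

Only the A-D and A-B pairs exceed 13.6, matching the starred entries in the table.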
8. Kruskal-Wallis Rank Sum Test for Equality of Means of Several Populations (H Test) (One-Way ANOVA)
This test is used for testing the equality of means of a number of populations, and the null hypothesis is of the type
H0: m1 = m2 = m3 = ... (there can be even more than 3)
It may be recalled that H0 is the same as in ANOVA. However, here the ranks of the observations are used and not the actual observations.
The Kruskal-Wallis test is a one-factor or one-way ANOVA with the values of the variable as ranks. It is a generalization of the two-sample Mann-Whitney U rank sum test to situations involving more than two populations.
The procedure for carrying out the test involves assigning combined ranks to the observations in all
the samples from smallest to largest. The rank sum of each sample is then calculated. The test statistic H is calculated as follows:

H = [12 / (n(n + 1))] Σj=1..k (Tj^2 / nj) - 3(n + 1)     (4)

where
Tj = sum of ranks for treatment j
nj = number of observations for treatment j
n = Σ nj = total number of observations
k = number of treatments
The test is illustrated through an example given below.
Example 5
A chain of departmental stores opened three stores in Mumbai. The management wants to compare the sales of the three stores over a six-day-long promotional period. The relevant data is given below.
(Sales in Rs. Lakhs)

Store A   Store B   Store C
Sales     Sales     Sales
16        20        23
17        20        24
21        21        26
18        22        27
19        25        29
29        28        30

Use the Kruskal-Wallis test to compare the equality of mean sales in all the three stores.
Solution
The combined ranks of the sales of all the stores on all the six days are calculated and presented in the following table.

Store A              Store B              Store C
Sales   Combined     Sales   Combined     Sales   Combined
        Rank                 Rank                 Rank
16      1            20      5.5          23      10
17      2            20      5.5          24      11
21      7.5          21      7.5          26      13
18      3            22      9            27      14
19      4            25      12           29      16.5
29      16.5         28      15           30      18
T1 = 34.0            T2 = 54.5            T3 = 82.5
It may be noted that the sales value 21 in Store A and the sales value 21 in Store B are given equal ranks of 7.5. Since there are six ranks below, the rank for 21 would have been 7, but since the value 21 is repeated, both the values get the average of ranks 7 and 8, i.e. 7.5. The next value, 22, has been assigned the rank 9. If there were three values of 21, the rank assigned to each would have been the average of 7, 8 and 9, i.e. 8, and the next value would have been ranked 10.
Now the H statistic is calculated as
Check: the sum of all ranks = 1 + 2 + ... + 18 = 18 x 19 / 2 = 171 = 34.0 + 54.5 + 82.5.

H = [12 / (18 x 19)] x [(34.0^2 + 54.5^2 + 82.5^2) / 6] - 3 x 19
  = (12 / 342) x (10932.5 / 6) - 57
  = 63.93 - 57
  = 6.93

The value of χ2 at k - 1 = 2 d.f. and the 5% level of significance is 5.99. Since the calculated value of H, 6.93, is greater than the tabulated value and falls in the critical region, we reject the null hypothesis that all the three stores have equal mean sales.
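As a cross-check of equation (4) with the rank sums T1 = 34.0, T2 = 54.5, T3 = 82.5, here is a pure-Python sketch (my code; no tie-correction factor is applied, matching the worked example):

```python
def kruskal_wallis_h(groups):
    """H statistic per equation (4), with average ranks for tied values."""
    combined = sorted(v for g in groups for v in g)

    def rank(v):
        # average of the 1-based positions of v in the combined ordering
        idxs = [i + 1 for i, x in enumerate(combined) if x == v]
        return sum(idxs) / len(idxs)

    n = len(combined)
    s = 0.0
    for g in groups:
        t = sum(rank(v) for v in g)   # rank sum Tj for this treatment
        s += t * t / len(g)           # Tj^2 / nj
    return 12 / (n * (n + 1)) * s - 3 * (n + 1)

stores = [[16, 17, 21, 18, 19, 29],   # Store A
          [20, 20, 21, 22, 25, 28],   # Store B
          [23, 24, 26, 27, 29, 30]]   # Store C
print(round(kruskal_wallis_h(stores), 2))  # -> 6.93; above 5.99, reject Ho
```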
9 Test for Given Samples to be from the Same Population - Friedman's Test (Two-Way
ANOVA)
Friedman's test is a non-parametric test for testing hypothesis that a given number of samples have
been drawn from the same population. This test is similar to ANOVA but it does not require the
assumption of normality and equal variance. Further, this test is carried out with the data in terms of
ranks of observations rather than their actual values, like in ANOVA. It is used whenever the numberof samples is greater than or equal to 3 (say k) and each of the sample size is equal (say n) like two-
way analysis of variance. In fact, it is referred to as Two-Way ANOVA. The null hypothesis to be
tested is that all the k samples have come from identical populations.The use of the test is illustrated below through a numerical example.
Illustration 5

The following data give the percentage growth of sales of three brands of refrigerators, say A, B and C, over a period of six years.

Percentage Growth Rate of the Brands

Year   Brand A   Brand B   Brand C
1        15        14        32
2        10        19        30
3        15        11        27
4        13        19        38
5        18        20        33
6        27        20        22
In this case, the null hypothesis is that there is no significant difference among the growth rates of the three brands. The alternative hypothesis is that at least two samples (two brands) differ from each other.

Under the null hypothesis, Friedman's test statistic is:

F = [12 / (nk(k + 1))] x Sum(j = 1 to k) Rj^2 - 3n(k + 1)    (5)
where,
-
7/25/2019 Advanced Data Analysis Binder 2015
20/165
20
k = Number of samples (brands) = 3 in the illustration
n = Number of observations for each sample (brand) = 6 in the illustration
Rj = Sum of ranks of the jth sample (brand)

It may be noted that this F is different from Fisher's F defined in Chapter VII on Statistical Distributions.
Although statistical tables exist for the sampling distribution of Friedman's F, they are not readily available for various values of n and k. However, the sampling distribution of F can be approximated by a chi-square distribution with k - 1 degrees of freedom. The chi-square distribution table shows that, with 3 - 1 = 2 degrees of freedom, the chi-square value at the 5% level of significance is 5.99.

If the calculated value of F is less than or equal to the tabulated value of chi-square (at the 5% level of significance), the growth rates of the brands are considered statistically the same. In other words, there is no significant difference in the growth rates of the brands. In case the calculated value exceeds the tabulated value, the difference is termed significant.
For the above example, the following null hypothesis is framed:

H0 : There is no significant difference in the growth rates of the three brands of refrigerators.

For the calculation of F, the following table is prepared. The figures in brackets indicate the rank of the growth of a brand in a particular year - the lowest growth is ranked 1 and the highest growth is ranked 3.
Growth Rates of Refrigerators of Different Brands, with Ranks

Year   Brand A   Brand B   Brand C   Total Ranks (Row Total)
1      15 (2)    14 (1)    32 (3)    6
2      18 (2)    15 (1)    30 (3)    6
3      15 (2)    11 (1)    27 (3)    6
4      13 (1)    19 (2)    38 (3)    6
5      20 (2)    18 (1)    33 (3)    6
6      27 (3)    20 (1)    22 (2)    6

Total Ranks (Rj)   12        7        17       36
With reference to the above table, Friedman's test amounts to testing that the sums of ranks (Rj) of the various brands are all equal.

The value of F is calculated as:

F = [12 / (6 x 3 x (3 + 1))] x (12^2 + 7^2 + 17^2) - 3 x 6 x (3 + 1)
  = (12 / 72) x 482 - 72
  = 80.33 - 72 = 8.33

It is observed that the calculated value of the F statistic is greater than the tabulated value of chi-square (5.99 at the 5% level of significance and 2 d.f.). Hence, the hypothesis that there is no significant difference in the growth rates of the three brands is rejected.
Therefore, we conclude that there is a significant difference in the growth rates of the three brands of refrigerators during the period under study. The significant difference is driven by the markedly higher growth rate of brand C.

In the above example, if the data were given for six showrooms instead of six years, the test would have remained the same.
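The same value can be reproduced with scipy's friedmanchisquare, using the growth rates as they appear in the ranked table above:

```python
from scipy.stats import friedmanchisquare

# Growth rates by year, taken from the ranked table above
brand_a = [15, 18, 15, 13, 20, 27]
brand_b = [14, 15, 11, 19, 18, 20]
brand_c = [32, 30, 27, 38, 33, 22]

stat, p = friedmanchisquare(brand_a, brand_b, brand_c)
print(round(stat, 2), round(p, 4))   # 8.33 0.0155 - reject H0 at the 5% level
```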
10 Test for Significance of Spearman's Rank Correlation

In Chapter VIII on Simple Correlation and Regression Analysis, Spearman's rank correlation has been discussed in section 8.4 and is defined as

rs = 1 - [6 x Sum(di^2)] / [n(n^2 - 1)]
where n is the number of pairs of ranks given to individuals or units or objects, and di is the difference between the two ranks given to the ith individual / unit / object.

There is no separate statistic to be calculated for testing the significance of the rank correlation. The calculated value of rs is itself compared with the tabulated value of rs, given in the Appendix, at the 5% or 1% level of significance. If the calculated value is more than the tabulated value, the null hypothesis that there is no correlation between the two rankings is rejected. Here, the hypotheses are as follows:

H0 : rho_s = 0
H1 : rho_s is not equal to 0

In example 8.6 of Chapter VIII on Correlation and Regression, the rank correlation between priorities of Job Commitment Drivers among executives from India and Asia Pacific was found to be 0.9515. Comparing this value with the tabulated value of rs at n = 10 and the 5% level of significance, which is 0.6364, we find that the calculated value is more than the tabulated value, and hence we reject the null hypothesis that there is no correlation between the priorities of Job Commitment Drivers among executives from India and Asia Pacific.
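The defining formula translates directly into code. The rankings below are hypothetical (the actual data of example 8.6 are in the book), chosen so that rs works out to the same value, 0.9515:

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation from two sets of ranks (no ties)."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical rankings of 10 job-commitment drivers by two groups of executives
india   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
asiapac = [2, 1, 4, 3, 5, 7, 6, 8, 10, 9]

rs = spearman_rs(india, asiapac)
print(round(rs, 4))   # 0.9515, above the tabulated critical value 0.6364 for n = 10
```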
10.1 Test for Significance of Spearman's Rank Correlation for Large Sample Size

If the number of pairs of ranks is more than 30, then the distribution of the rank correlation rs, under the null hypothesis that rho_s = 0, can be approximated by a normal distribution with mean 0 and standard deviation 1/sqrt(n - 1). In symbolic form:

for n > 30,  rs ~ N(0, 1/sqrt(n - 1))    (6)

For example, in case study 4 of this chapter, relating to the rankings of the top 50 CEOs in different cities, n = 50. It may be verified that the rank correlation between the rankings in Mumbai and Bangalore is 0.544. Thus the value of z, the standard normal variable, is

z = 0.544 / (1/7) = 0.544 x 7 = 3.808

which is more than 1.96, the critical value of z at the 5% level of significance. Thus, it may be concluded that the rankings in Mumbai and Bangalore are significantly correlated.
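In code, expression (6) amounts to multiplying rs by sqrt(n - 1):

```python
import math

# Large-sample significance test for Spearman's rs (n > 30)
n, rs = 50, 0.544                  # Mumbai vs Bangalore rankings of 50 CEOs
z = rs * math.sqrt(n - 1)          # 0.544 * 7 = 3.808
print(round(z, 3))                 # 3.808, beyond 1.96, so the correlation is significant
```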
11 Testing Equality of Several Rank Correlations

Sometimes more than two ranks are given to an individual / entity / object, and we are required to test whether all the rankings are equal. Consider the following situation, wherein some cities have been ranked as per three criteria.
Illustration 6

As per a study published in the Times of India dated 4th September 2006, the rankings of ten cities as per earning, investing and living criteria are as follows:

City         Earning   Investing   Living
Bangalore       2          6          1
Coimbatore      5          1          5
Surat           1          2         10
Mumbai          7          5          2
Pune            3          4          7
Chennai         9          3          3
Delhi           4          7          8
Hyderabad       8          8          6
Kolkata        10         10          4
Ahmedabad       6          9          9

(Source: City Skyline of India 2006, published by Indicus Analytics)
Here, the test statistic is

F = s1^2 / s2^2 ~ F{(k - 1), k(n - 1)}    (7)

which is distributed as Fisher's F with {k - 1, k(n - 1)} d.f., where k is the number of cities, equal to 10 in the illustration, and n is the number of ranking criteria, equal to 3 in the illustration.

The calculations for s1^2 and s2^2 are indicated below:

s1^2 = sd / [n(k - 1)]    (8)

where sd is the sum of squares of the differences between the mean rank of each city and the overall mean rank of all cities, and

s2^2 = (s^2 - sd/n) / [k(n - 1)]    (9)

where

s^2 = nk(k^2 - 1) / 12    (10)
The required calculations are illustrated below.

City         Earning  Investing  Living  Sum of City  Mean of City  Difference from     Squared Difference from
                                         Rankings     Rankings      Grand Mean Ranking  Grand Mean Ranking
Bangalore       2        6         1         9           3.00           -2.50               6.25
Coimbatore      5        1         5        11           3.67           -1.83               3.36
Surat           1        2        10        13           4.33           -1.17               1.36
Mumbai          7        5         2        14           4.67           -0.83               0.69
Pune            3        4         7        14           4.67           -0.83               0.69
Chennai         9        3         3        15           5.00           -0.50               0.25
Delhi           4        7         8        19           6.33            0.83               0.69
Hyderabad       8        8         6        22           7.33            1.83               3.36
Kolkata        10       10         4        24           8.00            2.50               6.25
Ahmedabad       6        9         9        24           8.00            2.50               6.25
Total                                      165          Mean = 5.5                  Total = 29.2
n = 3, k = 10

s^2 = 3 x 10 x (100 - 1) / 12 = 247.5

sd = 29.2

s1^2 = 29.2 / [3 x (10 - 1)] = 1.08

s2^2 = (247.5 - 29.2/3) / [10 x (3 - 1)] = 11.89

Since s2^2 > s1^2, we take F = s2^2 / s1^2. It may be recalled that in Fisher's F ratio of two sample variances, the greater one is taken in the numerator, and the d.f. of F are taken accordingly.

F = 11.89 / 1.08 = 11.00

The tabulated value of F at the 5% level of significance is 2.25. Since the calculated value is more than the tabulated value, we reject the null hypothesis, and conclude that the rankings on the given criteria are not equal.
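The whole computation, from the rank table to F, can be sketched as follows (sd comes to about 29.17 before the rounding to 29.2 used in the text):

```python
# Ranks of the ten cities on the three criteria (Earning, Investing, Living)
ranks = {
    "Bangalore": (2, 6, 1), "Coimbatore": (5, 1, 5), "Surat":     (1, 2, 10),
    "Mumbai":    (7, 5, 2), "Pune":       (3, 4, 7), "Chennai":   (9, 3, 3),
    "Delhi":     (4, 7, 8), "Hyderabad":  (8, 8, 6), "Kolkata":   (10, 10, 4),
    "Ahmedabad": (6, 9, 9),
}
k, n = len(ranks), 3                       # 10 cities, 3 ranking criteria
grand_mean = (k + 1) / 2                   # 5.5

sd = sum((sum(r) / n - grand_mean) ** 2 for r in ranks.values())
s_sq  = n * k * (k ** 2 - 1) / 12          # formula (10): 247.5
s1_sq = sd / (n * (k - 1))                 # formula (8):  ~1.08
s2_sq = (s_sq - sd / n) / (k * (n - 1))    # formula (9):  ~11.89
F = s2_sq / s1_sq                          # larger variance in the numerator
print(round(F, 1))                         # 11.0, well above the tabulated 2.25
```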
12 Kendall's Rank Correlation Coefficient

In addition to Spearman's correlation coefficient, there exists one more rank correlation coefficient, introduced by Maurice Kendall in 1938. Like Spearman's coefficient, it is used when the data are available in ordinal form. It is more popular as Kendall's Tau, denoted by the Greek letter tau (corresponding to t in English). It measures the extent of agreement or association between rankings, and is defined as

tau = (nc - nd) / (nc + nd)

where,
nc : Number of concordant pairs of rankings
nd : Number of discordant pairs of rankings

The maximum value of nc + nd is the total number of pairs of ranks given by two different persons, or by the same person on two different criteria. For example, if the number of observations is, say, 3 (call them a, b and c), then the pairs of observations will be: ab, ac and bc. Similarly, if there are four observations, the possible pairs are six in number, as follows:

ab, ac, ad, bc, bd, cd

It may be noted that the more the number of concordant pairs, the larger the numerator, and the larger the value of the coefficient, indicating a higher degree of consistency between the rankings.

A pair of subjects, i.e. persons or units, is said to be concordant if the subject that ranks higher on one variable also ranks higher on the other variable. On the other hand, if the subject that ranks higher on one variable ranks lower on the other variable, the pair of subjects is said to be discordant.

The concepts of concordant and discordant pairs, and the calculation of tau, are explained through the example given below.
As per a study published in the Times of India dated 4th September 2006, several cities were ranked as per the criteria of Earning, Investing and Living. The full data is given on the CD, in the chapter relating to Simple Correlation and Regression. An extract from the full table is given below.
City      Ranking as per Earning   Ranking as per Investing
Chennai             3                         1
Delhi               1                         3
Kolkata             4                         4
Mumbai              2                         2
We rearrange the table by arranging cities as per Earning ranking
City Ranking as per Earning Ranking as per Investing
Delhi 1 3
Mumbai 2 2
Chennai 3 1
Kolkata 4 4
Now, we form all possible pairs of cities. In this case, the total number of possible pairs is 4C2 = 6, viz. DM, DC, DK, MC, MK and CK.

The status of each pair, i.e. whether it is concordant or discordant, along with the reasoning, is given in the following table.

Pair   Concordant (C) / Discordant (D)   Reasoning
DM     D     Because the pairs 1,2 and 3,2 are in opposite order
DC     D     Because the pairs 1,3 and 3,1 are opposite of each other
DK     C     Because the pairs 1,4 and 3,4 are in ascending order (same direction)
MC     D     Because the pairs 2,3 and 2,1 are in opposite order
MK     C     Because the pairs 2,4 and 2,4 are ascending and of the same order
CK     C     Because the pairs 3,4 and 1,4 are in ascending order (same direction)
It may be noted that for a pair of subjects (cities in the above case), when the subject that ranks higher on one variable also ranks higher on the other variable, the pair is said to be concordant. On the other hand, if a subject ranks higher on one variable and lower on the other variable, the pair is said to be discordant.

From the above table, we note that:

nc = 3 and nd = 3

Thus, Kendall's coefficient of correlation is tau = (3 - 3) / (3 + 3) = 0.
As regards the possible values and interpretation of tau, the following may be noted:

If the association between the two rankings is perfect (i.e., the two rankings are the same), the coefficient has value 1.
If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other), the coefficient has value -1.
In other cases, the value lies between -1 and 1, and increasing values imply increasing association between the rankings. If the rankings are completely independent, the coefficient has value 0.

Incidentally, Spearman's correlation and Kendall's correlation are not directly comparable, and their values can differ for the same data.

Kendall's Tau also measures the strength of association in cross tabulations when both variables are measured at the ordinal level, and is a standard measure of association for cross-tabulated data available in ordinal form.
13 Sign Test
The sign test is a nonparametric statistical procedure for fulfilling the following objectives:

(i) To identify preference for one of two brands of a product, like tea, soft drink, mobile phone or TV, and/or a service, like a cellular company or internet service provider.
(ii) To determine whether a change being contemplated or introduced is found favourable.

The data collected in such situations are of the type + (preference for one) or - (preference for the other). Since the data are collected in terms of plus and minus signs, the test is called the Sign Test. A data set with 10 observations is of the type:

+, +, -, +, -, -, +, +, +, -
Illustration:

The Director of a management institute wanted to have an idea of the opinion of the students about the new time schedule for the classes: 8.00 a.m. to 2.00 p.m.

He randomly selected a representative sample of 20 students and recorded their preferences. The data were collected in the following form:

Student No.   In Favour of the Proposed Option: 1,   Sign
              Opposed to the Option: 2               (+ for 1, - for 2)
1    1    +
2    1    +
3    2    -
4    1    +
5    2    -
6    2    -
7    2    -
8    2    -
9    1    +
10   1    +
11   2    -
12   1    +
13   2    -
14   2    -
15   1    +
16   1    +
17   1    +
18   1    +
19   1    +
20   1    +
Here the hypotheses to be tested are:

H0 : p = 0.50
H1 : p is not equal to 0.50

Under the assumption that H0 is true, i.e. p = 0.50, the number of + signs follows a binomial distribution with p = 0.50.

Let x denote the number of + signs. It may be noted that if the value of x is either very low or very high, the null hypothesis will be rejected in favour of the alternative H1 : p not equal to 0.50. For alpha = 0.05, the low cut-off is the largest value of x for which the lower-tail probability is less than 0.025, and the high cut-off is the smallest value of x for which the upper-tail probability is less than 0.025.
If the alternative is H1 : p < 0.50, then the value of x has to be low enough to reject the null hypothesis in favour of H1 : p < 0.50. If the alternative is H1 : p > 0.50, then the value of x has to be high enough to reject the null hypothesis in favour of H1 : p > 0.50.

With a sample size of n = 20, one can refer to the table below, showing the probabilities for all possible values of the binomial probability distribution with p = 0.5.
Number of + Signs   Probability
0     0.0000
1     0.0000
2     0.0002
3     0.0011
4     0.0046
5     0.0148
6     0.0370
7     0.0739
8     0.1201
9     0.1602
10    0.1762
11    0.1602
12    0.1201
13    0.0739
14    0.0370
15    0.0148
16    0.0046
17    0.0011
18    0.0002
19    0.0000
20    0.0000
The binomial probability distribution shown above can be used to provide the decision rule for any sign test up to a sample size of n = 20. With the null hypothesis p = 0.50 and the sample size n, the decision rule can be established for any level of significance. In addition, by considering the probabilities in only the lower or upper tail of the binomial probability distribution, we can develop rejection rules for one-tailed tests.

The above table gives the probability of the number of plus signs under the assumption that H0 is true, and is, therefore, the appropriate sampling distribution for the hypothesis test. This sampling distribution is used to determine a criterion for rejecting H0. This approach is similar to the method used for developing rejection criteria for hypothesis testing given in Chapter 11 on Statistical Inference.

For example, let alpha = 0.05 and the test be two-sided. In this case the alternative hypothesis is H1 : p not equal to 0.50, and we would have a critical or rejection region of area 0.025 in each tail of the distribution. Starting at the lower end of the distribution, we see that the probability of obtaining zero, one, two, three, four or five plus signs is 0.0000 + 0.0000 + 0.0002 + 0.0011 + 0.0046 + 0.0148 = 0.0207. Note that we stop at 5 plus signs because adding the probability of six plus signs would make the area in the lower tail equal to 0.0207 + 0.0370 = 0.0577, which substantially exceeds the desired area of 0.025. At the upper
end of the distribution, we find the same probability of 0.0207 corresponding to 15, 16, 17, 18, 19 or 20 plus signs. Thus, the closest we can come to alpha = 0.05 without exceeding it is 0.0207 + 0.0207 = 0.0414. We therefore adopt the following rejection criterion:

Reject H0 if the number of + signs is less than 6 or greater than 14.

Since the number of + signs in the given illustration is 12, we cannot reject the null hypothesis; thus the data reveal that the students are not against the option.

It may be noted that Table T3 for the binomial distribution does not provide probabilities for sample sizes greater than 20. In such cases, we can use the large-sample normal approximation of binomial probabilities to determine the appropriate rejection rule for the sign test.
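The same test is a one-liner with scipy's exact binomial test (binomtest is the scipy name, available from scipy 1.7), and the cumulative probabilities reproduce the 0.0207 tail used above:

```python
from scipy.stats import binom, binomtest

n, plus = 20, 12     # 20 students, 12 plus signs

# Exact two-sided sign test: under H0 the number of + signs is Binomial(20, 0.5)
p_value = binomtest(plus, n, 0.5, alternative="two-sided").pvalue
print(round(p_value, 3))                 # 0.503 - cannot reject H0

# The lower-tail area used in the text: P(X <= 5)
print(round(binom.cdf(5, n, 0.5), 4))    # 0.0207
```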
13.1 Sign Test for Paired Data

The sign test can also be used to test the hypothesis that there is "no difference" between the two distributions of continuous variables x and y. Let p = P(x > y); then we can test the null hypothesis

H0 : p = 0.50

This hypothesis implies that, given a random pair of measurements (xi, yi), xi and yi are equally likely to be larger than the other.

To perform the test, we first collect independent pairs of sample data from the two populations as (x1, y1), (x2, y2), . . ., (xn, yn). We omit pairs for which there is no difference, so that we may have a reduced sample of pairs. Now, let r be the number of pairs for which xi - yi > 0. Assuming that H0 is true, r follows a binomial distribution with p = 0.5. It may be noted that if the value of r is either very low or very high, the null hypothesis will be rejected in favour of the alternative H1 : p not equal to 0.50. For alpha = 0.05, the low cut-off is the largest value of r for which the lower-tail probability is less than 0.025, and the high cut-off is the smallest value of r for which the upper-tail probability is less than 0.025.

The left-tail value is the p-value for the alternative H1 : p < 0.50, which means that the y measurements tend to be higher. The right-tail value, computed as the total of the probabilities of values greater than or equal to r, is the p-value for the alternative H1 : p > 0.50, which means that the x measurements tend to be higher. For a two-sided alternative H1 : p not equal to 0.50, the p-value is twice the smaller tail value.

Illustration
The HRD Chief of an organisation wants to assess whether there is any significant difference in the marks obtained by twelve trainee officers in the papers on Indian and Global Perspectives conducted after the induction training.

(72, 70) (82, 79) (78, 69) (80, 74) (64, 66) (78, 75)
(85, 86) (83, 77) (83, 88) (84, 90) (78, 72) (84, 82)

We convert the above data into the following tabular form, where + indicates xi > yi, i.e. the mark in Indian Perspectives exceeds that in Global Perspectives.
Trainee Officer   Sign
1    +
2    +
3    +
4    +
5    -
6    +
7    -
8    +
9    -
10   -
11   +
12   +
It may be noted that the number of + signs is 8.

Here, we test the null hypothesis that there is "no significant difference" in the scores in the two papers, against the two-sided alternative that there is a significant difference. In symbolic notation:

H0 : m1 = m2
H1 : m1 is not equal to m2

Number of + Signs   Probability
0     0.0002
1     0.0029
2     0.0161
3     0.0537
4     0.1208
5     0.1934
6     0.2256
7     0.1934
8     0.1208
9     0.0537
10    0.0161
11    0.0029
12    0.0002
Let the level of significance be alpha = 0.05. In a two-sided test, we would have a rejection region of approximately 0.025 in each tail of the distribution.

From the above table, we note that the probability of obtaining zero, one or two plus signs is 0.0002 + 0.0029 + 0.0161 = 0.0192. Since this is less than alpha/2, i.e. 0.025, H0 will be rejected if the number of plus signs is 2 or fewer. Similarly, the probability of getting 10, 11 or 12 plus signs is 0.0161 + 0.0029 + 0.0002 = 0.0192; since this is also less than 0.025, H0 will be rejected if the number of plus signs is 10 or more.

Thus H0 will be rejected, at the 5% level of significance, if the number of + signs is less than 3 or greater than 9. We therefore follow this rejection criterion:

Reject H0 if the number of + signs is less than 3 or greater than 9.

Since, in our case, the number of plus signs is only 8, we cannot reject the null hypothesis. Thus, we conclude that there is no significant difference between the marks obtained in the papers on Indian and Global Perspectives.
If we want to test that the average marks obtained in the paper on Indian Perspectives are more than the average marks in the paper on Global Perspectives, the null hypothesis is H0 : m1 = m2 and the alternative hypothesis is H1 : m1 > m2. Thus, the test is one-sided. If the level of significance is 0.05, the entire critical region lies in the right tail. The probability of getting more than 8 plus signs is 0.0537 + 0.0161 + 0.0029 + 0.0002 = 0.0729, which is more than 0.05, and, therefore, the null hypothesis cannot be rejected.
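With the actual marks, the paired sign test can be run as follows (the exact two-sided p-value comes to about 0.39, agreeing with the decision above):

```python
from scipy.stats import binomtest

indian  = [72, 82, 78, 80, 64, 78, 85, 83, 83, 84, 78, 84]
global_ = [70, 79, 69, 74, 66, 75, 86, 77, 88, 90, 72, 82]

# Drop ties, count pairs with x > y (plus signs)
diffs = [x - y for x, y in zip(indian, global_) if x != y]
plus = sum(d > 0 for d in diffs)          # 8

result = binomtest(plus, len(diffs), 0.5, alternative="two-sided")
print(plus, round(result.pvalue, 4))      # 8 0.3877 - no significant difference
```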
Statistical analyses using SPSS
Introduction
This page shows how to perform a number of statistical tests using SPSS. Each section gives a brief description of the aim of the statistical test, when it is used, an example showing the SPSS commands and SPSS (often abbreviated) output, with a brief interpretation of the output. You can see the page Choosing the Correct Statistical Test for a table that shows an overview of when each test is appropriate to use. In deciding which test is appropriate, it is important to consider the type of variables that you have (i.e., whether your variables are categorical, ordinal or interval, and whether they are normally distributed); see What is the difference between categorical, ordinal and interval variables? for more information on this.
About the hsb data file
Most of the examples on this page use a data file called hsb2 (high school and beyond). This data file contains 200 observations from a sample of high school students with demographic information about the students, such as their gender (female), socio-economic status (ses) and ethnic background (race). It also contains a number of scores on standardized tests, including tests of reading (read), writing (write), mathematics (math) and social studies (socst). You can get the hsb data file by clicking on hsb2.
One sample t-test
A one sample t-test allows us to test whether a sample mean (of a normally distributed interval variable) significantly differs from a hypothesized value. For example, using the hsb2 data file, say we wish to test whether the average writing score (write) differs significantly from 50. We can do this as shown below.

t-test /testval = 50 /variable = write.
The mean of the variable write for this particular sample of students is 52.775, which is statistically significantly different from the test value of 50. We would conclude that this group of students has a significantly higher mean on the writing test than 50.
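For readers working outside SPSS, the same test takes one call in Python. The scores below are hypothetical stand-ins (the hsb2 write variable itself is in the downloadable data file):

```python
from scipy.stats import ttest_1samp

# Hypothetical writing scores standing in for the hsb2 "write" variable
scores = [52, 49, 57, 61, 44, 55, 58, 50, 62, 47, 59, 53]

t_stat, p_value = ttest_1samp(scores, popmean=50)
# t is positive here: the sample mean (about 53.9) lies above the test value 50
```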
One sample median test

A one sample median test allows us to test whether a sample median differs significantly from a hypothesized value. We will use the same variable, write, as in the one sample t-test example above, but we do not need to assume that it is interval and normally distributed (we only need to assume that write is an ordinal variable). However, we are unaware of how to perform this test in SPSS.
Binomial test

A one sample binomial test allows us to test whether the proportion of successes on a two-level categorical dependent variable significantly differs from a hypothesized value. For example, using the hsb2 data file, say we wish to test whether the proportion of females (female) differs significantly from 50%, i.e., from .5. We can do this as shown below.

npar tests /binomial (.5) = female.

The results indicate that there is no statistically significant difference (p = .229). In other words, the proportion of females in this sample does not significantly differ from the hypothesized value of 50%.
Chi-square goodness of fit
A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical
variable differ from hypothesized proportions. For example, let's suppose that we believe that the
general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White
folks. We want to test whether the observed proportions from our sample differ significantly from these hypothesized proportions.

npar test /chisquare = race /expected = 10 10 10 70.
These results show that the racial composition of our sample does not differ significantly from the hypothesized values that we supplied (chi-square with three degrees of freedom = 5.029, p = .170).
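The same goodness-of-fit test in Python, with observed counts consistent with the output quoted above (24, 11, 20 and 145 reproduce chi-square = 5.029 for n = 200; treat them as illustrative):

```python
from scipy.stats import chisquare

# Observed race counts (Hispanic, Asian, African American, White), n = 200
observed = [24, 11, 20, 145]
n = sum(observed)
expected = [0.10 * n, 0.10 * n, 0.10 * n, 0.70 * n]   # 20, 20, 20, 140

chi2, p = chisquare(observed, f_exp=expected)
print(round(chi2, 3), round(p, 3))   # 5.029 0.17
```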
Two independent samples t-test
An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups. For example, using the hsb2 data file, say we wish to test whether the mean of write is the same for males and females.

t-test groups = female(0 1) /variables = write.

The results indicate that there is a statistically significant difference between the mean writing scores for males and females (t = -3.734, p = .000). In other words, females have a statistically significantly higher mean score on writing (54.99) than males (50.12).
Wilcoxon-Mann-Whitney test
The Wilcoxon-Mann-Whitney test is a non-parametric analog of the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal). You will notice that the SPSS syntax for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We will use the same data file (the hsb2 data file) and the same variables in this example as we did in the
http://www.ats.ucla.edu/stat/spss/whatstat/whatstat.htm#hsbhttp://www.ats.ucla.edu/stat/spss/whatstat/whatstat.htm#hsbhttp://www.ats.ucla.edu/stat/spss/whatstat/whatstat.htm#hsbhttp://www.ats.ucla.edu/stat/spss/whatstat/whatstat.htm#hsb -
7/25/2019 Advanced Data Analysis Binder 2015
32/165
32
independent t-test example above, and will not assume that write, our dependent variable, is normally distributed.

npar test /m-w = write by female(0 1).

The results suggest that there is a statistically significant difference between the underlying distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001).
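A scipy sketch of the same test, with hypothetical score lists standing in for the 200 hsb2 observations:

```python
from scipy.stats import mannwhitneyu

# Hypothetical writing scores for two independent groups
males   = [45, 52, 48, 57, 50, 41, 55, 49]
females = [58, 54, 61, 50, 63, 57, 52, 60]

u_stat, p_value = mannwhitneyu(males, females, alternative="two-sided")
# A small p-value would indicate that the two underlying distributions differ
```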
Chi-square test
A chi-square test is used when you want to see if there is a relationship between two categorical variables. In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command to obtain the test statistic and its associated p-value. Using the hsb2 data file, let's see if there is a relationship between the type of school attended (schtyp) and students' gender (female). Remember that the chi-square test assumes that the expected value for each cell is five or higher. This assumption is easily met in the examples below. However, if this assumption is not met in your data, please see the section on Fisher's exact test below.

crosstabs /tables = schtyp by female /statistic = chisq.
These results indicate that there is no statistically significant relationship between the type of school
attended and gender (chi-square with one degree of freedom = 0.047, p = 0.828).
Let's look at another example, this time looking at the relationship between gender (female) and socio-economic status (ses). The point of this example is that one (or both) variables may have more than two levels, and that the variables do not have to have the same number of levels. In this example, female has two levels (male and female) and ses has three levels (low, medium and high).

crosstabs /tables = female by ses /statistic = chisq.
Again we find that there is no statistically significant relationship between the variables (chi-squarewith two degrees of freedom = 4.577, p = 0.101).
Fisher's exact test
The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but the Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. In SPSS, unless you have the SPSS Exact Test Module, you can only perform a Fisher's exact test on a 2x2 table, and these results are presented by default. Please see the results from the chi-square example above.
One-way ANOVA
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable
(with two or more categories) and a normally distributed interval dependent variable, and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable. For example, using the hsb2 data file, say we wish to test whether the mean of write differs between the three program types (prog). The command for this test would be:
oneway write by prog.
The mean of the dependent variable differs significantly among the levels of program type. However, we do not know if the difference is between only two of the levels or all three of the levels. (The F test for the Model is the same as the F test for prog because prog was the only variable entered into the model. If other variables had also been entered, the F test for the Model would have been different from that for prog.) To see the mean of write for each level of program type:

means tables = write by prog.
From this we can see that the students in the academic program have the highest mean writing score,
while students in the vocational program have the lowest.
Kruskal Wallis test
The Kruskal Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a generalized form of the Mann-Whitney test, since it permits two or more groups. We will use the same data file as in the one-way ANOVA example above (the hsb2 data file) and the same variables, but we will not assume that write is a normally distributed interval variable.

npar tests /k-w = write by prog (1,3).
If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different value of chi-square. With or without ties, the results indicate that there is a statistically significant difference among the three types of programs.
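For readers who want to see the mechanics, here is a rough pure-Python sketch of the Kruskal-Wallis H statistic on illustrative data (not the hsb2 scores); tied values receive average ranks, as described above, though no tie-correction factor is applied in this simple version.

```python
# Kruskal-Wallis H: rank the pooled data, then compare group rank sums.

def rank_with_ties(values):
    """Average 1-based ranks for a list, handling ties."""
    indexed = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(indexed):
        j = i
        while j + 1 < len(indexed) and values[indexed[j + 1]] == values[indexed[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[indexed[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis(groups):
    pooled = [x for g in groups for x in g]
    ranks = rank_with_ties(pooled)
    n = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        r_sum = sum(ranks[start:start + len(g)])  # rank sum for this group
        h += r_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

print(round(kruskal_wallis([[50, 55, 60], [62, 66, 70], [45, 48, 51]]), 3))
```

H is referred to a chi-square distribution with (number of groups - 1) degrees of freedom, which is the chi-square value SPSS reports for this test.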
Paired t-test
A paired (samples) t-test is used when you have two related observations (i.e., two observations per subject) and you want to see if the means of these two normally distributed interval variables differ from one another. For example, using the hsb2 data file, we will test whether the mean of read is equal to the mean of write.
t-test pairs = read with write (paired).
These results indicate that the mean of read is not statistically significantly different from the mean of write (t = -0.867, p = 0.387).
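The paired t statistic is simply a one-sample t test on the per-subject differences. A small pure-Python sketch on made-up read/write pairs (not the actual hsb2 data):

```python
import math

def paired_t(x, y):
    """Return (t, df) for paired samples x and y."""
    n = len(x)
    d = [a - b for a, b in zip(x, y)]          # per-subject differences
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)  # sample variance
    t = mean_d / math.sqrt(var_d / n)          # mean difference / its SE
    return t, n - 1

read = [52, 57, 47, 60, 55]
write = [50, 59, 44, 62, 52]
t, df = paired_t(read, write)
print(round(t, 3), df)
```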
Wilcoxon signed rank sum test
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the two variables is interval and normally distributed (but you do assume the difference is ordinal). We will use the same example as above, but we will not assume that the difference between read and write is interval and normally distributed.
npar test/wilcoxon = write with read (paired).
The results suggest that there is not a statistically significant difference between read and write.
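The signed-rank statistic itself is easy to compute by hand. A pure-Python sketch on illustrative pairs: zero differences are dropped, and tied absolute differences receive average ranks.

```python
# Wilcoxon signed-rank sums: rank |differences|, then sum the ranks of
# positive and of negative differences separately.

def wilcoxon_w(x, y):
    """Return (W_plus, W_minus): rank sums of positive/negative differences."""
    d = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    abs_d = sorted((abs(v), i) for i, v in enumerate(d))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(abs_d):
        j = i
        while j + 1 < len(abs_d) and abs_d[j + 1][0] == abs_d[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions for the tie block
        for k in range(i, j + 1):
            ranks[abs_d[k][1]] = avg
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    return w_plus, w_minus

write = [50, 59, 44, 62, 52, 48]
read = [52, 57, 47, 60, 55, 48]
print(wilcoxon_w(write, read))
```

SPSS converts the smaller of the two rank sums to the Z statistic and p-value shown in its output; the two sums always total n(n+1)/2 for the n non-zero differences.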
If you believe the differences between read and write are not ordinal but can merely be classified as positive and negative, then you may want to consider a sign test in lieu of the signed rank test. Again, we will use the same variables in this example and assume that this difference is not ordinal.
npar test/sign = read with write (paired).
We conclude that no statistically significant difference was found (p=.556).
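The sign test reduces to a binomial calculation: under the null hypothesis, each non-zero difference is equally likely to be positive or negative, so the count of positive signs is Binomial(n, 0.5). A pure-Python sketch on illustrative data (not the hsb2 file):

```python
import math

def sign_test(x, y):
    """Return (n_pos, n_neg, two-sided exact p-value)."""
    d = [a - b for a, b in zip(x, y) if a != b]  # ties are dropped
    n_pos = sum(1 for v in d if v > 0)
    n = len(d)
    k = min(n_pos, n - n_pos)
    # Two-sided exact p: double the one-tailed cumulative binomial probability
    p = 2 * sum(math.comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return n_pos, n - n_pos, min(p, 1.0)

read = [52, 57, 47, 60, 55, 49, 51, 58]
write = [50, 59, 44, 62, 58, 51, 54, 61]
print(sign_test(read, write))
```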
McNemar test
You would perform McNemar's test if you were interested in the marginal frequencies of two binary outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-control study) or two outcome variables from a single group. Continuing with the hsb2 dataset used in several examples above, let us create two binary outcomes in our dataset: himath and hiread. These outcomes can be considered in a two-way contingency table. The null hypothesis is that the proportion of students in the himath group is the same as the proportion of students in the hiread group (i.e., that the contingency table is symmetric).
compute himath = (math>60).
compute hiread = (read>60).
execute.
crosstabs/tables=himath BY hiread/statistic=mcnemar/cells=count.
McNemar's chi-square statistic suggests that there is not a statistically significant difference between the proportion of students in the himath group and the proportion of students in the hiread group.
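The classic McNemar statistic depends only on the two discordant cells of the 2x2 table (pairs high on one variable but not the other). A minimal sketch; the counts 15 and 20 are illustrative, and note that for small samples SPSS may report an exact binomial p-value instead of this chi-square approximation.

```python
# McNemar's chi-square from the discordant cells b and c of a 2x2 table:
#   b = himath only, c = hiread only (concordant cells do not enter).

def mcnemar_chi2(b, c, correction=True):
    """Classic McNemar statistic; optional continuity correction."""
    num = (abs(b - c) - 1) ** 2 if correction else (b - c) ** 2
    return num / (b + c)

print(round(mcnemar_chi2(15, 20), 4))
```

The statistic is referred to a chi-square distribution with 1 degree of freedom.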
One-way repeated measures ANOVA
You would perform a one-way repeated measures analysis of variance if you had one categorical independent variable and a normally distributed interval dependent variable that was repeated at least twice for each subject. This is the equivalent of the paired samples t-test, but allows for two or more levels of the categorical variable. It tests whether the mean of the dependent variable differs by the categorical variable. We have an example data set called rb4wide, which is used in Kirk's book Experimental Design. In this data set, y is the dependent variable, a is the repeated measure and s is the variable that indicates the subject number.
glm y1 y2 y3 y4/wsfactor a(4).
http://www.ats.ucla.edu/stat/spss/whatstat/rb4wide.sav
You will notice that this output gives four different p-values. The output labeled "sphericity assumed" is the p-value (0.000) that you would get if you assumed compound symmetry in the variance-covariance matrix. Because that assumption is often not valid, the three other p-values offer various corrections (the Huynh-Feldt, H-F; Greenhouse-Geisser, G-G; and Lower-bound). No matter which p-value you use, our results indicate that we have a statistically significant effect of a at the .05 level.
Factorial ANOVA
A factorial ANOVA has two or more categorical independent variables (either with or without the interactions) and a single normally distributed interval dependent variable. For example, using the hsb2 data file, we will look at writing scores (write) as the dependent variable and gender (female) and socio-economic status (ses) as independent variables, and we will include an interaction of female by ses. Note that in SPSS, you do not need to have the interaction term(s) in your data set. Rather, you can have SPSS create it/them temporarily by placing an asterisk between the variables that will make up the interaction term(s).
glm write by female ses.
These results indicate that the overall model is statistically significant (F = 5.666, p = 0.00). The variables female and ses are also statistically significant (F = 16.595, p = 0.000 and F = 6.611, p = 0.002, respectively). However, the interaction between female and ses is not statistically significant (F = 0.133, p = 0.875).
Friedman test
You perform a Friedman test when you have one within-subjects independent variable with two or more levels and a dependent variable that is not interval and normally distributed (but at least ordinal). We will use this test to determine if there is a difference in the reading, writing and math scores. The null hypothesis in this test is that the distributions of the ranks of each type of score (i.e., reading, writing and math) are the same. To conduct a Friedman test, the data need to be in a long format. SPSS handles this for you, but in other statistical packages you will have to reshape the data before you can conduct this test.
npar tests/friedman = read write math.
Friedman's chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically significant.
Hence, there is no evidence that the distributions of the three types of scores are different.
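The Friedman statistic ranks each subject's scores within-subject and compares the column rank sums. A pure-Python sketch on illustrative scores (not the hsb2 read/write/math columns); this simple version does not average tied ranks within a row.

```python
# Friedman chi-square: within each subject (row), rank the k condition
# scores, then compare the k column rank sums.

def friedman_chi2(rows):
    """rows: list of per-subject score lists, one value per condition."""
    k = len(rows[0])
    n = len(rows)
    col_rank_sums = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: row[j])  # columns by score
        for rank, j in enumerate(order, start=1):
            col_rank_sums[j] += rank
    s = sum(r ** 2 for r in col_rank_sums)
    return 12.0 * s / (n * k * (k + 1)) - 3.0 * n * (k + 1)

# Four subjects, three conditions (e.g., three types of score)
scores = [[52, 50, 49], [57, 59, 61], [47, 44, 45], [60, 62, 59]]
print(round(friedman_chi2(scores), 3))
```

The statistic is referred to a chi-square distribution with k-1 degrees of freedom.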
Correlation
A correlation is useful when you want to see the relationship between two (or more) normally distributed interval variables. For example, using the hsb2 data file, we can run a correlation between two continuous variables, read and write.
correlations/variables = read write.
In the second example, we will run a correlation between a dichotomous variable, female, and a continuous variable, write. Although it is assumed that the variables are interval and normally distributed, we can include dummy variables when performing correlations.
correlations/variables = female write.
In the first example above, we see that the correlation between read and write is 0.597. By squaring the correlation and then multiplying by 100, you can determine what percentage of the variability is shared. Let's round 0.597 to 0.6, which when squared is .36; multiplied by 100, this is 36%. Hence read shares about 36% of its variability with write. In the output for the second example, we can see that the correlation between write and female is 0.256. Squaring this number yields .065536, meaning that female shares approximately 6.5% of its variability with write.
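The Pearson coefficient and the shared-variance reading used above can be reproduced by hand. A pure-Python sketch on illustrative read/write values (not the hsb2 data):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

read = [47, 52, 55, 57, 60, 63]
write = [44, 50, 52, 59, 58, 62]
r = pearson_r(read, write)
print(round(r, 3), round(100 * r ** 2, 1))  # r and % shared variability
```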
Simple linear regression
Simple linear regression allows us to look at the linear relationship between one normally distributed interval predictor and one normally distributed interval outcome variable. For example, using the hsb2 data file, say we wish to look at the relationship between writing scores (write) and reading scores (read); in other words, predicting write from read.
regression variables = write read/dependent = write/method = enter.
We see that the relationship between write and read is positive (.552) and, based on the t-value (10.47) and p-value (0.000), we would conclude this relationship is statistically significant. Hence, we would say there is a statistically significant positive linear relationship between reading and writing.
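The slope, intercept, and the t statistic for the slope come from a handful of sums. A pure-Python OLS sketch on illustrative data (SPSS's REGRESSION command computes the same quantities internally):

```python
import math

def simple_ols(x, y):
    """Return (slope, intercept, t_slope) for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares and the standard error of the slope
    resid_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se_slope = math.sqrt(resid_ss / (n - 2) / sxx)
    return slope, intercept, slope / se_slope

read = [47, 52, 55, 57, 60, 63]
write = [44, 50, 52, 59, 58, 62]
slope, intercept, t = simple_ols(read, write)
print(round(slope, 3), round(intercept, 3), round(t, 3))
```

The t statistic for the slope is tested on n-2 degrees of freedom, which is the t-value and p-value reported in the coefficients table of the SPSS output.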
Non-parametric correlation
A Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal). The values of the variables are converted into ranks and then correlated. In our example, we will look for a relationship between read and write. We will not assume that both of these variables are normal and interval.
nonpar corr/variables = read write
/print = spearman.
The results suggest that the relationship between read and write (rho = 0.617, p = 0.000) is statistically significant.
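Spearman's rho is simply the Pearson correlation computed on the ranks of each variable (average ranks for ties). A pure-Python sketch on illustrative data:

```python
import math

def ranks(values):
    """1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # average of the tied positions
        i = j + 1
    return r

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed variables."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

read = [47, 52, 55, 57, 60, 63]
write = [44, 50, 52, 59, 58, 62]
print(round(spearman_rho(read, write), 3))
```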
MULTIVARIATE STATISTICAL TECHNIQUES
Contents
Introduction
Multiple Regression
Discriminant Analysis
Logistic Regression
Multivariate Analysis of Variance (MANOVA)
Factor Analysis
Principal Component Analysis
Common Factor Analysis (Principal Axis Factoring)
Canonical Correlation Analysis
Cluster Analysis
Conjoint Analysis
Multidimensional Scaling
1 Relevance and Introduction
In general, what is true about predicting the success of a politician with the help of intelligence software, as pointed out above by the President of India, is equally true for predicting the success of products and services with the help of statistical techniques. In this chapter, we discuss a number of statistical techniques which are especially useful in the design of products and services. The products and services could be physical, financial, promotional like an advertisement, behavioural like a motivational strategy through an incentive package, or even educational like training programmes/seminars, etc. These techniques basically involve reduction of data and its subsequent summarisation, presentation and interpretation. A classical example of data reduction and summarisation is provided by the SENSEX of the Bombay Stock Exchange, which is a single number, like 18,000, yet it represents the movement of the share prices listed on the Bombay Stock Exchange. Yet another example is the Grade Point Average, used for assessment of MBA students, which reduces and summarises marks in all subjects to a single number.
In general, any live problem, whether relating to an individual, like predicting the cause of an ailment or a behavioural pattern, or relating to an entity, like forecasting its future status in terms of products and services, needs collection of data on several parameters. These parameters are then analysed to summarise the entire set of data with a few indicators which are then used for drawing conclusions. The following techniques (with their abbreviations in brackets), coupled with appropriate computer software like SPSS, play a very useful role in this endeavour of reduction and summarisation of data for easy comprehension.
Multiple Regression Analysis (MRA)
Discriminant Analysis (DA)
Logistic Regression (LR)
Multivariate Analysis of Variance (MANOVA) (introduction)
Factor Analysis (FA)
Principal Component Analysis (PCA)
Canonical Correlation Analysis (CCA) (introduction)
Cluster Analysis
Conjoint Analysis
Multidimensional Scaling (MDS)
Before describing these techniques in detail, we provide a brief description of each, and also indicate their relevance and uses, in the tabular format given below. This is aimed at providing motivation for learning these techniques and generating confidence in using SPSS for arriving at final conclusions/solutions in a research study. The contents of the table will be fully comprehended after reading about all the techniques.
Statistical Techniques, Their Relevance and Uses for Designing and Marketing of Products and
Services
Technique: Multiple Regression Analysis (MRA)
It deals with the study of the relationship between one metric dependent variable and more than one metric independent variables.

Relevance and Uses:
One could assess the individual impact of the independent variables on the dependent variable. Given the values of the independent variables, one could forecast the value of the dependent variable. For example, the sale of a product depends on expenditures on advertisements as well as on R & D. Given the values of these two variables, one could establish a relationship among these variables and the dependent variable, say, profit. Subsequently, if the relationship is found appropriate, it could be used to predict the profit with the knowledge of the two types of expenditures.
Technique: Discriminant Analysis (DA)
It is a statistical technique for classification: determining a linear function of the variables, called the discriminant function, which helps in discriminating between two groups of entities or individuals.

Relevance and Uses:
The basic objective of discriminant analysis is to perform a classification function. From the analysis of past data, it can classify a given group of entities or individuals into two categories: those which would turn out to be successful, and others which would not be so. For example, it can predict whether a company or an individual would turn out to be a good borrower. With the help of financial parameters, a firm could be classified as worthy of extending credit or not. With the help of financial and personal parameters, an individual could be classified as eligible for a loan or not, or as a likely buyer of a particular product/service or not. Salesmen could be classified according to their age, health, sales aptitude score, communication ability score, etc.
Technique: Logistic Regression (LR)
It is a technique that assumes the errors are drawn from a binomial distribution. In logistic regression the dependent variable is the probability that an event will occur; hence it is constrained between 0 and 1. The predictors can be all binary, a mixture of categorical and continuous, or all continuous.

Relevance and Uses:
Logistic regression is highly useful in biometrics and health sciences. It is used frequently by epidemiologists to model the probability (sometimes interpreted as risk) that an individual will acquire a disease during some specified period of vulnerability. Credit card scoring: various demographic and credit history variables could be used to predict whether an individual will turn out to be a good or bad customer. Market segmentation: various demographic and purchasing information could be used to predict whether an individual will purchase an item or not.
Technique: Multivariate Analysis of Variance (MANOVA)
It explores, simultaneously, the relationship between several non-metric independent variables (treatments, say fertilisers) and two or more metric dependent variables (say, yield and harvest time). If there is only one dependent variable, MANOVA is the same as ANOVA.

Relevance and Uses:
Determines whether statistically significant differences in the means of several variables occur simultaneously between two levels of a variable. For example, assessing whether:
(i) a change in the compensation system has brought about changes in sales, profit and job satisfaction;
(ii) geographic region (North, South, East, West) has any impact on consumers' preferences, purchase intentions or attitudes towards specified products or services;
(iii) a number of fertilizers have equal impact on the yield of rice as well as on the harvest time of the crop.
Technique: Principal Component Analysis (PCA)
A technique for forming a set of new variables that are linear combinations of the original set of variables and are uncorrelated. The new variables are called Principal Components. These variables are fewer in number as compared to the original variables, but th