Authors
Martin Klint Hansen
Nicholas Fritsche
Rasmus Porsgaard
Per Surland
Ulrick Tøttrup
Niels Yding Sørensen
Cand. Merc. Manual for SPSS 16.0
Description
Contains the statistical methods used at master's level.
IT – department
Fall 2008
Table of Contents
1. Introduction
2. General facts about SPSS
   2.1 The contents of the Data Editor
       2.1.1 The File menu
       2.1.2 The Edit menu
       2.1.3 The View menu
       2.1.4 The Data menu
       2.1.5 The Transform menu
       2.1.6 The Analyze menu
       2.1.7 The Graphs menu
       2.1.8 The Utilities menu
       2.1.9 The Help menu
   2.2 Contents of the Output Window
3. Cluster analysis
   3.1 Introduction
   3.2 Hierarchical analysis of clusters
       3.2.1 Example
       3.2.2 Implementation of the analysis
       3.2.3 Output
   3.3 K-means cluster analysis (Non-hierarchical cluster analysis)
       3.3.1 Example
       3.3.2 Implementation of the analysis
       3.3.3 Output
4. Correspondence Analysis
   4.1 Introduction
   4.2 Example
   4.3 Implementation of the analysis
       4.3.1 Specification of the variables
       4.3.2 Model
       4.3.3 Statistics
       4.3.4 Plots
   4.4 Output
   4.5 HOMALS (Multiple correspondence analysis)
5. Multiple analysis of variance
   5.1 Introduction
   5.2 Example
   5.3 Implementation of the analysis
   5.4 Output
6. Discriminant analysis
   6.1 Introduction
   6.2 Example
   6.3 Implementation of the analysis
       6.3.1 Statistics
       6.3.2 Classify
       6.3.3 Save
   6.4 Output
7. Profile analysis
   7.1 Introduction
   7.2 Example
   7.3 Implementation of the analysis
       7.3.1 Are the profiles parallel?
       7.3.2 Are the profiles congruent?
       7.3.3 Are the profiles horizontal?
8. Factor analysis
   8.1 Introduction
   8.2 Example
   8.3 Implementation of the analysis
       8.3.1 Descriptives
       8.3.2 Extraction
       8.3.3 Rotation
       8.3.4 Scores
       8.3.5 Options
   8.4 Output
9. Multidimensional scaling (MDS)
   9.1 Introduction
   9.2 Example
   9.3 Implementation of the analysis
       9.3.1 Model
       9.3.2 Options
   9.4 Output
10. Conjoint Analysis
   10.1 Introduction
   10.2 Example
   10.3 Generating an orthogonal design
       10.3.1 Options
   10.4 PlanCards
   10.5 Conjoint analysis
   10.6 Output
11. Item analysis
   11.1 Introduction
   11.2 Example
   11.3 Implementation of the analysis
       11.3.1 Model
       11.3.2 Statistics
   11.4 Output
12. Introduction to Matrix algebra
   12.1 Initialization
   12.2 Importing variables into matrices
   12.3 Matrix computation
   12.4 General Matrix operations
   12.5 Matrix algebraic functions
APPENDIX A [Choice of Statistical Methods]
Literature List
1. Introduction
The purpose of this manual is to introduce the reader to the use of SPSS for the statistical analyses used at Master Level (Cand. Merc.). For data manipulation and basic statistical analysis (such as T-test, ANOVA, loglinear and logit loglinear analysis) please refer to the manual “Introduction to SPSS 16.0”.
The manual is to be considered a collection of examples. The idea is to introduce the reader to the
statistical methods by means of fairly simple examples based on a specific problem. The manual
suggests a solution and provides a short interpretation. However, it is important to emphasize that it
is no more than a suggestion, not a final solution. The data sets referred to can be found at the following server location (accessible to all students at The Aarhus School of Business): \\okf-filesrv1\exemp\Spss\Manual.
The manual provides a short description of each statistical analysis, including its main purpose, and gives an example of a situation where the test in question is relevant. However, no theoretical basis is provided, which means that the reader is expected to find the relevant theoretical literature him/herself. Inspiration can be found in the list of literature at the back of this manual, or at the beginning of each chapter.
The manual has been written for SPSS version 16.0. If other versions are used, minor differences may occur.
Suggestions for improvements or corrections can be handed in at the IT instructors’ office in room
H14 or sent by e-mail to the following address: [email protected]
IT Department, The Aarhus School of Business, Autumn 2007.
2. General facts about SPSS
SPSS consists of two parts: a Data Editor and an Output Window. In the Data Editor, data manipulation is carried out and commands are executed. The Output Window displays all results, tables and figures, while also operating as a log window.
2.1 The contents of the Data Editor
The menu bar is located at the top of the Data Editor, and each of its items is discussed below.
2.1.1 The File menu
This item is used for the administration of data, i.e. loading, saving, importing and exporting data
and printing. The options are roughly similar to those of other Windows based programs.
2.1.2 The Edit menu
Edit is used for editing the contents of the current window. Here, the cut, copy and paste functions
are found. In addition, the options function makes it possible to change the fonts used in the
outputs, the decimal separator, etc.
2.1.3 The View menu
In this menu it is possible to switch the status bar, gridlines, etc. on and off. The font and font size used in the Data Editor are defined here as well.
2.1.4 The Data menu
This is where various manipulations of data are made possible. Examples include the definition of
new variables (Define Variable…) and the sorting of variables (Sort Cases…), etc.
2.1.5 The Transform menu
The Transform menu gives the user the opportunity to recode variables, generate random numbers,
rank data sets (construct ordinal data), define missing values, etc.
2.1.6 The Analyze menu
In this menu the user chooses which statistical analysis to use. The table below lists the available analyses together with a short description.
Method of Analysis        Description
Reports                   Case and report summaries
Descriptive Statistics    Descriptive statistics, frequencies, P-P and Q-Q plots, etc.
Custom Tables             Construction of various tables
Compare Means             Comparison of mean values, e.g. by means of a T-test or ANOVA
General Linear Model      Estimation by means of GLM, including MANOVA
Generalized Linear Model  Extends the regression and General Linear Model facilities,
                          in particular to models for data that do not satisfy the
                          assumption of normally distributed errors and to models that
                          connect the response to the predictors through a link function
Mixed Models              Flexible modeling of correlated data and data with
                          non-constant variability
Correlate                 Various measures of association between the variables of the
                          data set, e.g. covariance and Pearson's correlation coefficient
Regression                Regression: linear, logistic and curve estimation
Loglinear                 General loglinear analysis and logit loglinear analysis
Classify                  Cluster and discriminant analysis
Data Reduction            Factor analysis and correspondence analysis
Scale                     Item analysis and multidimensional scaling
Nonparametric Tests       Chi-square and binomial tests, other hypothesis tests and
                          tests of independence
Time Series               Autoregression and ARIMA
Survival                  Kaplan-Meier, Cox regression, etc.
Multiple Response         Frequency tables and crosstabs for multiple-response data sets
Missing Value Analysis    Description of patterns in missing values
2.1.7 The Graphs menu
In the Graphs menu there are several ways to get a graphical overview of the data set. The options include the construction of histograms, line graphs, pie charts and box plots, as well as Pareto, P-P and Q-Q diagrams.
2.1.8 The Utilities menu
Here it is possible to obtain information about the different variables, e.g. type, length, etc. If only some of the variables are needed, the user can define a working set of variables based on the existing ones. This is done by clicking Define Sets and choosing the relevant variables, and then activating the set via Use Sets in the same menu.
2.1.9 The Help Menu
In the Help menu it is possible to seek help on how specific calculations and data manipulations are handled in SPSS. The most important submenu is Topics, where it is possible to carry out a keyword search using the index bar. You can also press the F1 key anywhere in SPSS to get help about the specific menu you are working in. If, for example, you are setting up a regression and need help with the Statistics dialogue, pressing F1 displays information about the different options within it.
2.2 Contents of the Output Window
As mentioned, the Output Window shows results, tables and figures and works as a log window. In
the Window menu the user can switch between the Data Editor and the Output Window.
On the left side of the Output Window a tree structure is displayed, which provides a good overview of the results, figures and tables. A specific output can be viewed by clicking on it in the tree structure, whereupon it appears on the right side of the screen. If an error occurs, a log item will pop up; clicking this item opens a window with a possible explanation of the problem.
3. Cluster analysis
The analysis and interpretation of cluster analysis below is based on the following literature:
- ”Videregående data-analyse med SPSS og AMOS”, Niels Blunch, 1st edition 2000, Systime. Chp. 1, pp. 3-29.
3.1 Introduction
Cluster analysis is a multivariate procedure for detecting groupings in data where no grouping is known in advance. It is often applied as an explorative technique. The purpose of a cluster analysis is to divide the units of the analysis into smaller clusters, so that the observations within a cluster are homogeneous while the observations in different clusters differ from each other in one way or another.
In contrast to discriminant analysis, the groups in a cluster analysis are unknown from the outset. This means that the groups are not based on pre-defined characteristics; instead they are created on the basis of the characteristics of the data material. It is important to note that cluster analysis should be used with extreme caution, since SPSS will always find a structure, whether or not one is actually present. The choice of method should therefore always be carefully considered, and the output should be critically examined before the result of the analysis is employed.
Cluster analysis in SPSS is divided into hierarchical and K-means (non-hierarchical) clustering. Examples of both types of analysis are given below.
3.2 Hierarchical analysis of clusters
In the hierarchical method, clustering begins by finding the closest pair of objects (cases or variables) according to a distance measure and combining them to form a cluster. The algorithm starts with each case (or variable) in a separate cluster and combines clusters one step at a time until all objects are placed in a single cluster. The method is called hierarchical because it does not allow single objects to leave a cluster once they have joined it; in later steps only entire clusters are combined. As indicated, the method can be used for both cases and variables.
3.2.1 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\Hierakiskdata.sav
Data on a number of economic variables has been registered for 15 randomly chosen countries. By means of hierarchical cluster analysis it is now possible to find out which countries are similar with regard to the variables in question; in other words, the task is to divide the countries into relevant clusters.
The chosen countries are:
- Argentina, Austria, Bangladesh, Bolivia, Brazil, Chile, Denmark, The Dominican Republic,
India, Indonesia, Italy, Japan, Norway, Paraguay and Switzerland.
The following variables are included in the analysis:
- urban (number of people living in cities), lifeexpf (average life span for women), literacy (number of people that can read), pop_incr (increase in population in % per year), babymort (mortality for newborns per 1000 newborns), calories (daily calorie intake), birth_rt (birth rate per capita), death_rt (death rate per 1000 inhabitants), log_gdp (the logarithm of GDP per capita), b_to_d (birth rate in proportion to mortality), fertilty (average number of babies), log_pop (the logarithm of the number of inhabitants).
Both the hierarchical and the non-hierarchical method are described in the following paragraphs. Note, however, that for both methods the cluster structure is built from the cases. As mentioned later, the hierarchical analysis can also be based on variables, but this is beyond the scope of the manual.
3.2.2 Implementation of the analysis
Choose the following commands in the menu bar:
Analyze → Classify → Hierarchical Cluster...
This opens the following dialogue box:
The relevant variables are chosen and added to the Variable(s) window (see below). Under Cluster it is chosen whether the analysis concerns observations or variables, i.e. whether the user wants to group variables or observations into clusters. Since this example concerns observations, Cases is chosen.
When the hierarchical analysis is based on cases, as in the example above, it is very important that the variable used to label the cases is defined as a string (in Variable View under Type).
Under Display it is possible to include plots and/or statistics in the output. The Statistics and Method dialogue boxes must be visited before the analysis can be carried out; once they have been, the Save button can be used as well. These buttons are described below:
3.2.2.1 Statistics
Choosing Agglomeration schedule identifies which clusters are combined at each iterative step.
Proximity matrix includes the distances or similarities between the units (cases or variables) in the output.
Cluster membership includes the cluster membership of each observation or variable in the output.
3.2.2.2 Plots
A dendrogram or an icicle plot can be chosen here. In this example as well as in the following, a
dendrogram will be chosen, since the interpretation of the cluster analysis is often based on this.
3.2.2.3 Method
In this dialogue box the methods on which the cluster analysis is based are indicated. In Cluster
method a number of methods can be chosen. The most important are described below:
• Nearest neighbour (single linkage): The distance between two clusters is calculated as the distance between the two closest elements (one from each cluster).
• Furthest neighbour (complete linkage): The distance between two clusters is calculated as
the distance between the two elements (one from each cluster) that are furthest apart.
• Between-groups linkage (average linkage): The distance between two clusters is calculated
as the average of all distance combinations (which is given by all distances between
elements from the two clusters).
• Ward’s method (Ward’s algorithm): For each cluster a stepwise minimization of the sum of
the squared distances to the centre (the centroid) within each cluster is carried out.
The result of the cluster analysis depends heavily on the chosen method. As a consequence, one
should try different methods before completing the analysis. In this example Between-groups
linkage has been chosen.
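The stepwise merging described above can be sketched in a few lines of Python. This is an illustrative toy, not SPSS's internal implementation; the distance matrix below is hypothetical, and the linkage function implements the between-groups (average) method chosen in this example:

```python
# Minimal agglomerative clustering with between-groups (average) linkage.
# Illustrative sketch only -- SPSS performs the same kind of stepwise
# merging on the full proximity matrix.

def average_linkage(dist, c1, c2):
    """Average of all pairwise distances between members of two clusters."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    return sum(pairs) / len(pairs)

def agglomerate(dist):
    """Repeatedly merge the two closest clusters until one remains.
    Returns the merge history as (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(dist, clusters[a], clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        history.append((clusters[a][:], clusters[b][:], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

# Hypothetical symmetric distance matrix for four cases.
D = [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.7, 0.9],
     [0.9, 0.7, 0.0, 0.2],
     [0.8, 0.9, 0.2, 0.0]]
for a, b, d in agglomerate(D):
    print(a, b, round(d, 3))
```

Note how, once cases 0 and 1 are joined in the first step, later steps can only merge their whole cluster with others; this is the hierarchical property described in section 3.2.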
Measure provides the user with a choice of different distance measures depending on the type of data in question (interval data, counts or binary variables). In this example Interval has been chosen.
After this, the distance measure must be specified in the Interval drop-down box. A thorough discussion of these measures is beyond the scope of this manual, although several are relevant. It should be noted, however, that Pearson's correlation is often used in connection with variables, while Squared Euclidean distance is often used in connection with cases. The latter is the default in SPSS, and it is chosen in this example as well, since observations are considered here. When using Pearson's correlation one should be aware that larger values mean less distance (the measure ranges from -1 to 1).
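The two measures just mentioned can be sketched as follows. This is an illustrative sketch, not SPSS's implementation, and the two case profiles are hypothetical:

```python
# Squared Euclidean distance (the default for cases) and Pearson's
# correlation (a similarity measure: values near 1 mean the profiles
# are CLOSE). Illustrative sketch; the case profiles are hypothetical.

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

case_a = [0.74, 0.96, 0.99]   # hypothetical rescaled variable values
case_b = [0.78, 0.93, 0.99]
print(squared_euclidean(case_a, case_b))   # small value: the cases are close
print(pearson(case_a, case_b))             # near 1: the profiles are similar
```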
Transform values is used for standardizing data before the calculation of the distance matrix, which is often relevant in cluster analysis. This is due to the fact that variables measured on large scales contribute more than variables measured on small scales to the calculated distances. Consequently, this matter should be carefully considered. In this case, the values are rescaled to the range 0 to 1.
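The 0-to-1 rescaling selected here amounts to a simple min-max transformation of each variable. A sketch with hypothetical calorie values, not SPSS's own code:

```python
# Min-max rescaling to the 0-1 range, as chosen under "Transform values".
# Assumes the variable is not constant (max > min). Sketch only; the
# calorie figures below are hypothetical.

def rescale_0_1(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

calories = [1667, 2140, 3216, 3428]   # hypothetical daily calorie intakes
print(rescale_0_1(calories))
```

After this transformation every variable runs from 0 to 1, so a variable such as calories (in the thousands) no longer dominates a variable such as pop_incr (a few percent) in the distance calculation.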
Transform Measures is used for transforming data after the calculation of the distance matrix,
which is not relevant in this case.
3.2.2.4 Save
This dialogue box gives the user the opportunity to save cluster relationships for observations or
variables as new variables in the data set. This option is ignored in the example.
3.2.3 Output
After the specification of the options the analysis is carried out. This results in a number of outputs,
of which three parts are commented on and interpreted below.
(1) Proximity matrix shows the distances that have been calculated in the analysis, in this case the
calculated distances between the countries. For example, the distance between Norway and
Denmark is 0.154 when using the Squared Euclidean distance. The cluster analysis itself is carried
out on the basis of this distance matrix. As can be seen, the matrix is symmetric:
Proximity Matrix

Case              1:Arg  2:Aut  3:Ban  4:Bol  5:Bra  6:Chi  7:Den  8:Dom  9:Ind 10:Idn 11:Ita 12:Jap 13:Nor 14:Par 15:Swi
1:Argentina        .000  1.008  4.883  2.233   .442   .423  1.024   .936  3.164  1.353   .877   .836   .694  2.286   .840
2:Austria         1.008   .000  7.671  4.809  1.885  1.895   .184  2.667  5.653  2.746   .191   .742   .135  4.659   .128
3:Bangladesh      4.883  7.671   .000  1.398  3.032  5.109  8.498  3.024   .484  1.727  7.886  7.867  7.655  4.089  7.830
4:Bolivia         2.233  4.809  1.398   .000  1.590  1.883  5.276   .788  1.367  1.261  5.157  4.817  4.318  1.370  4.600
5:Brazil           .442  1.885  3.032  1.590   .000   .921  2.106   .787  1.665   .547  1.611  1.452  1.717  2.572  1.837
6:Chile            .423  1.895  5.109  1.883   .921   .000  2.081   .446  3.543  1.721  1.771  1.217  1.341  1.309  1.441
7:Denmark         1.024   .184  8.498  5.276  2.106  2.081   .000  3.125  6.437  3.488   .355   .994   .154  5.243   .342
8:Dominican R.     .936  2.667  3.024   .788   .787   .446  3.125   .000  2.160   .911  2.772  2.286  2.231   .920  2.311
9:India           3.164  5.653   .484  1.367  1.665  3.543  6.437  2.160   .000   .739  5.479  5.270  5.684  3.344  5.763
10:Indonesia      1.353  2.746  1.727  1.261   .547  1.721  3.488   .911   .739   .000  2.657  2.639  2.872  2.320  2.759
11:Italy           .877   .191  7.886  5.157  1.611  1.771   .355  2.772  5.479  2.657   .000   .312   .311  4.883   .236
12:Japan           .836   .742  7.867  4.817  1.452  1.217   .994  2.286  5.270  2.639   .312   .000   .634  4.245   .561
13:Norway          .694   .135  7.655  4.318  1.717  1.341   .154  2.231  5.684  2.872   .311   .634   .000  4.013   .112
14:Paraguay       2.286  4.659  4.089  1.370  2.572  1.309  5.243   .920  3.344  2.320  4.883  4.245  4.013   .000  3.952
15:Switzerland     .840   .128  7.830  4.600  1.837  1.441   .342  2.311  5.763  2.759   .236   .561   .112  3.952   .000

Squared Euclidean Distance. This is a dissimilarity matrix.
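The distance measure behind the matrix is easy to state: the Squared Euclidean distance between two observations is the sum of squared differences over their (standardized) variables. A minimal Python sketch, using made-up profiles rather than the manual's actual country data:

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance: the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical standardized profiles for two countries (illustrative values
# only, not the data behind the matrix above).
country_a = [0.9, 1.1, -0.2]
country_b = [0.7, 1.0, -0.3]
d = squared_euclidean(country_a, country_b)
```

Note that, unlike the ordinary Euclidean distance, no square root is taken, which is why the measure penalizes large single-variable differences heavily.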
The next output is the (2) Agglomeration Schedule, which shows at which stage the individual observations or clusters are combined. In the example, observation 13 (Norway) is first combined with observation 15 (Switzerland), after which observation 2 (Austria) is joined to the very same cluster. This procedure continues until all observations are combined. The Coefficients column shows the distance between the observations/clusters that are combined.
Agglomeration Schedule

         Cluster Combined                  Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1          13          15           .112            0           0           2
2           2          13           .132            0           1           3
3           2           7           .227            2           0           4
4           2          11           .273            3           0           8
5           1           6           .423            0           0           9
6           3           9           .484            0           0          14
7           5          10           .547            0           0          10
8           2          12           .649            4           0          13
9           1           8           .691            5           0          10
10          1           5          1.023            9           7          12
11          4          14          1.370            0           0          12
12          1           4          1.716           10          11          13
13          1           2          2.718           12           8          14
14          1           3          4.651           13           6           0
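The agglomeration logic itself can be sketched in a few lines: every observation starts as its own cluster, and at each stage the two closest clusters are merged and one schedule row is recorded. Single linkage is used below purely for brevity; SPSS offers several linkage methods, and the choice affects the schedule:

```python
def agglomerate(dist):
    """Repeatedly merge the two closest clusters until one remains.
    Single linkage (smallest pointwise distance) is assumed here."""
    clusters = {i: {i} for i in range(len(dist))}
    schedule = []
    while len(clusters) > 1:
        pairs = [(a, b) for a in clusters for b in clusters if a < b]
        # single-linkage distance between two clusters
        def link(a, b):
            return min(dist[i][j] for i in clusters[a] for j in clusters[b])
        a, b = min(pairs, key=lambda p: link(*p))
        schedule.append((a, b, link(a, b)))   # one row of the schedule
        clusters[a] |= clusters.pop(b)        # cluster b is absorbed into a
    return schedule

# Tiny symmetric toy matrix (not the manual's data)
toy = [[0.0, 0.1, 0.9],
       [0.1, 0.0, 0.8],
       [0.9, 0.8, 0.0]]
```

Running `agglomerate(toy)` merges observations 0 and 1 first, then attaches observation 2, mirroring how the schedule above grows one merge per stage.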
The (3) Dendrogram shown below is a simple way of presenting the cluster analysis:
The dendrogram indicates that there are three clusters in the data set. These are:
• Norway, Denmark, Austria, Switzerland, Italy and Japan (i.e. the European countries and
Japan).
• Brazil, Argentina, Chile, The Dominican Republic, Bolivia, Paraguay and Indonesia (i.e. the
South American countries and Indonesia).
• India and Bangladesh (i.e. the Asian countries).
The cluster structure is found by making a cut through the dendrogram where the distance between successive merges is relatively large. In the example above the cut is made between distances 10 and 15 on the rescaled distance axis. This structure is very much in accordance with the prior expectations one might have regarding the economic variables in question.
3.3 K-means cluster analysis (Non-hierarchical cluster analysis)
Contrary to hierarchical cluster analysis, non-hierarchical cluster analysis does allow observations to move from one cluster to another while the analysis is proceeding. The method is based on an iterative procedure in which every single observation is grouped into a fixed number of clusters until the ratio between the variance between the clusters and the variance within the clusters is maximized. The procedure itself will not be explained further here. It is noted, however, that the user has the opportunity to choose the number of clusters as well as so-called cluster centres, which will often be an advantage.
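The iterative idea can be sketched as follows (a toy one-dimensional version, not SPSS's implementation): points are assigned to the nearest centre, the centres are recomputed as group means, and the two steps repeat:

```python
def kmeans(points, centers, iterations=10):
    """Toy one-dimensional k-means: assign each point to the nearest
    centre, then recompute each centre as the mean of its points."""
    for _ in range(iterations):
        groups = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            groups[nearest].append(p)
        # a centre keeps its position if no point was assigned to it
        centers = [sum(g) / len(g) if g else centers[i] for i, g in groups.items()]
    return centers

# Two obvious clumps and two deliberately poor starting centres
final = kmeans([1, 2, 0, 9, 10, 11], centers=[0.0, 12.0])
```

Even with poor starting centres, the centres quickly migrate to the two clumps, which illustrates why the choice of initial cluster centres matters less than in the hierarchical case, where merges are irreversible.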
The non-hierarchical cluster analysis is often used in large data sets since the method is capable of
treating more observations than the hierarchical method. Apart from this the method is only used in
analysis regarding observations and not analysis regarding variables.
3.3.1 Example
Data: \\okf-filesrv1\exemp\Spss\Manual\K-meansdata.sav
The analysis is based on the same example as the hierarchical cluster analysis in section 3.2.1.
3.3.2 Implementation of the analysis
Before the cluster analysis is carried out it is necessary to standardize the variables from the previous example. As mentioned earlier, this is done to prevent variables with large values from carrying too much weight in the analysis. When using the k-means method it is not possible to make any
transformations during the analysis, as is possible in the hierarchical method. Therefore, this must be done prior to the analysis.
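The standardization performed here is the usual z-score transformation, which Descriptives can save as new variables. A sketch of what it computes:

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize to mean 0 and standard deviation 1 (the z-scores that
    Descriptives' 'Save standardized values as variables' option produces)."""
    m, s = mean(values), stdev(values)   # stdev uses the n-1 denominator, as SPSS does
    return [(v - m) / s for v in values]

zs = z_scores([2, 4, 6, 8])
```

After the transformation every variable contributes on the same scale, so no single variable can dominate the distance calculations.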
Choose Analyze – Descriptive statistics – Descriptives from the menu bar. This results in a dialogue
box, which should be completed as follows:
After this it is time to carry out the analysis on the basis of the new standardized variables.
Choose Analyze – Classify – K-Means Cluster from the menu bar:
This results in the following dialogue box:
Here all the standardized variables are moved to the Variables window. Since the names of the
countries are meant to be included in the output, the variable Country is moved to the Label Cases
by window.
As opposed to the hierarchical cluster analysis, the non-hierarchical cluster analysis does not require the variable used to label the cases (in this case country) to be defined as a string.
Method is set to Iterate and classify as default, since the classification is to be performed on the
basis of iterations.
In that connection it must be decided how many clusters are wanted as a result of the iteration
process. This is chosen in the Number of Clusters window. The number of clusters depends on the
data set as well as on preliminary considerations. From the result of the hierarchical analysis there is
reason to believe that the number of clusters is three, so this is the number that has been entered. If
one has detailed knowledge of the number and characteristics of expected clusters, it is possible to
define cluster centres in Centers.
In addition to the options described above there are a number of ways in which the user can influence the iteration procedure and the output. These are dealt with below.
3.3.2.1 Iterate
Clicking the Iterate button results in the following dialogue box with three changeable settings:
Maximum Iterations sets an upper limit on the number of iterations in the algorithm; the value can range between 1 and 999. The default is 10, and this is used in this example as well.
Convergence Criterion defines the limit at which the iteration is ended. The value must be between 0 and 1. In this example a value of 0.001 is used, which means that the iteration ends when none of the cluster centres moves by more than 0.1 percent of the shortest distance between two cluster centres.
If Use running means is selected, the cluster means are recalculated each time an observation has been assigned to a cluster. By default this only happens when all observations have been assigned to the clusters.
3.3.2.2 Save
Under Save it is possible to save the cluster number in a new variable in the data set. In addition it is
possible to create a new variable that indicates the distance between the specific observation and the
mean value of the cluster. See the dialogue box below:
3.3.2.3 Options
Clicking Options gives the user the following possibilities:
Initial cluster centers are the initial estimates of the cluster mean values (before any iterations).
ANOVA table carries out an F-test for the cluster variables. However, since the observations are assigned to clusters precisely so as to separate them, it is not advisable to use this test of the null hypothesis that there is no difference between the clusters.
Cluster information for each case ensures that a cluster number is assigned to each observation –
i.e. information about the cluster relationship. Furthermore, it produces information about the
distance between the observation and the mean value of the cluster.
Choosing Exclude cases listwise in the Missing Values section means that a respondent will be excluded from the calculation of the clusters if there is a missing value for any of the variables. Exclude cases pairwise will classify a respondent into the closest cluster on the basis of the remaining variables. Therefore, a respondent will always be classified unless there are missing values for all variables.
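The difference between the two strategies can be illustrated with a sketch of pairwise classification: distances are computed only over the non-missing variables. (Averaging over the available variables, as done below, is a simplifying assumption for the sketch, not SPSS's documented formula.)

```python
def classify(observation, centers):
    """Assign an observation to the nearest cluster centre, using only the
    variables that are not missing (None), like 'Exclude cases pairwise'."""
    def dist(obs, center):
        pairs = [(o, c) for o, c in zip(obs, center) if o is not None]
        # average squared difference over the available variables
        return sum((o - c) ** 2 for o, c in pairs) / len(pairs) if pairs else float('inf')
    return min(range(len(centers)), key=lambda i: dist(observation, centers[i]))

centers = [[0.0, 0.0], [5.0, 5.0]]   # hypothetical cluster centres
cluster = classify([4.5, None], centers)   # classified despite a missing value
```

Under listwise exclusion the observation `[4.5, None]` would simply be dropped; under pairwise exclusion it is still assigned, here to the second cluster.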
3.3.3 Output
The analysis results in a number of outputs, of which the most relevant will be interpreted and commented on below.
The output (1) Cluster Membership shows, as the name indicates, which cluster each country belongs to. From the output it is obvious that, except for a few changes, the cluster structure is identical to that of the hierarchical method. Distance is a measure of how far the individual country is located from the centre of its cluster. The output shows that Brazil is the country located furthest away from its cluster centre, at a distance of 3.075.
The next output to be commented on is (2) Final Cluster Centers. From this output the mean value of the standardized variables for each of the three clusters appears. It is therefore possible to compare the clusters on the basis of each variable.
The last output, which will be commented on is (3) Distances between Final Cluster Centers.
Distances between Final Cluster Centers

Cluster        1        2        3
1                   4.301    4.321
2          4.301             5.818
3          4.321    5.818
This output indicates the individual distances between the clusters. In this example clusters 1 and 2 are most alike. A closer look at output (2) is required in order to explore the differences and similarities in depth.
4. Correspondence Analysis
The following explanation and interpretation of the correspondence analysis is based on the
following literature:
- “SPSS Categories® 10.0”, Jacqueline J. Meulman & Willem J. Heiser, SPSS Inc., 1999. Chp. 1, pp. 10-11; Chp. 5, pp. 45-46; Chp. 11, p. 147.
For further reading see:
- “Theory and Applications of Correspondence Analysis”, Michael J. Greenacre, Academic Press Inc., 1984.
4.1 Introduction
The purpose of correspondence analysis is to make a graphical illustration of data in cross tables,
since the relation between rows and columns in a cross table can be hard to comprehend. Often, a
cross table does not create a well-arranged picture of the characteristics of a data set – i.e. the
relation between two variables. This is especially the case when nominal scaled data are considered
(data without any natural order or rank).
Factor analysis is normally perceived as the standard technique used to describe the connection between variables. However, an important requirement for factor analysis is that the data are interval scaled. Correspondence analysis makes it possible to perform an analysis of the relationship
between nominal scaled variables graphically on a multidimensional level (see appendix 1). Row
and column values are calculated and plots are produced based on this. Categories that are alike will
be placed close to each other in a plot. This makes the analysis a useful technique for market
segmentation.
Please note that it is possible to make a multiple correspondence analysis (homogeneity analysis)
with more than two variables. See the last section of this chapter.
4.2 Example¹
Data set: \\okf-filesrv1\exemp\Spss\Manual\Korrespondancedata.sav
A company wants to analyze which brands are preferred by different age groups.
¹ The example is based on Blunch, Niels, “Analyse af markedsdata”, 2nd rev. edition, 2000, Systime, pp. 70-78
• Brands: A, B, C and D
• Age groups: 20-29 years, 30-39 years, 40-49 years and >50 years
210 randomly selected respondents distributed across the age groups have been asked about their preference for the four brands. Firstly, it must be clarified whether the objective is to study the differences between the age groups with regard to their choice of brand, to map out the brands' composition in relation to age, or a combination of both. In the latter case, which is shown here, it is necessary to run the analysis twice. The first step is to carry out an analysis based on row profiles, or Row principal as it is called in SPSS. After this the analysis is carried out based on column profiles, or Column principal as SPSS calls it.
4.3 Implementation of the analysis
The starting point of the correspondence analysis, Analyze – Data Reduction – Correspondence Analysis, is shown below:
4.3.1 Specification of the variables
A dialogue box now appears, and it should be filled out as follows:
The age variable is moved to the Row window and the brand variable is moved to Column. The question marks next to the variables remain until the Range, i.e. the minimum and maximum values, has been defined.
This must be done for each variable by clicking the Define Range button. This results in the
dialogue box shown below:
For both variables in the data set Range is from 1 to 4.
The output from the correspondence analysis has yet to be specified. This is done by specifying the
settings in Model, Statistics and Plots. Each of these will be described separately below.
4.3.2 Model
By clicking the Model button the user can define which methods should be used for estimating the
model. First, the desired number of dimensions for the solution is defined. A two-dimensional
solution is desired in this example. Distance Measure deals with the method used for estimating the distance matrix. For a standard correspondence analysis such as this one, the Chi Square distance measure is chosen. This automatically activates Row and column means are removed as the Standardization Method.
The last setting has to do with the definition of the method by which normalization takes place. The
chosen method is important for the later interpretation of the analysis, which is discussed below (see
section 4.4 output).
As mentioned above, the analysis must be carried out twice to obtain a graphic picture of the row
and the column profiles and to provide an opportunity for a comparison of the two.
First, the analysis is carried out on the basis of the row profiles, because our first priority is to
examine the difference between the age groups with regard to brand choice. Therefore, Row
principal is chosen.
4.3.3 Statistics
Clicking the Statistics button makes it possible to calculate various statistical measures. Correspondence table displays a cross tabulation of the data. This table provides a very clear picture of the distribution in the individual groups. The Row and Column profiles are useful as well, so they are also selected. In Confidence Statistics it is possible to calculate standard deviations and correlations for either the row or the column profiles. Which additional measures to choose depends on the individual analysis and its purpose.
4.3.4 Plots
The Plots menu is divided into two parts. Scatterplots makes it possible to plot the individual
variables (Row and Column points) against the individual dimensions (in this case age and brand),
as well as to plot both variables against these dimensions (Biplot).
ID label width for scatterplots is used to define the maximum number of characters (letters) to be
used in the description of an individual point in the plot. When there are a lot of points in the plot, it
is often a good idea to reduce this number in order to avoid confusion in the scatterplot. In this case
a reduction is not necessary.
The second part of the Plots menu, Lineplots, makes it possible to plot the two variables individually against each of the dimensions. Transformed row categories produces a plot of the row categories (i.e. age) against the matching row scores. A similar plot with the column categories (i.e. brand) and the column scores is produced when Transformed column categories is chosen. Since Row principal was chosen above, one part of the scatterplot will be identical to the plot of the row profiles. Finally, it is possible to restrict the number of dimensions in the Plot Dimensions submenu. The analysis is carried out by clicking Continue followed by OK.
4.4 Output
The output from the correspondence analysis is discussed below. Some parts of the output are left
out, as they have not been considered relevant. The first output is (1) Correspondence table, which
is a cross tabulation of the data set.
Correspondence Table

                          brand
age               A     B     C     D   Active Margin
20-29 years      30    18     6     3        57
30-39 years      10    30    10     5        55
40-49 years       3    10    27     8        48
>50 years         2     2    40     6        50
Active Margin    45    60    83    22       210
The table indicates that 210 respondents are included in the analysis. Furthermore, it shows that
there are 55 respondents in the age group 30-39 years. Ten of these prefer brand A, 30 prefer brand
B, ten prefer brand C, and five people prefer brand D. It can be concluded that this age group is in
favor of brand B in contrast to the age group >50 years, where only two prefer brand B. For all age
groups in total there is a preference for brand C with 83 of the 210 respondents in favor of this
brand.
The next output, (2) Row Profiles, is a cross tabulation of the row profiles:
Row Profiles

                          brand
age               A      B      C      D    Active Margin
20-29 years    .526   .316   .105   .053    1.000
30-39 years    .182   .545   .182   .091    1.000
40-49 years    .063   .208   .563   .167    1.000
>50 years      .040   .040   .800   .120    1.000
Mass           .214   .286   .395   .105
The horizontal percentages display the difference between the individual age groups with regard to their brand choice. For example, 18.2 % in the age group 30-39 years prefer brand A, 54.5 % prefer B, 18.2 % prefer C, and 9.1 % prefer D. As mentioned above, the largest preference across all 210 respondents is brand C. As can be seen from the Mass row, this preference represents 39.5 %.
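The row profiles can be reproduced directly from the correspondence table by dividing each cell by its row total, e.g. in Python:

```python
# Counts taken from the Correspondence Table
table = {
    '20-29 years': [30, 18, 6, 3],
    '30-39 years': [10, 30, 10, 5],
    '40-49 years': [3, 10, 27, 8],
    '>50 years':   [2, 2, 40, 6],
}
# Each row is divided by its own total, so every profile sums to 1
row_profiles = {age: [round(n / sum(counts), 3) for n in counts]
                for age, counts in table.items()}
```

The column profiles in output (3) are obtained the same way, only dividing by column totals instead.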
A cross tabulation of the column profiles is found in (3) Column Profiles:
Column Profiles

                          brand
age               A      B      C      D     Mass
20-29 years    .667   .300   .072   .136     .271
30-39 years    .222   .500   .120   .227     .262
40-49 years    .067   .167   .325   .364     .229
>50 years      .044   .033   .482   .273     .238
Active Margin 1.000  1.000  1.000  1.000
The vertical percentages display the age composition of each individual brand. For example, of all
the 45 respondents preferring brand A, 66.7 % belong to the age group 20-29 years, 22.2 % belong
to the age group 30-39 years, 6.7 % belong to the age group 40-49 years, and only 4.4 % belong to
the age group >50 years.
In order to determine the degree of pure coincidence in the data set with regard to the differences
described above, a χ2-test is carried out. It tests a homogeneity hypothesis, i.e. a test of whether the
individual age groups can be assumed to be alike. The result is to be found in (4) Summary:
Summary

           Singular                               Proportion of Inertia       Confidence Singular Value
Dimension  Value     Inertia  Chi Square  Sig.    Accounted for  Cumulative   Std. Deviation  Correlation 2
1          .648      .420                         .808           .808         .048            .153
2          .309      .095                         .184           .991         .075
3          .068      .005                         .009          1.000
Total                .520     109.191     .000a  1.000          1.000

a. 9 degrees of freedom
It appears from the output that the hypothesis of equal market profiles for the four age groups must be rejected. This conclusion is based on a χ2-value of 109.19 with a corresponding p-value of 0.000. Therefore, the profiles can be expected to be different, and in the plot below they will be separated from each other. Had the p-value been insignificant, the dispersion would have been small, and it would not have been possible to conclude that the preferences differ. (4) Summary also shows how much of the dispersion can be retained in a plot of only two dimensions. The first dimension explains 80.8 % of the inertia, and the second dimension 18.4 %, leaving only 0.9 % for the third, excluded, dimension.
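The χ2-value can be verified directly from the correspondence table with the usual formula: chi2 is the sum over all cells of (observed − expected)²/expected, where expected = row total × column total / grand total:

```python
# Counts from the Correspondence Table
observed = [[30, 18, 6, 3],
            [10, 30, 10, 5],
            [3, 10, 27, 8],
            [2, 2, 40, 6]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)   # grand total: 210 respondents

chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(4) for j in range(4))
print(round(chi2, 2))   # 109.19, as reported in the Summary output
```

Note that the statistic also equals the grand total times the total inertia, 210 × 0.520 ≈ 109.2, which ties the Summary's Inertia and Chi Square columns together.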
Below the interesting Biplot (5) Row and Column Points/Row Principal Normalization is shown:
The plot shows the row and the column profiles together. However, it is only possible to compare
the row points (age) individually, since Row Principal was chosen as the normalization method.
The smaller the distances between the age groups, the more similar they are with regard to brand
preferences and vice versa. In this case the age groups 40-49 years and >50 years are most alike.
However, it is not possible to conclude anything on the basis of the location of the brands.
As mentioned above, the analysis must be run twice in order to compare the individual column
points (brand). In the second analysis Column Principal should be chosen as normalization method.
The procedure was shown in section 4.3.2 in the dialogue box Correspondence Analysis: Model:
Part of the dialogue box Correspondence Analysis: Model (see section 4.3.2).
This results in a new Biplot (6) Row and Column Points/Column Principal Normalization:
The plot shows the row and the column profiles together. Again it is only possible to compare the column points (brand) individually, since Column Principal has now been chosen as the normalization method. From the plot it appears that brands C and D are most alike with regard to the age groups to which they appeal.
Unfortunately, it is not possible to create “The French plot”, a combination of the two plots (5) and
(6), directly in SPSS. However, this can be done from the output instead. From the analysis where
Row Principal was used, the coordinates for both the first and the second dimension for the row
variable (age) can be viewed. This is shown in (7) Overview Row Points below.
Overview Row Points(a)

                                                       Contribution
                       Score in Dimension    Of Point to Inertia   Of Dimension to Inertia
                                             of Dimension          of Point
age            Mass      1       2   Inertia     1       2            1       2     Total
20-29 years    .271   -.756    .354     .189   .369    .356         .820    .180    1.000
30-39 years    .262   -.399   -.445     .094   .099    .542         .443    .552     .995
40-49 years    .229    .462   -.097     .054   .116    .022         .906    .040     .946
>50 years      .238    .856    .179     .183   .416    .080         .952    .041     .993
Active Total  1.000                     .520  1.000   1.000

a. Row Principal normalization
After this the coordinates for both dimensions for the column variable (brand) can be viewed from
the analysis, where Column Principal was used. This is shown in (8) Overview Column Points
below.
Overview Column Points(a)

                                                       Contribution
                       Score in Dimension    Of Point to Inertia   Of Dimension to Inertia
                                             of Dimension          of Point
brand          Mass      1       2   Inertia     1       2            1       2     Total
A              .214   -.808    .448     .183   .333    .451         .764    .236    1.000
B              .286   -.494   -.409     .118   .166    .500         .593    .405     .998
C              .395    .710    .086     .203   .475    .031         .983    .014     .998
D              .105    .321   -.127     .016   .026    .018         .657    .103     .761
Active Total  1.000                     .520  1.000   1.000

a. Column Principal normalization
When the coordinates for each point (in this case there are eight points in total, i.e. four age groups and four brands) have been found, they can be plotted in a coordinate system (Excel can easily be used). This procedure, as well as the interpretation of "The French plot", is left to the reader².
² Blunch, Niels J.: “Analyse af markedsdata”, p. 70.
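As a sketch of the data the reader would assemble, the eight coordinate pairs from outputs (7) and (8) can be gathered into one structure ready for plotting or export:

```python
# Dimension-1 and dimension-2 scores copied from the two Overview outputs
row_points = {                    # from Overview Row Points (Row Principal)
    '20-29 years': (-0.756, 0.354),
    '30-39 years': (-0.399, -0.445),
    '40-49 years': (0.462, -0.097),
    '>50 years':   (0.856, 0.179),
}
col_points = {                    # from Overview Column Points (Column Principal)
    'A': (-0.808, 0.448),
    'B': (-0.494, -0.409),
    'C': (0.710, 0.086),
    'D': (0.321, -0.127),
}
# All eight labelled points of "The French plot" in one structure
french_plot = {**row_points, **col_points}
```

Plotting these points in one coordinate system, e.g. with Excel as suggested above, yields the combined plot.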
4.5 HOMALS (Multiple correspondence analysis)
As mentioned in the beginning, it is also possible to make a multiple correspondence analysis in SPSS, which will be briefly presented here. HOMALS is an abbreviation for homogeneity analysis by means of alternating least squares. The purpose of HOMALS is to find a number of homogeneous groups. The analysis attempts to produce a solution where groups within the same category will be close to each other, while groups that are different from each other will be plotted far apart.
An important difference between the multiple correspondence analysis and the ordinary correspondence analysis is that the input for the multiple analysis is a data matrix, where the rows represent objects and the columns represent variables. In the ordinary correspondence analysis, however, it is possible for both rows and columns to represent variables. The multiple correspondence analysis makes an analysis with more than two variables possible, which is not possible in ordinary correspondence analysis.
The purpose of multiple correspondence analysis is to describe the relationship between two or
more nominal scaled variables on a multidimensional level, containing the categories of variables as
well as the objects in these categories. It is a qualitative factor analysis where nominal scaled
variables are measured.
As an example, multiple correspondence analysis can be used to graphically illustrate the relation between employment, income and gender. It is possible that the analysis will reveal that income and gender discriminate between the respondents, but that there is no difference with regard to employment. The implementation of the analysis will not be shown in this manual.
5. Multiple analysis of variance
The following explanation and interpretation of the multiple analysis of variance is based on the following literature:
- ”Videregående data-analyse med SPSS og AMOS”, Niels Blunch, 1st edition, 2000, Systime. Chp. 3, pp. 51-67.
- ”Analyse af markedsdata”, Niels Blunch, 2nd rev. edition, 2000, Systime. Chp. 6, pp. 213-220.
- ”Multivariate data analysis”, Hair, Anderson et al., 5th edition, Prentice Hall. Chp. 6, pp. 326-387.
5.1 Introduction
Multiple analysis of variance (MANOVA) is an extension of the traditional analysis of variance (ANOVA) and is used in situations where there is more than one dependent variable, each depending on one or more explanatory variables (see appendix 1). Similar to ANOVA, the method is used to measure or prove differences between groups or experimental treatments.
The difference between the two methods is that ANOVA identifies group differences based on one single dependent variable. MANOVA, in contrast, identifies differences simultaneously across several dependent variables. In other words, MANOVA clarifies differences between groups while at the same time considering the correlations between the dependent variables. Notice that it is not sufficient to carry out an ANOVA test for each of the dependent variables separately: there will be situations where MANOVA is significant despite the fact that all ANOVA tests are insignificant. Likewise the opposite can occur, where an ANOVA test is significant and the MANOVA is insignificant.
A short note on the connection between MANOVA and discriminant analysis: provided that MANOVA turns out to be significant, it will be possible to find a linear function of the dependent variables that separates the groups. Hereby, at least one of the discriminant functions is significant.
Therefore, MANOVA and discriminant analysis are opposite analyses. In MANOVA the groups are
known in advance, and the objective is to find their relationship with the values of the variables. In
discriminant analysis the objective is to find the groupings based on the values of the variables.
Obviously discriminant analysis, mentioned in chapter 4, should only be employed when the
MANOVA test is significant.
5.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\Manovadata.sav
The objective is to test whether the effect of two pills given at the same time is more effective than
the intake of only one of the pills. 20 patients with the disease in question are distributed into four
groups by lot.
In group 1 the patients get a placebo, in group 2 they get both pills, in group 3 the patients get one
of the pills and in group 4 they get the other pill. The effect is measured on two variables: Y1 and
Y2. The data is shown in table 1 below:
Table 1

               Treatments (Groups)
           1          2          3          4
         Y1  Y2     Y1  Y2     Y1  Y2     Y1  Y2
          1   2      8   9      2   4      4   5
          2   1      9   8      3   2      3   3
          3   2      7   9      3   3      3   4
          2   3      8   9      3   5      5   6
          2   2      8  10      4   6      5   7
Means     2   2      8   9      3   4      4   5
The null hypothesis for this multivariate analysis of variance is that the treatment has no influence
on either Y1 or Y2. The alternative hypothesis is that the treatment can explain the effects on Y1 and
Y2.
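As a sketch of the statistic behind this test, Wilks' Λ can be computed from the data in Table 1 as det(W)/det(T), where W is the within-groups SSCP (sums-of-squares-and-cross-products) matrix and T the total SSCP matrix. SPSS reports Wilks' lambda in its Multivariate Tests output; the code below is an illustration of the arithmetic, not the manual's own procedure:

```python
# (Y1, Y2) per patient, copied from Table 1
groups = {
    1: [(1, 2), (2, 1), (3, 2), (2, 3), (2, 2)],   # placebo
    2: [(8, 9), (9, 8), (7, 9), (8, 9), (8, 10)],  # both pills
    3: [(2, 4), (3, 2), (3, 3), (3, 5), (4, 6)],   # pill A
    4: [(4, 5), (3, 3), (3, 4), (5, 6), (5, 7)],   # pill B
}

def mean2(pairs):
    n = len(pairs)
    return (sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n)

def sscp(pairs, center):
    """2x2 sums-of-squares-and-cross-products matrix around a given centre."""
    s11 = sum((y1 - center[0]) ** 2 for y1, _ in pairs)
    s22 = sum((y2 - center[1]) ** 2 for _, y2 in pairs)
    s12 = sum((y1 - center[0]) * (y2 - center[1]) for y1, y2 in pairs)
    return [[s11, s12], [s12, s22]]

def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# W: deviations taken around each group's own mean
within = [[0, 0], [0, 0]]
for pairs in groups.values():
    within = add(within, sscp(pairs, mean2(pairs)))

# T: deviations taken around the grand mean of all 20 patients
everyone = [p for pairs in groups.values() for p in pairs]
total = sscp(everyone, mean2(everyone))

wilks = det(within) / det(total)   # small values speak against the null hypothesis
```

For these data Λ is small (about 0.07), which anticipates the rejection of the null hypothesis in the output section below.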
5.3 Implementation of the analysis
The starting point of the MANOVA, Analyze – General Linear Model – Multivariate, is shown below:
When choosing GLM Multivariate a dialogue box with various options appears as shown below.
First the dependent variables must be chosen. In this case it is Y1 and Y2, so they are marked and
moved to the Dependent Variables window. All dependent variables must be interval scaled.
After this, the group variable(s) that determine the groups to be compared on the basis of the dependent variables must be defined in the Fixed Factor(s) window. In this example the group variable (treatment) is moved to the window, so that the four groups are compared with regard to Y1 and Y2.
In the Covariate(s) window, covariates can be added. The linear effect of the covariates on the dependent variables will in that case be removed separately from the factor level effects. In this example such variables are assumed to be non-existent.
In the WLS Weight window a variable can be added in order to assign different weights to the
observations. However, this is not relevant in this example.
The output and the estimation of the MANOVA have yet to be specified. This will be done by
describing the Model, Contrast, Plots, Post Hoc and Options buttons individually below.
5.3.1.1 Model
Clicking the Model button results in the following dialogue box, where the model can be specified.
In Specify Model the default setting is Full factorial, which means that a full model is specified –
i.e. a model containing the intercept, all factor and covariate main effects as well as all interactions.
The alternative is Custom, where the model is specified manually – i.e. terms from the full model
can be omitted. If Custom is chosen, the desired factors must be marked and moved to the window
in the right-hand side of the dialogue box. In this example a Full factorial model has been chosen.
At the bottom of the dialogue box it is possible to specify the method by which the sums of squares are calculated. The options are Type I, II, III or IV. Type III is the default, and it is also the most widely used method at the Aarhus School of Business. Therefore, Type III is chosen for this example as well.
Before clicking Continue it is a good idea to make sure that Include intercept in model is activated.
This ensures the calculation of an overall mean, which is required if Deviation is chosen as the
estimation method for the factors (see next section).
5.3.1.2 Contrast
Clicking the Contrast button results in the following dialogue box:
In Change Contrast the estimation method for each factor can be changed. No method is chosen by
default, but the user can choose between six different ones: Deviation, Simple, Difference, Helmert,
Repeated and Polynomial. A method is chosen by marking the factor in the Factors window and
clicking the appropriate method in the Contrast drop-down-box. The chosen method depends on
each individual data set. If Deviation or Simple is chosen, SPSS requires a specification of a
Reference Category (either the first or last variable in the data set). In this example Deviation has
been chosen.
5.3.1.3 Plots
Clicking this button makes it possible to produce profile plots for each factor; a separate plot is produced for each dependent variable. In Horizontal Axis the user must specify the factor to be plotted. When the chosen factor has been moved from Factors to Horizontal Axis, the Add button must be clicked for the factor to appear in the Plots window at the bottom.
When there are several factors (grouping variables), the described procedure is repeated for all
factors, of which a profile plot is desired. Furthermore, it is possible to use Separate Lines for each
variable to combine two factors with each other in the plot. In this example where there is only one
factor, the factor (group) is moved to Horizontal Axis, and the Add button is clicked.
5.3.1.4 Post Hoc
When it has been determined that there is a difference between two or more groups, Post Hoc
Multiple Comparison can be used to determine which groups are significantly different from each
other. It is worth mentioning that the chosen tests are only carried out for each dependent variable
separately, which is similar to the comparisons that are made in ANOVA. In order to specify which tests to use, the factors to be tested must first be specified. This is done at the top of the dialogue box:
There are several tests, categorized on the basis of whether or not equal variances in the groups are assumed. A thorough discussion of each test is beyond the scope of this manual, but at the Aarhus School of Business the tests that have been activated in the dialogue box shown above are normally used.
5.3.1.5 Options
Clicking the Options button results in the dialogue box shown below. In Estimated Marginal Means
the factors or factor combinations (in the case of several factors) are chosen, for which an estimate
is desired. Since this example only contains a single factor, group is moved to the Display means
for window.
Display offers various ways of controlling the output from the GLM procedure. The boxes that are
selected in the dialogue box above will be described below.
• Descriptive statistics displays the observed mean values and standard deviations for each dependent variable in all factor combinations.
• Estimates of effect size produces a partial eta-squared value for each effect and each parameter estimate.
• Observed power indicates the power of the individual tests.
• Parameter estimates displays the parameter estimates, standard errors, t-tests, confidence
intervals, and the observed power for each test.
• SSCP matrices display the sum-of-squares and cross-product matrices. When Residual
SSCP is chosen, a Bartlett’s test for sphericity of the residual covariance matrix is added.
• Homogeneity tests carry out Levene’s test for variance homogeneity for each dependent variable across the existing factor levels/combinations. In addition, Box’s M test for homogeneity of the covariance matrices of the dependent variables across all factor combinations is carried out.
• Spread vs. level plots and residual plots are very useful for checking the model
assumptions. Residual plots produce an observed/expected standardized residual plot for
each dependent variable.
Furthermore, it is possible to change the level of significance.
5.4 Output
When the settings described above have been specified, click the OK button in the initial Multivariate dialogue box in order for SPSS to carry out the analysis and produce the output. The outputs, with the exception of a few that have been considered irrelevant for discussion, will be commented on step by step below.
Between-Subjects Factors

gruppe   Value Label    N
1        placebo        5
2        begge piller   5
3        pille A        5
4        pille B        5

Descriptive Statistics

     gruppe         Mean   Std. Deviation   N
y1   placebo        2.00   .707             5
     begge piller   8.00   .707             5
     pille A        3.00   .707             5
     pille B        4.00   1.000            5
     Total          4.25   2.447            20
y2   placebo        2.00   .707             5
     begge piller   9.00   .707             5
     pille A        4.00   1.581            5
     pille B        5.00   1.581            5
     Total          5.00   2.847            20
The outputs shown above contain a description of the data set, among other things the number of observations in each group, output (1a). Output (1b) contains some descriptive statistics, which will not be discussed further.
Output (2) displays Box’s test, which indicates whether or not the covariance matrices for the
individual groups are equal. This constitutes one of the assumptions of MANOVA, and it is similar
to the assumption regarding equal variances for all groups known from ANOVA. From the shown
output it appears that the null hypothesis of equal covariance matrices is accepted, which means that
the assumption is not violated.
Box's Test of Equality of Covariance Matrices(a)

Box's M   13.100
F         1.123
df1       9
df2       12933.711
Sig.      .343

Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.
a. Design: Intercept+gruppe
After this test of assumptions, (3) Bartlett’s test of Sphericity follows. The null hypothesis is that the residual covariance matrix is proportional to an identity matrix, which means that the two dependent variables are uncorrelated. If the null hypothesis is accepted, there is no reason to carry out the MANOVA, since a number of separate ANOVAs would be sufficient. As can be seen
from the output below, the null hypothesis is rejected, and MANOVA is therefore relevant in this
example.
Bartlett's Test of Sphericity(a)

Likelihood Ratio     .016
Approx. Chi-Square   6.212
df                   2
Sig.                 .045

Tests the null hypothesis that the residual covariance matrix is proportional to an identity matrix.
a. Design: Intercept+gruppe
The output (4) Multivariate Tests shown below displays a number of multivariate tests, which are all significant. All four test statistics indicate that the null hypothesis of equal groups is rejected, since the p-value is approximately 0. It can therefore be concluded that the combined dependent variables, Y1 and Y2, vary across the treatments.
Multivariate Tests(d)

Effect                             Value    F            Hypothesis df   Error df   Sig.   Partial Eta Squared   Noncent. Parameter   Observed Power(a)
Intercept   Pillai's Trace         .976     303.141(b)   2.000           15.000     .000   .976                  606.283              1.000
            Wilks' Lambda          .024     303.141(b)   2.000           15.000     .000   .976                  606.283              1.000
            Hotelling's Trace      40.419   303.141(b)   2.000           15.000     .000   .976                  606.283              1.000
            Roy's Largest Root     40.419   303.141(b)   2.000           15.000     .000   .976                  606.283              1.000
gruppe      Pillai's Trace         1.027    5.631        6.000           32.000     .000   .514                  33.786               .989
            Wilks' Lambda          .073     13.566(b)    6.000           30.000     .000   .731                  81.396               1.000
            Hotelling's Trace      11.414   26.632       6.000           28.000     .000   .851                  159.791              1.000
            Roy's Largest Root     11.292   60.223(c)    3.000           16.000     .000   .919                  180.670              1.000

a. Computed using alpha = .05
b. Exact statistic
c. The statistic is an upper bound on F that yields a lower bound on the significance level.
d. Design: Intercept+gruppe
The SSCP matrices are shown in output (5) below. The group (H) and error (E) matrices display the variation between and within the groups, respectively. The matrices are displayed in bold.
It is the relationship between these matrices that forms the basis of the multivariate tests mentioned
above.
Between-Subjects SSCP Matrix

                                 y1        y2
Hypothesis   Intercept   y1      361.250   425.000
                         y2      425.000   500.000
             gruppe      y1      103.750   115.000
                         y2      115.000   130.000
Error                    y1      10.000    7.000
                         y2      7.000     24.000

Based on Type III Sum of Squares
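To see how these matrices enter the multivariate tests, Wilks’ Lambda and Hotelling’s Trace for gruppe can be reproduced from the printed H and E matrices. This is a sketch in Python with NumPy, not part of the SPSS procedure itself:

```python
import numpy as np

# Hypothesis (H) and error (E) SSCP matrices for the gruppe effect,
# taken from the Between-Subjects SSCP Matrix output above.
H = np.array([[103.750, 115.000],
              [115.000, 130.000]])
E = np.array([[10.000,  7.000],
              [ 7.000, 24.000]])

# Wilks' Lambda = det(E) / det(H + E)
wilks = np.linalg.det(E) / np.linalg.det(H + E)

# Hotelling's Trace = trace(H * E^-1)
hotelling = np.trace(H @ np.linalg.inv(E))

print(round(wilks, 3))      # matches the .073 for gruppe in output (4)
print(round(hotelling, 3))  # matches the 11.414 for gruppe in output (4)
```

It is exactly this ratio of within- to total variation (and the related eigenvalue-based traces) that the four test statistics in output (4) summarize.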
The output (6) Tests of Between-Subjects Effects shows whether or not the explanatory variable is
significant for the dependent variables. As can be seen, the contribution of the explanatory variable
group to the explanation of the variation of the dependent variables is significant for both of the
dependent variables.
Tests of Between-Subjects Effects

Source            Dependent Variable   Type III Sum of Squares   df   Mean Square   F         Sig.   Partial Eta Squared   Noncent. Parameter   Observed Power(a)
Corrected Model   y1                   103.750(b)                3    34.583        55.333    .000   .912                  166.000              1.000
                  y2                   130.000(c)                3    43.333        28.889    .000   .844                  86.667               1.000
Intercept         y1                   361.250                   1    361.250       578.000   .000   .973                  578.000              1.000
                  y2                   500.000                   1    500.000       333.333   .000   .954                  333.333              1.000
gruppe            y1                   103.750                   3    34.583        55.333    .000   .912                  166.000              1.000
                  y2                   130.000                   3    43.333        28.889    .000   .844                  86.667               1.000
Error             y1                   10.000                    16   .625
                  y2                   24.000                    16   1.500
Total             y1                   475.000                   20
                  y2                   654.000                   20
Corrected Total   y1                   113.750                   19
                  y2                   154.000                   19

a. Computed using alpha = .05
b. R Squared = .912 (Adjusted R Squared = .896)
c. R Squared = .844 (Adjusted R Squared = .815)
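As a sanity check, the F value for y1 in the gruppe row can be reconstructed from the descriptive statistics in output (1b). This is a sketch; the small deviation from 55.333 comes from the rounding of the printed standard deviations:

```python
# One-way ANOVA for y1, built from the group summaries in output (1b).
means = [2.00, 8.00, 3.00, 4.00]   # placebo, begge piller, pille A, pille B
sds   = [0.707, 0.707, 0.707, 1.000]
n     = 5                          # observations per group

grand_mean = sum(means) / len(means)

# Between-groups sum of squares (the gruppe row for y1)
ss_between = n * sum((m - grand_mean) ** 2 for m in means)

# Within-groups sum of squares (the Error row for y1)
ss_within = sum((n - 1) * s ** 2 for s in sds)

df_between = len(means) - 1               # 3
df_within = len(means) * n - len(means)   # 16

F = (ss_between / df_between) / (ss_within / df_within)
print(ss_between)    # 103.75, the Type III SS for gruppe/y1
print(round(F, 1))   # close to the 55.333 in the table
```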
So far it has only been determined that one or more groups are different. Therefore it is relevant to
take a closer look at these differences. In SPSS this is done by means of two by two comparisons,
but as mentioned in section 5.3.1.4, these comparisons are only carried out for each dependent
variable separately. As a consequence these two by two comparisons are only partly relevant and
will not be shown here.
As mentioned in the introduction of this section, a significant MANOVA means that at least one
discriminant function is significant as well. Therefore a discriminant analysis is very often carried
out following the MANOVA. Hereby it is possible to interpret the differences between the groups
by studying the discriminant functions. Please refer to chapter 6 about discriminant analysis.
6. Discriminant analysis
The explanation and interpretation of discriminant analysis presented in this section is based on the following
literature:
- ”Videregående data-analyse med SPSS og AMOS”, Niels Blunch 1. edition 2000, Systime.
Chp. 2, p. 29 – 48, Chp. 3, p. 49-77.
- ”Analyse af markedsdata”, Niels Blunch 2. rev. edition 2000, Systime. Chp. 6, p. 209-220
6.1 Introduction
Discriminant analysis is used for studying the relationship between one nominal scaled dependent
variable and one or more interval scaled explanatory variables (see appendix 1). In other words
discriminant analysis is used instead of regression analysis when the dependent variable is
qualitative.
The main purpose of discriminant analysis is to find the function of the explanatory variables that
discriminates best between two or more groups that are made up of the dependent variable. After
this it is possible to:
- Classify and place the units of the analysis (the respondents, the observations, etc.) in
several previously defined groups.
- Assess the discrimination ability of the explanatory variables.
A distinction is made between simple and multiple discriminant analysis. The difference is that the
simple discriminant analysis only discriminates between two groups, whereas the multiple analysis
discriminates between three or more groups. The group relation for both types is decided through
discriminant functions. In the multiple discriminant analysis there are two or more of these
functions depending on the number of groups and variables.
In this representation it is assumed that the above-mentioned discriminant functions are linear. The
following is an example of the simple discriminant analysis.
As mentioned in the previous section regarding MANOVA, a discriminant analysis is often
conducted as an extension of a significant MANOVA test. In this case a discriminant analysis will
ease the interpretation of the MANOVA. However it should be mentioned that the discriminant
analysis can be performed even without a preceding MANOVA. The following analysis is carried
through under the assumption of a significant MANOVA.
6.2 Example
Data: \\okf-filesrv1\exemp\Spss\Manual\Diskriminantdata.sav
Salmon fishing is a very valuable resource for both Alaska (USA) and Canada. Since it is a limited resource, it is very important that fishermen from both countries observe the fishing quotas. The salmon have a remarkable life cycle. They are born in freshwater, and after approximately 1-2 years they swim out into the ocean, i.e. into saltwater. After a couple of years in the ocean they return to their place of birth to spawn and die. Before this return takes place, the fishermen try to catch the fish. With a view to regulating the catch, there is a desire to identify where the fish come from. The salmon carry an interesting piece of information about their place of birth in the shape of growth rings on their scales. These growth rings, which can be identified both for the time spent in freshwater and in saltwater, are typically smaller on the Alaskan salmon than on the Canadian salmon.
On the basis of this information 50 salmon from Alaska and 50 salmon from Canada are examined. The 100 salmon are measured on the following explanatory variables:
X1 = diameter of growth rings for the first year of the salmon’s life in freshwater (1/100 inches)
X2 = diameter of growth rings for the first year of the salmon’s life in saltwater (1/100 inches)
X3 = gender
Subsequently, a discriminant analysis based on the collected data is performed. The purpose is to find a discriminant function which is capable of classifying where a fish originates from, based on the salmon’s growth rings.
6.3 Implementation of the analysis
The starting point of the discriminant analysis is shown below:
The following dialogue box now appears:
In this dialogue box the desired grouping variable must be marked in the left square and moved to
the Grouping Variable window by clicking the arrow. In this case the grouping variable is country.
The range is defined by clicking the Define Range button so that SPSS can calculate the number of
groups. In this example 1 is specified as minimum and 2 as maximum.
The two variables, Freshwater [x1] and Saltwater [x2], which have been moved to the
Independents window, are the measured variables on the basis of which the classification will be
carried out.
Furthermore, it must be specified whether the Stepwise method or the Independents together
method is desired. Using Stepwise method means that the variables that discriminate best are found
automatically. However, one must be aware of the possible problems with multicollinearity for the
explanatory variables. The Independents together method uses all the variables simultaneously for
the discrimination. In this example the latter method has been chosen, since both variables are to be
included in the discriminant function.
On the side of the dialogue box there are a number of buttons, which are used for specifying how the discriminant analysis is carried out. Statistics, Classify and Save are described below. Method should only be activated if the Stepwise method has been chosen, since this is where the criteria that apply when choosing the independent variables are specified.
6.3.1 Statistics
This button is used for managing the estimation itself and defining which covariance matrices to
display in the output. The mean values for each group can be calculated here as well. Furthermore,
it is possible to carry out a test of whether pooling the covariance matrices for the two groups is
possible. The following dialogue box appears:
The purpose of the individual parts is described as follows:
By activating Means in Descriptives, a calculation of the mean and the standard deviation for each
group of independent variables and for the entire sample takes place.
Univariate ANOVAs tests the hypothesis that the mean value for each independent variable is the
same for each group, i.e. an analysis of variance.
Selecting Box’s M results in a test of the hypothesis that the covariance matrices for each group are equal, i.e. whether it is possible to pool the matrices.
In Function Coefficients it is specified whether Fisher’s method or the unstandardized method is used for the calculation of the classification coefficients.
In Matrices it is specified which matrices to include in the output.
In this example it has been chosen to have Means calculated and to carry out Box’s M test to see if
the covariance matrices can be pooled. The coefficients are calculated by means of Fisher’s
method, since an estimate of the discriminant function itself is desired. Furthermore, an output of
the covariance matrices for each group as well as a single within-groups covariance matrix has been
chosen.
6.3.2 Classify
In the following dialogue box the classification is managed:
The options are as follows:
• Prior Probabilities specify the a priori probability that forms the basis of the classification.
Either the prior probability for group membership can be determined from the observed
group size, or it can be equal for all groups.
• In Use Covariance Matrix3 it can be determined whether the pooled or the separate
covariance matrices should be used. This is normally not determined until Box’s M test has
been carried out.
• By activating Casewise results under Display, the group membership will be displayed for each observation. Selecting Summary table displays the result of the classification as a table showing observed and predicted group membership.
• In Plots various plots can be chosen.
In this example the prior probability will be equal for all groups and the estimation of the
discriminant function will be performed on the basis of separate covariance matrices, since there
is reason to believe that the differences between these are too great to justify a pooled matrix.
Furthermore, it has been chosen to display the classification result.
6.3.3 Save
In brief it is possible to save expected group membership, the probability for group membership and
the calculated discriminant scores. These values will be saved as new variables in the data set.
However, there will be no such new variables in this example.
6.4 Output
When the various settings in the dialogue boxes have been specified, SPSS performs the analysis.
The result is a number of outputs, of which the most relevant ones will be commented on and interpreted below.
Group Statistics

fødested             Mean     Std. Deviation   Valid N (unweighted)   Valid N (weighted)
USA      ferskvand   98.38    16.143           50                     50.000
         saltvand    429.66   37.404           50                     50.000
Canada   ferskvand   137.46   18.058           50                     50.000
         saltvand    366.62   29.887           50                     50.000
Total    ferskvand   117.92   26.001           100                    100.000
         saltvand    398.14   46.240           100                    100.000
3 This choice only has a consequence for the classification of the observations, and not for the calculation of the discriminant function, since equal covariance matrices are assumed within the groups. A violation of this assumption means that SPSS cannot perform the calculation.
(1) Group Statistics shows mean values, standard deviations and the number of observations for
both groups for each of the explanatory variables. Furthermore, results for the entire data set
are shown.
The figures below are (2) the pooled covariance matrix and (3) the covariance matrices for the two groups. The two group matrices appear to be significantly different from each other, so it is unlikely that the estimation of the discriminant function can be based on the pooled covariance matrix. This is verified by (4) Box’s M test below.
Pooled Within-Groups Matrices(a)

                          ferskvand   saltvand
Covariance   ferskvand    293.349     -27.294
             saltvand     -27.294     1146.173

a. The covariance matrix has 98 degrees of freedom.
Covariance Matrices

fødested               ferskvand   saltvand
USA      ferskvand     260.608     -188.093
         saltvand      -188.093    1399.086
Canada   ferskvand     326.090     133.505
         saltvand      133.505     893.261
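The pooled matrix in output (2) is simply the weighted average of the two group matrices in output (3), with weights (n - 1). A sketch, assuming n = 50 per group as shown in output (1):

```python
import numpy as np

# Separate covariance matrices for the two groups, from output (3)
S_usa = np.array([[260.608, -188.093],
                  [-188.093, 1399.086]])
S_can = np.array([[326.090, 133.505],
                  [133.505, 893.261]])

n1 = n2 = 50  # group sizes from output (1)

# Pooled within-groups covariance: ((n1-1)*S1 + (n2-1)*S2) / (n1+n2-2)
S_pooled = ((n1 - 1) * S_usa + (n2 - 1) * S_can) / (n1 + n2 - 2)
print(S_pooled.round(3))  # reproduces the pooled matrix in output (2)
```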
(4) Box’s M test below rejects the null hypothesis of equal covariance matrices. This is the
argument behind the former statement about performing the analysis on the basis of separate
covariance matrices.
Test Results

Box's M             10.938
F       Approx.     3.565
        df1         3
        df2         1728720
        Sig.        .013
Tests null hypothesis of equal population covariance matrices.
In the part of the output called Summary of Canonical Discriminant Functions it is relevant to look
at (5) Wilks’ Lambda.
Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .321            110.223      2    .000
Wilks’ Lambda displays the capability of the discriminant function to discriminate. As the output
shows, the discriminant function is significant at all α-levels.
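The chi-square in the table can be reproduced from Wilks’ Lambda via Bartlett’s approximation, χ² = −(n − (p + g)/2 − 1)·ln Λ, with n = 100 observations, p = 2 explanatory variables and g = 2 groups. A quick sketch:

```python
import math

n, p, g = 100, 2, 2    # observations, explanatory variables, groups
wilks = 0.321          # Wilks' Lambda from the output above

# Bartlett's chi-square approximation for testing the discriminant function
chi_square = -(n - (p + g) / 2 - 1) * math.log(wilks)
df = p * (g - 1)

print(round(chi_square, 1), df)  # 110.2 on 2 degrees of freedom
```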
From (6), the unstandardized canonical discriminant function coefficients, it appears that it is the mutual relationship between the two variables that is of interest to the analysis.
The discriminant function can be written as follows: D = 1.924 + 0.045 x1 – 0.018 x2.
The standardized discriminant function is displayed in (7) Standardized Canonical Discriminant
Function Coefficients. It has been calculated on the basis of a standardization of the explanatory
variables.
Standardized Canonical Discriminant Function Coefficients

            Function
            1
ferskvand   .764
saltvand    -.611
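The standardized coefficients can be obtained from the unstandardized ones in (6) by multiplying each coefficient by the pooled within-groups standard deviation of its variable, taken from output (2). A sketch; the small deviations from .764 and -.611 are due to the rounding of the coefficients printed in (6):

```python
import math

# Unstandardized coefficients from (6) and pooled within-groups
# variances from the Pooled Within-Groups Matrices output (2)
b_fresh, b_salt = 0.045, -0.018
var_fresh, var_salt = 293.349, 1146.173

# Standardize: multiply each coefficient by the within-groups std. dev.
b_fresh_std = b_fresh * math.sqrt(var_fresh)
b_salt_std = b_salt * math.sqrt(var_salt)

print(round(b_fresh_std, 2), round(b_salt_std, 2))  # close to .764 and -.611
```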
After this there are a number of outputs, which will not be commented on here.
The part of the output called Classification contains relevant outputs, which indicate how good the
discriminant function is at discriminating. (8) Classification Function Coefficients displays Fisher’s
linear discriminant functions. These functions are calculated on the basis of the previously
mentioned unstandardized canonical discriminant function. From the output below it appears that
there is a function for both USA and Canada.
By looking at the difference between the two functions, the following discriminant function can be written:
Classification Function Coefficients

             fødested
             USA        Canada
ferskvand    .371       .499
saltvand     .384       .332
(Constant)   -101.377   -95.835

Fisher's linear discriminant functions
Y = -5.54121 – 0.12839 x1 + 0.05194 x2
In this function the values for the thickness of the growth rings for the individual fish can be
substituted for x1 and x2. If the value of the function is positive, the fish is classified as American,
and if it is negative, the fish is classified as Canadian. If there are more than two groups, Fisher’s
linear functions are used separately for each group. After this the function values are calculated for
each function, and the units of the analysis are assigned to the group with the highest function
value.
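As an illustration of the rule, the function Y can be applied to two hypothetical fish whose measurements happen to equal the group means from output (1):

```python
def classify(x1, x2):
    """Classify a salmon from its freshwater (x1) and saltwater (x2)
    growth-ring diameters, using the difference between Fisher's two
    classification functions derived above."""
    y = -5.54121 - 0.12839 * x1 + 0.05194 * x2
    return "USA" if y > 0 else "Canada"

# Hypothetical fish at the group means from output (1)
print(classify(98.38, 429.66))   # an average American salmon -> USA
print(classify(137.46, 366.62))  # an average Canadian salmon -> Canada
```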
(9) Classification results show the classification of the fish on the basis of the Y function above. It
appears that 93% of the fish are classified correctly by means of the function. Thus 7% of the fish
are classified incorrectly, which deserves a comment. In that connection it should be noted that the
probability of incorrectly classifying Canadian fish as American is lower than the other way around.
This is due to the fact that the standard deviation for American salmon is larger than for Canadian salmon (see output (3)). It is important to be aware of this matter when performing a linear discriminant analysis.
Classification Results(a)

                           Predicted Group Membership
           fødested        USA     Canada    Total
Original   Count   USA     46      4         50
                   Canada  3       47        50
           %       USA     92.0    8.0       100.0
                   Canada  6.0     94.0      100.0

a. 93.0% of original grouped cases correctly classified.
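The 93.0% hit rate stated under the table follows directly from the counts:

```python
# Confusion matrix from the Classification Results output:
# keys are (observed group, predicted group)
counts = {("USA", "USA"): 46, ("USA", "Canada"): 4,
          ("Canada", "USA"): 3, ("Canada", "Canada"): 47}

correct = counts[("USA", "USA")] + counts[("Canada", "Canada")]
total = sum(counts.values())
hit_rate = correct / total
print(f"{hit_rate:.1%}")  # 93.0%
```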
Another important matter to be aware of is that the discriminant function will often classify the
units of the analysis better than new units, since the function has been calculated from the sample
data. As a consequence, a verification of the discriminant function is often required before it is
used.
7. Profile analysis
The explanation and interpretation of Profile Analysis is made based on the following literature:
- “SPSS, Advanced Models 10.0”, chp. 1 p. 1 – 14 and chp. 12 p. 95 – 108
7.1 Introduction
Profile analysis is useful for illustrating and visualizing a number of problems. The analysis is often employed to compare averages in statistical models by the use of so-called profile plots. A profile plot is a line in a co-ordinate system where every point corresponds to the average of a dependent variable (corrected for correlation) at a given factor level. A line can be drawn for every factor level. It is the comparison of these lines that is interesting.
It should be noted that there is a close connection between MANOVA and the analysis that is made on profile plots. The connection lies in the fact that both involve several interval scaled dependent variables and one or more nominal scaled explanatory variables.
7.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\Profildata.sav
The starting point for this example is the data set above. The data set is a result of a survey of the
way men and women experience their marriage. The following variables are registered in the data
set:
Dependent variables: q1 (contribution to the marriage)
q2 (the course of the marriage)
q3 (the degree of passion in the marriage)
q4 (the degree of friendship in the marriage)
Explanatory variables: Spouse (man/woman)
With the above-mentioned survey as a starting point the objective is to find out whether there is a
difference in the profiles that men and women create with reference to the dependent variables.
7.3 Implementation of the analysis
The first thing to do is to plot the profiles in a co-ordinate system. It should be noted that it is possible to choose profile plots in connection with GLM multivariate analyses (MANOVA). However, this is not recommended, since that approach does not allow for multiple plots. Instead the following approach can be adopted:
This results in a dialogue box, which should be filled out as follows:
This enables the user to plot several variables in the same graph as well as a display of summary
measures for the individual variables. After clicking Define the resulting dialogue box is filled out
as shown below:
Lines Represent specifies which lines are included in the graph. Category Axis specifies the
variable, which is used for categorization.
To get the profiles plotted, it is now necessary to have the questions displayed according to the profiles and not the other way around, as is currently the case. This is obtained by opening the graph editor (by double-clicking the graph) and selecting Properties in the Edit menu. When this is done, the depiction of the explanatory variable is changed by selecting Group instead of X Axis. The questions will now be grouped by the outcomes of the explanatory variable instead of the dependent variables. See below.
[Bar chart: mean values (Count axis, 0-5) of q1-q4 shown separately for Husband and Wife; the plotted means range from 3.833 to 4.633.]
The profiles have now been plotted, and a visual comparison can be made. It appears that the
profiles are likely to be parallel and perhaps even congruent. Whether the profiles are horizontal is
more doubtful. The following sections address the following issues:
• Are the profiles parallel? (7.3.1) ⇒ Both genders weight the factors in the same way relative to each other, but there may be a difference in level.
• Are the profiles congruent? (7.3.2) ⇒ Both genders attach the same weight to the factors, i.e. the profiles coincide.
• Are the profiles horizontal? (7.3.3) ⇒ All factors have equal influence.
7.3.1 Are the profiles parallel?
The starting point for all three issues mentioned above is shown below:
The resulting dialogue box is filled out by moving the dependent variables to the Dependent
Variables window and the explanatory variable to the Fixed Factor(s) window as shown below:
By clicking the Model button it is possible to specify the model for the current analysis
(parallelism). The dialogue box is filled out as follows:
After this Continue is selected.
In order to perform the current analysis it is necessary to change the method of the SPSS
programme slightly. This is done by means of the syntax window.
Since the model has been almost completely set up by SPSS already, it is a good idea to make use
of that in the syntax. The conversion of the model to the syntax is done by clicking the Paste button.
After that two additional lines should be added as shown. LMATRIX defines the two levels to be
compared in the spouse variable. MMATRIX defines the C matrix. A thorough description of the theory behind the matrices used is beyond the scope of this manual. Instead, please refer to SPSS Advanced Models, which is located at the IT instructors’ office in room H15.
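As a sketch of what the MMATRIX accomplishes, the C matrix can be thought of as a successive-difference contrast matrix over the four questions (the exact coefficients in the pasted syntax may be written differently). Parallel profiles then mean that the contrasted means are equal for the two spouse groups; the profile values below are purely illustrative:

```python
import numpy as np

# Successive-difference contrast matrix C (3 x 4): each row compares
# one question with the next, and each row's coefficients sum to zero.
C = np.array([[-1,  1,  0,  0],
              [ 0, -1,  1,  0],
              [ 0,  0, -1,  1]])

# Hypothetical mean profiles for the two groups (illustrative values only)
husband = np.array([3.9, 4.3, 3.8, 4.5])
wife = np.array([4.0, 4.4, 3.9, 4.6])

# The parallelism hypothesis states C @ husband == C @ wife, i.e. the
# two profiles have the same "slopes" between adjacent questions.
print(C @ husband)
print(C @ wife)
print(np.allclose(C @ husband, C @ wife))  # parallel in this illustration
```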
To make SPSS run the analysis, the syntax should be marked (i.e. by dragging the mouse over it while holding down the left mouse button) and then run from the Run menu in the menu bar.
7.3.1.1 Output
The procedure described above results in a number of outputs. One of these is an output known from the MANOVA analysis (see chapter 5 for an interpretation of this). In the part of the output named Custom Hypothesis Tests #1 an output is found which indicates whether or not the null hypothesis regarding parallelism is true. This is (1) Multivariate Test Results shown below.
From the output it appears that the null hypothesis is accepted at α = 0.05, since the p-value is
0.063. The conclusion is therefore that the profiles are parallel. However, this conclusion is
sensitive to the choice of the level of significance.
Multivariate Test Results

                     Value   F          Hypothesis df   Error df   Sig.
Pillai's trace       .121    2.580(a)   3.000           56.000     .063
Wilks' lambda        .879    2.580(a)   3.000           56.000     .063
Hotelling's trace    .138    2.580(a)   3.000           56.000     .063
Roy's largest root   .138    2.580(a)   3.000           56.000     .063

a. Exact statistic
7.3.2 Are the profiles congruent?
For this analysis the previous syntax can be used again with minor changes. Instead of inserting the
C-matrix the 1-matrix should be inserted. The sum of the coefficients must be 1, which explains
why each has been multiplied by 0.25. The complete syntax is as follows:
The syntax is run in the same way as the previous.
7.3.2.1 Output
As in the previous section the relevant output is found in Custom Hypothesis Tests #1. This output
indicates whether the null hypothesis of congruent profiles is true. In this particular example SPSS
only uses the univariate result. This means that the correlation between the dependent variables is
not taken into account.
Test Results

Transformed Variable: T1

Source     Sum of Squares   df   Mean Square   F       Sig.
Contrast   2343.750         1    2343.750      1.533   .221
Error      88687.500        58   1529.095
(2) Test Results shows that the null hypothesis of congruent profiles cannot be rejected at α = 0.05,
since the p-value is 0.221. Therefore, it can be concluded that the profiles are congruent.
7.3.3 Are the profiles horizontal?
For this analysis the C-matrix from section 7.3.1 is reinserted, and an intercept is added as shown
below:
The syntax is run in the same way. For this analysis to make any sense it is necessary for all
variables to be scaled similarly.
7.3.3.1 Output
From (3) Multivariate Test Results shown below it appears that the null hypothesis of horizontal
profiles can be rejected at α = 0.05, since the p-value is 0.000. As a consequence, it can be
concluded that the mean values for the four dependent variables are not equal.
Multivariate Test Results

                     Value   F          Hypothesis df   Error df   Sig.
Pillai's trace       .305    8.188(a)   3.000           56.000     .000
Wilks' lambda        .695    8.188(a)   3.000           56.000     .000
Hotelling's trace    .439    8.188(a)   3.000           56.000     .000
Roy's largest root   .439    8.188(a)   3.000           56.000     .000

a. Exact statistic
8. Factor analysis
The following presentation and interpretation of factor analysis is based on:
- ”Videregående data-analyse med SPSS og AMOS”, Niels Blunch, 1. edition 2000, Systime. Chp. 6, p. 124-155.
- ”Analyse af markedsdata”, Niels Blunch, 2. rev. edition 2000, Systime. Chp. 3, p. 87-118.
- Hair et al. (2006): Multivariate Analysis, 6th Ed., Pearson, Chp. 3.
8.1 Introduction
Factor analysis can be divided into component analysis as well as exploratory and confirmatory factor analysis. The three types of analysis can be used on the same data set but build on different mathematical models. Component analysis and exploratory factor analysis nevertheless produce relatively similar results. In this manual only component analysis is described.
In component analysis the original variables are transformed into the same number of new variables, so-called principal components, by a linear transformation. The principal components have the characteristic that they are uncorrelated, which is why they are suitable for further data processing such as regression analysis. Furthermore, the principal components are calculated so that the first component carries the bulk of the information (explains the most variance), the second component carries the second-most information, and so forth. For this reason component analysis is often used to reduce the number of variables/components, so that the last components with the least information are disregarded. The task is then to discover what the new components correspond to, which is exemplified below.
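The transformation just described can be sketched numerically. Below is a minimal illustration on synthetic data using NumPy (the actual analysis in this chapter is of course carried out in SPSS): the resulting components are uncorrelated, and the explained variance decreases from the first component onward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 observations of 3 correlated variables
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])

# Standardize and take the eigendecomposition of the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component scores: the principal components are uncorrelated, and the
# first one explains the largest share of the total variance
scores = Z @ eigvecs
print(eigvals / eigvals.sum())             # explained variance, descending
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```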
8.2 Example
Data: \\okf-filesrv1\Exemp\Spss\Manual\Faktoranalysedata.sav
The following example is based on an evaluation of a course at the Aarhus School of Business. 162 students attending the lectures were asked to fill out a questionnaire containing various questions regarding the assessment of the course. Each assessment was based on a five-point Likert scale, where 1 is “No, not at all” and 5 is “Yes, absolutely”.
The questions in the survey were (label name in parenthesis):
Has this course met your expectations? (Met expectations)
Was this course more difficult than your other courses? (More difficult than other courses)
Was there a reasonable relationship between the amount of time spent and the benefit derived? (Relationship
time/Benefit)
Have you found the course interesting? (Course Interesting)
Have you found the textbooks suitable for the course? (Textbooks suitable)
Was the literature inspiring? (Literature inspiring)
Was the curriculum too extensive? (Curriculum too extensive)
Has your overall benefit from the course been good? (Overall benefit)
The original data consists of more assessments, but these have not been included in the example.
The purpose is now, based on the survey, to carry out a component analysis with a view to reducing the number of assessment criteria to a smaller number of components. Furthermore, the new components should be examined with a view to naming them.
8.3 Implementation of the analysis
Component analysis is a method that is used exclusively for uncovering latent factors from manifest
variables in a data set. Since these fewer factors usually form the basis of further analysis, the component
analysis is to be found in the following menu:
This results in the dialogue box shown below. The variables to be included in the component analysis are
marked in the left-hand window, where all numeric variables in the data set are listed, and moved to the
Variables window by clicking the arrow. In this case all the variables are chosen.
It has now been specified which variables SPSS should base the analysis on. However, a more definite
method for performing the component analysis has yet to be chosen. This is done by means of the
Descriptives, Extraction, Rotation, Scores and Options buttons. These are described individually below.
8.3.1 Descriptives
By clicking the Descriptives button the following dialogue box appears:
In short the purpose of the individual options is as follows:
Statistics
• Univariate descriptives includes the mean, standard deviation and the number of valid observations
for each variable.
• Initial solution includes initial communalities, eigenvalues, and the percentage of variance explained.
Correlation Matrix
• Here it is possible to get information about the correlation matrix, among other things the
appropriateness of performing factor analysis on the data set.
In this example Initial solution is chosen, because a display of the explained variance for the suggested
factors of the component analysis is desired. At the same time this is the most widely used method.
Anti-Image, KMO and Bartlett’s test of sphericity are checked as well, to analyze whether a factor analysis
is appropriate for the given data set.
With regard to the tests selected above, it may be useful to add a few comments. The Anti-Image provides
the negative values of the partial correlations between the variables. The off-diagonal Anti-Image values
ought therefore to be low, indicating that the variables do not differ too much from one another. KMO and
Bartlett likewise provide measures of appropriateness, as previously mentioned. As a rule of thumb, KMO
ought to attain values of at least 0.5 and preferably above 0.7 to indicate that the data is suitable for a factor
analysis. Equivalently the Bartlett’s test should be significant, indicating that significant correlations exist
between the variables.
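These two rules of thumb can also be checked by hand. The sketch below is hypothetical Python code (NumPy and SciPy assumed; the small correlation matrix is made up for illustration, and only the sample size of 162 mirrors the example). Bartlett's chi-square statistic is computed from the determinant of the correlation matrix, and KMO from the partial correlations obtained via the inverse of that matrix.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity: H0 = the variables are uncorrelated."""
    p = R.shape[0]
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy (overall)."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)          # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)
    r2, p2 = (R[off] ** 2).sum(), (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

# Hypothetical correlation matrix of three moderately correlated items
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
stat, pval = bartlett_sphericity(R, n=162)   # 162 respondents as in the example
print(f"chi2 = {stat:.1f}, p = {pval:.4g}, KMO = {kmo(R):.3f}")
```

A KMO value near 1 and a significant Bartlett statistic would, as in the text, speak for carrying out the factor analysis.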
8.3.2 Extraction
By clicking the Extraction button the following dialogue box appears:
This is where the component analysis in itself is managed. In this example it has been chosen to use the
Principal components method for the component analysis. This is chosen in the Method drop-down box.
Since the individual variables of this example are scaled very differently, it has been chosen to base the
analysis on the Correlation matrix, i.e. a standardization is carried out.
A display of the un-rotated factor solution is wanted in order to compare this with the rotated solution.
Therefore, Unrotated factor solution is activated in Display.
Since the last components do not explain very much of the variance in a data set, it is standard practice to
ignore these. This results in a bit of lost information (variance) in the data set, but in return a simpler
output is obtained for further analysis. In addition, it makes the interpretation of the data easier.
The excluded components are treated as “noise” in the data set. The important question is just how many
components to exclude without causing too much loss of information. The following rules of thumb for
choosing the right number of components for the analysis apply:
• Scree plot: By selecting this option in Display a graphical illustration of the variance of the
components appears. A typical feature of this graph is a break on the curve. This curve break forms
the basis of a judgement of the right number of components to include.
• Kaiser’s criterion: Components with an eigenvalue of more than 1 are included. This can be
observed from the Total Variance Explained table or the scree plot shown in section 8.4. However,
these are only guidelines. The actual number of chosen factors is subjective, and it depends strongly
on the data set and the characteristics of the further analysis.
If the user wants to carry out the component analysis based on a specific number of factors, this number can
be specified in Number of factors in Extract. Last but not least it is possible to specify the maximum number
of iterations. Default is 25. This option is not relevant for this example, but for the Maximum Likelihood
method it could be relevant.
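Kaiser's criterion is easy to apply mechanically. As a sketch (hypothetical Python/NumPy code), using the eight eigenvalues that appear later in the Total Variance Explained table of section 8.4:

```python
import numpy as np

# Eigenvalues as they appear in the Total Variance Explained table
eigenvalues = np.array([3.499, 1.114, 1.031, 0.755, 0.547, 0.401, 0.383, 0.270])

# Kaiser's criterion: retain components with an eigenvalue above 1
n_retained = int((eigenvalues > 1).sum())
explained = eigenvalues / eigenvalues.sum()    # proportion of total variance
cumulative = explained.cumsum()

print(f"Retain {n_retained} components, "
      f"explaining {100 * cumulative[n_retained - 1]:.1f}% of the variance")
```

Note that the eigenvalues sum to 8, the number of standardized variables, so each eigenvalue divided by 8 is the proportion of total variance that component explains.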
8.3.3 Rotation
Clicking the Rotation button results in the following dialogue box:
In brief, rotation of the solution is a method where the axes of the original solution are rotated in order to
obtain an easier interpretation of the found components. In other words, it ensures that the individual
variables correlate highly with a small proportion of the components, while correlating weakly with
the remaining components.
In this example Varimax is chosen as the rotation method, since it ensures that the components of the rotated
solution are uncorrelated. The remaining methods will not be further described here.
A display of the rotated solution has been chosen in Display. Loading plots enables a display of the solution
in a three-dimensional plot of the first three components. If the solution consists of only two components, the
plot will be two-dimensional instead. With the eight assessment criteria in this example, a three-dimensional
plot looks rather confusing, and it has therefore been omitted here.
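For readers who want to see what the rotation does numerically, a minimal varimax implementation is sketched below (hypothetical Python/NumPy code using the standard SVD-based algorithm; SPSS additionally applies Kaiser normalization, which this sketch omits). Because the rotation matrix is orthogonal, the communalities of the variables are unchanged by the rotation.

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Varimax rotation of a loading matrix L (variables x components)."""
    p, k = L.shape
    R = np.eye(k)                     # accumulated orthogonal rotation
    var_old = 0.0
    for _ in range(max_iter):
        A = L @ R
        # Gradient of the varimax criterion
        G = L.T @ (A ** 3 - A @ np.diag((A ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                    # nearest orthogonal matrix
        if s.sum() - var_old < tol:
            break
        var_old = s.sum()
    return L @ R, R

# Hypothetical unrotated loadings for four variables on two components
L = np.array([[0.8, 0.3], [0.7, 0.4], [0.2, 0.9], [0.1, 0.8]])
rotated, R = varimax(L)
print(np.round(rotated, 3))
```

After rotation each variable loads mainly on one component, which is exactly the "clearer picture" the manual refers to.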
8.3.4 Scores
Now the solution of the component analysis has been rotated, which should have resulted in a clearer picture
of the results. However, there are still two options to bear in mind before the analysis is carried out. By
selecting Scores it is possible to save the factor scores, which is sensible if they are to be used for other
analyses such as profile analysis, etc. These factor scores will be added to the data set as new variables with
default names provided by SPSS. In this example this option has not been chosen, as no further analysis
including these scores is to be performed.
8.3.5 Options
Treatment of missing values in the data set is managed in the Options dialogue box shown below:
Missing values can be treated as follows:
• Exclude cases listwise excludes observations that have missing values for any of the variables.
• Exclude cases pairwise excludes observations with missing values for either or both of the pair of
variables in computing a specific statistic.
• Replace with mean replaces missing values with the variable mean.
In this example Exclude cases listwise has been chosen in order to exclude observations with missing values.
With regard to the output of the component analysis there are two options:
• Sorted by size sorts the factor loading and structure matrices so that variables with high loadings on the
same factor appear together. The loadings are sorted in descending order.
• Suppress absolute values less than makes it possible to control the output so that coefficients with
absolute values less than a specified value (between 0 and 1) are not shown. This option has no
effect on the analysis, but ensures a good overview of the variables in their respective factors. In this
analysis it has been chosen that no values below 0.1 are shown in the output.
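The effect of these two display options can be mimicked in a few lines (a hypothetical Python/NumPy sketch with made-up loadings; suppressed values are shown as nan). Sorting groups each variable under the component it loads highest on; suppression only affects the display, never the analysis.

```python
import numpy as np

# Hypothetical rotated loadings (variables x components)
loadings = np.array([[0.854, 0.05, 0.02],
                     [0.226, 0.871, 0.04],
                     [-0.226, 0.06, 0.799]])

# Sorted by size: order variables by the component they load highest on,
# then by descending magnitude within each group
dominant = np.abs(loadings).argmax(axis=1)
order = np.lexsort((-np.abs(loadings).max(axis=1), dominant))

# Suppress absolute values less than 0.1 (display only; analysis unchanged)
shown = np.where(np.abs(loadings) < 0.1, np.nan, loadings)[order]
print(shown)
```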
8.4 Output
When the various settings in the dialogue boxes have been specified, SPSS performs the component analysis.
After clicking OK a rather comprehensive output is produced, of which the most relevant outputs are
commented below.
KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy           ,791
Bartlett's Test of Sphericity   Approx. Chi-Square     413,377
                                df                          28
                                Sig.                      ,000
Anti-image Matrices
(Columns follow the same variable order as the rows.)

Anti-image Covariance
Faget relevant                  ,497   ,070  -,172  -,221  -,032   ,080  -,005  -,076
Faget svært                     ,070   ,876  -,025  -,079  -,010   ,114  -,152   ,050
Tid ift. udbytte               -,172  -,025   ,579  -,009   ,019  -,061   ,081  -,147
Faget spændende og interessant -,221  -,079  -,009   ,499   ,005  -,111   ,027  -,104
Egnede lærebøger               -,032  -,010   ,019   ,005   ,526  -,263  -,087  -,108
Inspirerende litteratur         ,080   ,114  -,061  -,111  -,263   ,467   ,084  -,054
Pensum for stort               -,005  -,152   ,081   ,027  -,087   ,084   ,886   ,019
Samlede udbytte                -,076   ,050  -,147  -,104  -,108  -,054   ,019   ,486

Anti-image Correlation
Faget relevant                  ,763a  ,106  -,321  -,443  -,063   ,166  -,008  -,155
Faget svært                     ,106   ,728a -,036  -,120  -,015   ,178  -,172   ,077
Tid ift. udbytte               -,321  -,036   ,843a -,017   ,035  -,117   ,114  -,278
Faget spændende og interessant -,443  -,120  -,017   ,806a  ,010  -,231   ,041  -,211
Egnede lærebøger               -,063  -,015   ,035   ,010   ,746a -,531  -,127  -,214
Inspirerende litteratur         ,166   ,178  -,117  -,231  -,531   ,735a  ,130  -,114
Pensum for stort               -,008  -,172   ,114   ,041  -,127   ,130   ,751a  ,029
Samlede udbytte                -,155   ,077  -,278  -,211  -,214  -,114   ,029   ,875a

a. Measures of Sampling Adequacy (MSA)
SPSS has now generated the tables for the KMO and Bartlett’s test as well as the Anti-Image matrices. KMO
attains a value of 0.791, which satisfies the criteria mentioned above. Equivalently, the Bartlett’s test attains
a probability value of 0.000. Similarly, high correlations are primarily found on the diagonal of the
Anti-Image correlation matrix (marked with an a). This confirms our hypothesis of an underlying structure
of the variables.
Total Variance Explained
            Initial Eigenvalues             Extraction Sums of Squared Loadings   Rotation Sums of Squared Loadings
Component   Total  % of Var.  Cumul. %      Total  % of Var.  Cumul. %            Total  % of Var.  Cumul. %
1           3,499  43,740      43,740       3,499  43,740      43,740             2,526  31,574      31,574
2           1,114  13,922      57,662       1,114  13,922      57,662             1,853  23,167      54,741
3           1,031  12,887      70,549       1,031  12,887      70,549             1,265  15,807      70,549
4            ,755   9,434      79,983
5            ,547   6,835      86,818
6            ,401   5,013      91,831
7            ,383   4,792      96,624
8            ,270   3,376     100,000
Extraction Method: Principal Component Analysis.
As can be seen from the output shown above, SPSS has produced the table Total Variance Explained, which
takes account of both the rotated and unrotated solution.
(1) Initial Eigenvalues displays the calculated eigenvalues as well as the explained and accumulated variance
for each of the 8 components.
(2a) Extraction Sums of Squared Loadings displays the components, which satisfy the criterion that has been
chosen in section Extraction (Kaiser’s criterion was chosen in section 8.3.2). In this case there are three
components with an eigenvalue above 1. These three components together explain 70.549% of the total
variation in the data. The individual contributions are 43.740%, 13.922%, 12.887% of the variation for
component 1, 2 and 3 respectively. These are the results for the un-rotated solution.
In (2b) Rotation Sums of Squared Loadings similar information to that of (2a) can be found, except that these
are the results for the Varimax rotated solution. As shown the sum of the three variances is the same both
before and after the rotation. However, there has been a shift in the relationship between the three
components, as they contribute more equally to the variation in the rotated solution.
The Scree plot below is an illustration of the variance of the principal components:
[Scree Plot: eigenvalue (y-axis, 0 to 4) plotted against component number (x-axis, 1 to 8)]
After the inclusion of the third component the factors no longer have eigenvalues above 1, and consequently
the curve flattens. Normally the Scree plot will exhibit a break on the curve, which confirms how many
components to include. The graphical depiction could therefore be clearer in the current example; however,
it is decided to include the three components that satisfy Kaiser’s criterion in the further analysis. It is therefore
reasonable to treat the remaining components as “noise”. This agrees with the previous output of the
explained variance in the table Total Variance Explained.
Component Matrix(a)
                                      Component
                                 1       2       3
Samlede udbytte                 ,816
Faget spændende og interessant  ,762    ,286   -,111
Inspirerende litteratur         ,730   -,257    ,430
Faget relevant                  ,723    ,330   -,328
Tid ift. udbytte                ,722    ,187   -,278
Egnede lærebøger                ,672    ,592
Faget svært                    -,340    ,710
Pensum for stort               -,331    ,547    ,540
Extraction Method: Principal Component Analysis.
a. 3 components extracted.
(Loadings with absolute values below 0.1 are suppressed.)
(3a) Component Matrix displays the principal component loadings for the un-rotated solution. This table
shows the coefficients of the variables in the un-rotated solution. For example it can be observed that the
correlation between the variable Overall benefit and component 1 is 0.816, i.e. very high. The un-rotated
solution does not form an optimal picture of the correlations, however. Therefore, it is a good idea to rotate
the solution in hope of clearer results.
(3b) Rotated Component Matrix displays the principal component loadings in the same way, but as the name
reveals this is for the rotated solution.
Rotated Component Matrix(a)
                                      Component
                                 1       2       3
Faget relevant                  ,854
Faget spændende og interessant  ,771    ,282
Tid ift. udbytte                ,764    ,153   -,161
Samlede udbytte                 ,669    ,462   -,123
Egnede lærebøger                ,226    ,871
Inspirerende litteratur         ,261    ,819   -,214
Pensum for stort               -,226            ,799
Faget svært                    -,303            ,731
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.
(Loadings with absolute values below 0.1 are suppressed.)
As previously mentioned, the purpose of rotating the solution is to make some variables correlate highly with
one of the components – i.e. to enlarge the large coefficients (loadings) and to reduce the small coefficients
(loadings). Whether a variable is highly or weakly correlated is subjective, but one rule of thumb is that
correlations below 0.4 are considered low.
By means of output (3b) Rotated Component Matrix it is possible to try to join the most similar assessment
criteria in three different groups:
• Component 1: For this group the variables Met expectations, Relationship time/Benefit, Course
Interesting and Overall benefit are important. Therefore it will be appropriate to categorize
component 1 as course benefit.
• Component 2: The variable Textbooks suitable and Literature inspiring are correlating highly with
component 2 and therefore it is named quality of the literature.
• Component 3: The variables More difficult than other courses and Curriculum too extensive both have
loadings above 0.4 on the third component. It could thus be named difficulty of the course.
This concludes the example. It is worth mentioning that if these results were to be used in later analyses, all
factor scores should be used. As mentioned, these scores can be calculated by selecting the option called
Scores. SPSS will then calculate a score for each respondent based on each of the components.
9. Multidimensional scaling (MDS)
The explanation and interpretation of multidimensional scaling presented in this section is based on
the following literature:
• ”Analyse af markedsdata”, Niels Blunch, 2nd rev. edition 2000, Systime, chp. 4, pp. 136-157
• ”Marketing Research – An Applied Orientation”, Naresh K. Malhotra, 3rd edition 1999, chp. 21, pp. 633-646
• “SPSS Categories® 10.0”, Jacqueline J. Meulman, Willem J. Heiser, SPSS Inc. 1999, chp. 1, p. 13; chp. 13, p. 193
• “Multidimensional skalering og conjoint analyse – et produktudviklingsværktøj”, Hans Jørn Juhl, Internt undervisningsmateriale H nr. 151, Institut for informationsbehandling 1992, Handelshøjskolen i Århus, chp. 2 & 4
9.1 Introduction
The purpose of multidimensional scaling is to find a structure in a number of objects or cases based
on the distances between them. This is done by placing the objects/units of the analysis in a
dimensional plane (normally two- or three-dimensional) based on a proximity-matrix. The objects
are placed in the plane based on information about the differences between them (the differences
may be disclosed by asking a number of respondents).
In the literature a distinction is made between two different types of multidimensional scaling:
metric and non-metric MDS. For non-metric as well as metric MDS the starting point is a square
proximity matrix. For metric MDS it is a requirement that the data is on interval or ratio level,
which means that the proximity matrix can be a covariance or a correlation matrix. Non-metric
MDS is used when the data is ordinal or nominal scaled. An example could be a matrix of pairwise
comparisons or a transformation of rank values.
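As a brief illustration of the distinction, non-metric MDS on a precomputed dissimilarity matrix can be run with scikit-learn (a hypothetical sketch; scikit-learn is assumed to be available, the four-object dissimilarity matrix is made up, and scikit-learn's SMACOF algorithm differs from the ALSCAL procedure SPSS uses):

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical symmetric dissimilarity matrix for four objects:
# objects 0 and 1 are similar, 2 and 3 are similar, the pairs are far apart
D = np.array([[0, 1, 5, 6],
              [1, 0, 5, 6],
              [5, 5, 0, 2],
              [6, 6, 2, 0]], dtype=float)

# metric=False requests non-metric (ordinal) MDS, as used in this chapter
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
coords = mds.fit_transform(D)
print(coords.shape)
```

In the resulting two-dimensional configuration, objects with small dissimilarities end up close together, which is exactly what the soft-drink example below exploits.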
9.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\MDS.sav
In a test of beverages two respondents have been asked to compare ten different soft drinks. After
the consumption of the drinks in random order the respondents were asked to express their opinion4
about each single soft drink ranking them on a 10-point scale. Small values expressed a large degree
of homogeneity and large values expressed a small degree of homogeneity (=distance). The
following soft drinks were rated: Coca-Cola, Pepsi, Fanta, Afri-Cola, Sprite, Bluna, Sinalco, Club-
Cola, Red Bull and Mezzo-Mix5.
Because this example deals with ordinal data, a non-metric multidimensional scaling is carried out.
9.3 Implementation of the analysis
The multidimensional scaling is to be found in the following menu:
Following this action the following dialogue box appears:
4 Pairwise comparison.
5 The survey is fictitious and is taken from the following publication: “Multidimensional Skalierung – Beispieldatei zur Datenanalyse”, Lehrstuhl für empirische Wirtschafts- und Sozialforschung, Fachbereich Wirtschaftswissenschaft, BUGH Wuppertal 2001. Bergische Universität Gesamthochschule Wuppertal.
In this dialogue box the desired variables for the multidimensional scaling should be marked in the
left-hand side window. The chosen variables are then transferred to the Variables window by
clicking the arrow next to this field (in this case all the variables should be transferred).
In this example the data have been entered in a square and symmetrical matrix and are at this
point already distances. Therefore Data are distances is chosen and Shape is activated, which
results in the dialogue box shown below. It should be filled out as illustrated:
If the data set is not already in a proximity matrix this should be defined, because MDS requires the
input to be in the shape of proximity data. Because of this it is sometimes necessary to transform the
data, which can be done by choosing the option Create distances from data. (Please notice that this
option is not pictured here).
Hereby the data are expressed as distances. In the section Measure the method for transformation is
chosen. If the analysis is based on continuous data, Interval is chosen. If the analysis is based on
frequency data, Counts is chosen, and if it is based on binary data, Binary is chosen.
The method by which the distances should be calculated is chosen under each item. Normally the
Euclidean measure of distance is used, which is also the case in this example. The proximity matrix
is generated based on this measure of distance.
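The construction of a Euclidean proximity matrix from raw data (what Create distances from data does) can be sketched as follows (hypothetical Python code, SciPy assumed; the ratings are made up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical raw data: 4 objects rated on 3 variables
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 4.0],
              [5.0, 5.0, 5.0],
              [0.0, 1.0, 2.0]])

# Euclidean proximity matrix: square, symmetrical, zero on the diagonal
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 3))
```

The first two rows differ only in the third variable, so their distance is exactly 1; this square matrix is then suitable input for the MDS procedure.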
The menu item Transform Values provides the possibility of standardization before the calculation
of the proximity data. It is possible to specify the standardisation method in the drop-down menu.
However there are other crucial functions, which are to control the scaling itself. For this reason
Model and Options will subsequently be described separately.
9.3.1 Model
The activation of Model is done to specify the model. The dialogue box below will appear:
In this dialogue box there are a number of possibilities:
• Level of Measurement: Specification of scaling of the data. Under this item it is chosen to
calculate the distances on an ordinal level, since the data in this example are ordinal scaled.
• Conditionality: This item indicates which comparisons make sense. Matrix Conditionality
should be chosen when the calculated distances must be compared within each single
distance matrix. This is the case when the analysis is concerned with one matrix only (as in
this case) or when each matrix represents different areas. Row means that the calculated
distances are only compared in the same row of the matrix. This is relevant if the analysis is
concerned with asymmetrical or rectangular matrices. Row cannot be chosen if the distance
matrix has been made from the data, e.g. from proximity data, since these are square and
symmetrical. Unconditional should be chosen if comparisons between various distance
matrices are needed.
• In Dimensions it should be indicated how many dimensions are needed for the analysis. If
only one solution is needed, minimum and maximum should have the same value.
• In Scaling Model the assumptions behind the scaling are indicated. Euclidian distance can
be used for all distance matrices, while Individual differences Euclidian distances (known as
INDSCAL) can only be used when the analysis deals with multiple distance matrices and
there are at least two dimensions.
In this case MDS should be performed with ordinal data in a two-dimensional solution. Since the
analysis only deals with one distance matrix, which is square and symmetrical, Matrix
Conditionality and the Euclidian distance model are chosen.
9.3.2 Options
Options is activated to define the actual output of the multidimensional scaling. Hereby the
following dialogue box will appear:
The decision regarding the contents of the output from the MDS-procedure is controlled under
Display. The following options are available:
• Group Plots produces a scatter plot of the linear fit between the model and the data material
as the most important feature.
• Individual subject plots creates a plot of the transformations of the data from a single area. A
requirement is that Matrix Conditionality and Ordinal data have been activated in the Model
dialogue box as described above.
• Data matrix includes both the raw data and the scaled data for each subject-area.
• Model and options summary describes the effect that the scaling options have on the
analysis.
Another important item is Criteria since it controls the scaling itself. Here a decision should be
made regarding when the iteration should stop based on criteria concerning badness-of-fit.
• S-stress convergence: S-stress is a badness-of-fit criterion, which indicates how well the
distance matrix is pictured in the two-dimensional space. In S-stress convergence the
smallest acceptable size of the change in badness-of-fit between the iterations is decided.
This means that the iteration procedure stops if the change equals the given value. This size,
of course, cannot be negative. The value should be as small as possible, since smaller s-stress
values enhance precision; however, this will also require more time for the calculations.
• Minimum s-stress value indicates the smallest acceptable badness-of-fit. The value should be
between 0 and 1, with 1 being the worst measure of fit.
• Maximum iterations indicates the maximum number of iterations. MDS is conducted via an
iteration procedure by which the objects are placed in the coordinate system in turn. The
default in SPSS is 30 iterations
but this number can be raised to get a more precise solution (again this will require more
time for the calculations).
Finally it is possible to ensure that distances of a certain size are treated as missing. This is done in
Treat distances less than __ as missing. This function has not been chosen in this example.
9.4 Output
The above actions result in a long and comprehensive output of data regarding the multidimensional
scaling. The most relevant parts will subsequently be commented on.
First some descriptive information about the procedure itself appears which will only be
commented briefly:
Alscal Procedure Options
Data Options-
Number of Rows (Observations/Matrix). 10
Number of Columns (Variables) . . . 10
Number of Matrices . . . . . . 1
Measurement Level . . . . . . . Ordinal
Data Matrix Shape . . . . . . . Symmetric
Type . . . . . . . . . . . Dissimilarity
Approach to Ties . . . . . . . Leave Tied
Conditionality . . . . . . . . Matrix
Data Cutoff at . . . . . . . . ,000000
The output above shows that the symmetrical matrix consists of ten rows and ten columns.
Furthermore, it can be seen that the data is of the ordinal scale of measurement.
A lot of descriptive output follows which will not be commented on.
A print-out of raw data matrices follows (in this case only one). This matrix is similar to the original
matrix (not displayed here).
The following warning draws attention to the fact that the number of estimated parameters is large
relative to the number of data values in the data matrix:
>Warning # 14654
>The total number of parameters being estimated (the number of stimulus
>coordinates plus the number of weights, if any) is large relative to the
>number of data values in your data matrix. The results may not be reliable
>since there may not be enough data to precisely estimate the values of the
>parameters. You should reduce the number of parameters (e.g. request
>fewer dimensions) or increase the number of observations.
>Number of parameters is 20. Number of data values is 45
If the analysis had included more than ten objects in the two-dimensional scaling, this warning would no longer appear.
Then follows a description of the iteration procedure:
Iteration history for the 2 dimensional solution (in squared distances)
Young's S-stress formula 1 is used.
Iteration S-stress Improvement
1 ,02813
2 ,02541 ,00272
3 ,02459 ,00082
Iterations stopped because
S-stress improvement is less than ,001000
As the output illustrates, no more than three iterations were made before the search process stopped.
The reason for this is that the s-stress improvement was less than the specified value of 0.001.
Furthermore, the development in badness-of-fit (s-stress) can be followed together with the improvement
from iteration to iteration.
Finally the first result of the multidimensional scaling follows:
Stress and squared correlation (RSQ) in distances
RSQ values are the proportion of variance of the scaled data (disparities)
in the partition (row, matrix, or entire data) which
is accounted for by their corresponding distances.
Stress values are Kruskal's stress formula 1.
For matrix
Stress = ,02179 RSQ = ,99707
Here the stress value, which should be less than 0.1 according to Kruskal’s assumption, can be
read. This assumption is fulfilled, and the conclusion is that it has been possible to illustrate the
distance matrix in a two-dimensional space with a very high degree of precision. The RSQ
(R-Square = R²), which has a value of 0.99707, is the traditional expression for the degree of
explanation; in this case it is quite large and accordingly very satisfactory.
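Kruskal's stress formula 1 is simple enough to verify by hand. A minimal sketch (Python/NumPy; the distance and disparity values are made up for illustration):

```python
import numpy as np

def kruskal_stress1(distances, disparities):
    """Kruskal's stress formula 1: badness-of-fit of the configuration."""
    d, dhat = np.asarray(distances), np.asarray(disparities)
    return float(np.sqrt(((d - dhat) ** 2).sum() / (d ** 2).sum()))

# Hypothetical configuration distances vs optimally scaled data (disparities)
d    = np.array([1.0, 2.0, 3.0, 4.0])
dhat = np.array([1.1, 1.9, 3.0, 4.1])
print(round(kruskal_stress1(d, dhat), 4))   # → 0.0316
```

A value of 0 would mean a perfect fit; the closer the configuration distances are to the disparities, the smaller the stress.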
The next output indicates the set of co-ordinates in the two dimensions for each soft drink.
Configuration derived in 2 dimensions
Stimulus Coordinates
Dimension
Stimulus Stimulus 1 2
Number Name
1 cocacola 1,0732 -,4818
2 pepsi ,7125 -,3833
3 fanta -1,3131 -,2381
4 afri ,6451 ,8952
5 sprite -,0711 -1,4253
6 bluna -1,7408 -,0561
7 sinalco -1,7354 ,5402
8 club ,3461 1,1895
9 redbull 2,2358 ,1069
10 mezzo -,1523 -,1472
These co-ordinates are used to present the configuration of the objects graphically (see (1) Derived
Stimulus Configuration below). The co-ordinates may be transferred to other programs if a
graphical representation different from the one SPSS offers is needed.
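For instance, the distances between soft drinks can be recomputed directly from the printed coordinates (a Python/NumPy sketch using four of the ten stimulus points above):

```python
import numpy as np

# Stimulus coordinates from the output above (dimension 1, dimension 2)
coords = {"cocacola": (1.0732, -0.4818),
          "pepsi":    (0.7125, -0.3833),
          "fanta":    (-1.3131, -0.2381),
          "redbull":  (2.2358, 0.1069)}

def dist(a, b):
    """Euclidean distance between two stimuli in the derived configuration."""
    return float(np.linalg.norm(np.subtract(coords[a], coords[b])))

# The branded colas lie close together; Fanta is far from both
print(round(dist("cocacola", "pepsi"), 3), round(dist("cocacola", "fanta"), 3))
# → 0.374 2.399
```

This confirms numerically what the configuration plot shows visually: Coca-Cola and Pepsi form a tight group, while Fanta lies in a different region of the map.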
Then follows a matrix with the transformed data (the optimally scaled data):
Optimally scaled data (disparities) for subject 1
1 2 3 4 5
1 ,000
2 ,471 ,000
3 2,424 2,031 ,000
4 1,490 1,274 2,194 ,000
5 1,490 1,274 1,664 2,424 ,000
6 2,808 2,424 ,471 2,602 2,194
7 2,989 2,602 ,891 2,424 2,602
8 1,799 1,664 2,194 ,471 2,602
9 1,274 1,490 3,566 1,799 2,808
10 1,274 ,891 1,274 1,274 1,274
6 7 8 9 10
6 ,000
7 ,471 ,000
8 2,424 2,194 ,000
9 3,987 3,987 2,194 ,000
10 1,664 1,664 1,490 2,424 ,000
With this the text based output ends and the graphical output follows. The first thing that is
presented is the result of the multidimensional scaling in two dimensions:
[Derived Stimulus Configuration (Euclidean distance model): the ten soft drinks (cocacola, pepsi, fanta, afri, sprite, bluna, sinalco, club, redbull, mezzo) plotted in the plane spanned by Dimension 1 and Dimension 2]
This plot gives a visual impression of the position of the soft drinks in relation to each other, based
on the two respondents’ evaluation.
The two respondents’ opinion of the ten soft drinks can be divided into six categories:
1. Orange lemonade (Sinalco, Bluna and Fanta)
2. Lemon lemonade (Sprite)
3. Cola-Fanta-mix (mezzo-mix)
4. Branded Cola (Coca-Cola and Pepsi)
5. Regional brands (Club-Cola and Afri-Cola)
6. Energy drinks (Red Bull)
It seems that the further to the right in the co-ordinate system the observation is located, the larger
the caffeine content in the drinks. The drinks to the left contain little or no caffeine.
The vertical dimension indicates facts about the brands of the drinks. The well-known brands lie further
south in the graph, while the lesser-known brands are situated further north.
The last scatter plots serve the purpose of showing how good the multidimensional scaling is.
[Scatterplot of Linear Fit (Euclidean distance model): Distances (y-axis) plotted against Disparities (x-axis)]
This scatter plot shows the linear fit. In this case Disparities has been plotted against Distances. The
closer these points are around the sloping line through the origin, the larger the degree of
explanation.
10. Conjoint Analysis
This chapter’s explanation and interpretation of Conjoint Analysis in SPSS is based on the
following literature:
- ”Analyse af markedsdata”, Niels Blunch, 2nd rev. edition 2000, Systime
- ”Multidimensional skalering og conjoint analyse – et produktudviklingsværktøj”, Internt undervisningsmateriale H nr. 151, chp. 3, pp. 40-59, 68-73
- “Multivariate Data Analysis”, Hair, Anderson et al., 5th edition, Prentice Hall
10.1 Introduction
Conjoint analysis is a statistical method especially used when conducting market analysis. The
analysis can help explain consumer preferences for (ranking of) existing or new products/product
concepts. Aside from market analysis, conjoint analysis can also be used for a number of other
analyses, such as cost-benefit analysis, segmentation, etc. For a more detailed explanation of these
analyses please refer to the above-mentioned literature.
The purpose of the analysis in this chapter is to determine a company’s optimal product design. This is
done by studying how much a respondent is willing to give up of one attribute in order to receive more of
another.
An important part of the conjoint analysis is the experiment plan where each respondent ranks the presented
product designs based on their combinations of attributes. However, due to the number of attribute
combinations this will often prove to be a difficult task. Therefore an easier method is available in SPSS:
here the respondent is exposed to a smaller number of attribute combinations, and based on these combinations it is
possible to determine the importance of each attribute.
10.2 Example
Data: Blank dataset (see example)
\\okf-filesrv1\exemp\Spss\Manual\Conjoint_orthogonaldesign.sav
\\okf-filesrv1\exemp\Spss\Manual\Conjointdata.sav
A company is interested in marketing their new carpet and furniture cleaning product. The management has identified 5 attributes which are believed to influence consumer preferences: package design (package), brand, price, seal type (seal), and money-back guarantee if unsatisfied (money).
In the table below it is shown how many levels each attribute has.
Attribute Level 1 Level 2 Level 3
Package A B C
Brand K2R Glory Bissell
Price $ 1,19 $ 1,39 $ 1,59
Seal No Yes
Money No Yes
The company is now looking to perform a conjoint analysis in order to examine how much value
the consumer assigns each attribute’s different levels and which attributes the consumer finds most
important. Therefore an analysis is conducted where 10 respondents rank a number of product concepts based on their preferences for each concept. However, a full factorial design would require the respondents to rank 3x3x3x2x2 = 108 different alternatives. In order not to confuse the respondents and to keep costs to a minimum, a simplified examination is therefore conducted at first.
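The size of the full factorial design can be verified with a short sketch (Python is used here purely for illustration; the attribute levels are those from the table above):

```python
from itertools import product

# Attribute levels from the table above
levels = {
    "package": ["A", "B", "C"],
    "brand":   ["K2R", "Glory", "Bissell"],
    "price":   ["$1,19", "$1,39", "$1,59"],
    "seal":    ["No", "Yes"],
    "money":   ["No", "Yes"],
}

# One combination per product concept in the full factorial design
full_factorial = list(product(*levels.values()))
print(len(full_factorial))  # 108
```

Reducing these 108 combinations to a manageable subset is exactly what the orthogonal design in the next section accomplishes.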
The entire example basically consists of two parts. First, the reduced number of combinations which the respondents have to assess is determined. Then, based on this result, the importance of each attribute is determined.
10.3 Generating an orthogonal design
SPSS can help you generate a simplified analysis plan; this way the 108 product concepts are reduced to a more reasonable number. In spite of the reduction, a large part of the informational value is still present.6 In order to start this procedure an empty dataset is opened in SPSS. Below the full procedure is explained.
6 This function assumes that there are no interactions.
After Orthogonal Design and Generate… have been activated, the following dialogue box appears; this is where the definition of the design is carried out.
In the Factor Name field you specify each variable for the orthogonal design separately. Furthermore, a label has to be specified for each variable, since these will be used on the generated plancards which will be shown to the respondents. E.g. the package variable is defined with the label package design. When this has been done, click Add and the variable is transferred to the large area. This is done for all variables.
In the Data File area you can either choose to save the new dataset on the computer or choose
Replace working data file which will make the generated data appear in the current dataset.
After this, the levels for each variable have to be specified. Each factor has to be selected one at a time, followed by a click on Define Values. Doing so activates the following box:
As can be seen from the figure, a value has to be specified for each level of the variable. The figure above shows the defined values for the variable Brand. It is also possible to specify labels for each of the levels; it is recommended that this is done. Furthermore, it is possible to make SPSS generate a number of levels using Auto-Fill. However, you still have to specify the labels for each level yourself.
When this is done for all the variables the actual design function must be established. This is done
by activating options in the main dialogue box.
10.3.1 Options
In the Options dialogue box the desired number of cases is specified. Since the company wishes to test 18 designs, the number 18 is stated in Minimum number of cases to generate; these 18 different designs are then picked from the possible 108. If nothing is specified, SPSS will automatically generate the minimum number of cases necessary to calculate the value of each of the attributes. The more different designs that are generated by SPSS – and assessed by the respondents – the better the results will be, but at the same time the costs will also increase, and at some point there are so many different designs that the validity of each respondent's answers will drop.
Holdout Cases is set to 4, which means that 4 extra designs are generated. These will likewise be assessed by the respondents but not used in the estimation of the utilities (the value of each of the attributes). The Holdout Cases are used to check the validity of the analysis.
The procedure is now ready to be started, so OK is clicked in the main dialogue box and the design will be generated.
10.4 PlanCards
The settings above imply that 22 plancards are generated, which the respondents later have to assess. Furthermore, 2 so-called simulation cards are also added; these can be either existing or expected products, which will then be assessed based on the results of the test. The simulation cards have to be created manually in SPSS along with the specification of the attributes. The resulting dataset can be found at:
\\okf-filesrv1\exemp\spss\manual\conjoint_orthogonaldesign.sav
These plancards must then be moved to the output window. This procedure is shown below:
By doing this, the dialogue box shown below appears; here it is specified which variables are included in the plancards.
In this case we want the respondents to assess all the different variables, which means that they are all selected. In addition, Listing for experimenter and Profiles for subjects are activated.
Below examples of two generated plancards are shown.
Profile Number 1
Card ID: 1 – Package design: A – Brand name: Glory – Price: $1,39 – Good Housekeeping seal: Yes – Money-back guarantee: No

Profile Number 2
Card ID: 2 – Package design: B – Brand name: K2R – Price: $1,19 – Good Housekeeping seal: No – Money-back guarantee: No
10.5 Conjoint analysis
The generated plancards are now used in the analysis, where a number of respondents have to assess the different designs. There are different ways the respondents can be asked to assess the designs. In this example each respondent has ranked all of the 22 (18 + 4) different plancards from 1 to 22. The higher the rank value, the higher the respondent's preference for this combination.
For this illustration 10 respondents have been used. However, this is clearly below what is normally
recommended.
The result can be seen here:
\\okf-filesrv1\exemp\Spss\Manual\Conjointdata.sav
In SPSS it is not possible to perform a conjoint analysis through the normal point-and-click interface; therefore it is necessary to use the syntax. The syntax code for the conjoint analysis is shown below:
Furthermore, in SPSS it is possible to specify a relationship between single variables and the respondents' rankings, if one is believed to be present. In the window above this is described as e.g. discrete and linear (less/more specifies whether the relationship is believed to be positive or negative). Further explanation of the code is available in the SPSS help menu. Choose Help\topics and search for "Conjoint", and the following command syntax will show:
In order to run the code from the syntax window, choose the Run menu and then All. SPSS then performs the conjoint analysis and the result can be seen in the output window.
10.6 Output
In the output, the results for each respondent are available as well as an average result across all the respondents. The output of the average result is shown below in (1) Overall Statistics. Initially only the results for the first 2 respondents are shown; however, if the output window is expanded at the bottom then all results are available.
Importance Values

Averaged Importance Score
package   35,635
brand     14,911
price     29,410
seal      11,172
money      8,872
The Averaged Importance column shows each attribute's importance to the respondents. The higher the score, the more important the specific attribute is to the consumers. From the output it can be seen that the average respondent regards Package Design and Price as the most important
attributes. After those comes Brand, then Good Housekeeping seal, and least important is Money-back Guarantee.
Utilities

                        Utility Estimate   Std. Error
package     A               -2,233            ,192
            B                1,867            ,192
            C                 ,367            ,192
brand       K2R               ,367            ,192
            Glory            -,350            ,192
            Bissell          -,017            ,192
price       $1,19           -1,108            ,166
            $1,39           -2,217            ,332
            $1,59           -3,325            ,498
seal        No               2,000            ,287
            Yes              4,000            ,575
money       No               1,250            ,287
            Yes              2,500            ,575
(Constant)                   7,383            ,650

Coefficients

        B Estimate
price     -1,108
seal       2,000
money      1,250
Utility expresses the utility values for the levels of each attribute. These values are calculated in such a way that if they are added together, with respect to each of the attribute combinations, the product that receives the highest score will also be the product that has received the highest rank. An example of how to calculate the utility for a product concept is shown below (Package = B, Brand name = K2R, Price = $1,19, Good Housekeeping seal = No and Money-back guarantee = No):
1,8667 + 0,3667 – 1,1083 + 2 + 1,25 = 4,3751
It is these calculated utility values the company can employ as a basis for decisions about the development of their new product. This, however, requires a closer study of the output, which will not be described here.
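As a check of the additive logic, the part-worths from the Utilities output can be summed in a small sketch (Python for illustration; the values are the rounded three-decimal table figures, so the total differs in the last decimals from the four-decimal hand calculation):

```python
# Part-worth utilities from the Utilities output (decimal points instead of commas)
utilities = {
    "package": {"A": -2.233, "B": 1.867, "C": 0.367},
    "brand":   {"K2R": 0.367, "Glory": -0.350, "Bissell": -0.017},
    "price":   {"$1,19": -1.108, "$1,39": -2.217, "$1,59": -3.325},
    "seal":    {"No": 2.000, "Yes": 4.000},
    "money":   {"No": 1.250, "Yes": 2.500},
}

def total_utility(profile):
    """Sum the part-worth utilities of one product concept."""
    return sum(utilities[attr][level] for attr, level in profile.items())

concept = {"package": "B", "brand": "K2R", "price": "$1,19",
           "seal": "No", "money": "No"}
print(round(total_utility(concept), 3))  # 4.376
```

Scanning all level combinations for the highest total is how the utilities translate into an optimal product design.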
The Factor plots are mini plots of the utility values for the levels of each attribute. They basically show the same as the plots found at the bottom of the output. From the Factor plots one can see that B is the preferred package design and that $1,19 is the preferred price.
These results must of course be seen in relation to the fact that the respondents gave the highest grades to the most preferred product combinations. This means that the higher the utility values, the better.
Correlations(a)

                              Value   Sig.
Pearson's R                   ,982    ,000
Kendall's tau                 ,892    ,000
Kendall's tau for Holdouts    ,667    ,087

a. Correlations between observed and estimated preferences
Pearson's R and Kendall's tau, which are given after each respondent's values, indicate how good the model is. Both are calculated as correlations between the respondents' observed and estimated preferences, and therefore they should be as high as possible.
By changing the syntax /PRINT=ALL to /PRINT=SUMMARYONLY the following output can be obtained.
(2) Preference Probabilities of Simulations:
Preference Probabilities of Simulations(b)

Card Number   ID   Maximum Utility(a)   Bradley-Terry-Luce   Logit
1             23        30,0%                43,1%           30,9%
2             24        70,0%                56,9%           69,1%

a. Including tied simulations
b. y out of x subjects are used in the Bradley-Terry-Luce and Logit methods because these subjects have all nonnegative scores.
(2) Preference Probabilities of Simulations shows the probabilities of each profile being chosen as the preferred one, calculated with three different probability models. The two profiles assessed here are the simulation designs which were generated earlier; these could be either current or newly planned products.
The three probability models applied are the maximum utility model, the BTL (Bradley-Terry-Luce) model, and the logit model. The probabilities are calculated based on the two earlier mentioned simulation cards. It can be seen that all three models indicate that card number 24 is preferred.
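The three decision rules can be sketched as follows (Python for illustration; the utility values here are made-up placeholders, not the values behind the output above):

```python
import math

# Made-up total utilities of the two simulation cards for one respondent
u = {23: 3.0, 24: 7.0}

# Maximum utility: all probability mass on the highest-utility card
max_util = {card: float(val == max(u.values())) for card, val in u.items()}

# Bradley-Terry-Luce: utility divided by the sum of utilities
btl = {card: val / sum(u.values()) for card, val in u.items()}

# Logit: exponentiated utility divided by the sum of exponentiated utilities
denom = sum(math.exp(val) for val in u.values())
logit = {card: math.exp(val) / denom for card, val in u.items()}

print(max_util[24], btl[24], round(logit[24], 3))  # 1.0 0.7 0.982
```

SPSS then averages these per-respondent probabilities over the subjects with all-nonnegative scores, which is what footnote (b) in the output refers to.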
11. Item analysis
The explanation and interpretation of item analysis presented in this section is based on the following literature:
- SPSS Base 10.0 User's Guide, SPSS, Ch. 34, pp. 407-411.
11.1 Introduction
Item analysis is often used for evaluating questions in questionnaires. The analysis is used to confirm the choice of a group of questions which, added together, are presumed to measure a predefined concept.
The analysis can be used for several purposes. One of the purposes is to evaluate each question. In
this connection the focus is on whether questions can be left out of an analysis, perhaps because the
questions do not measure what they were intended to measure.
Furthermore, it is possible to examine if the answers should be turned around by making a so-called
negation. For example, an answer with the value 1 (on a scale of 1 to 5) might be turned around and
be assigned a value of 5 instead. This can be very important for the further analysis. The item
analysis is often used before the factor analysis (see chapter 8) or AMOS models (See also “Manual
for AMOS 6.0” from ITA).
11.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\Itemanalysedata.sav
This example is based on the questionnaires concerning the lectures, which are filled out by the students at the Aarhus School of Business every semester. Each questionnaire contains 19 questions. The students are asked to evaluate the subject itself, the pedagogical competence, and their own performance. The data is the same as was used in the Factor Analysis (chapter 8). The answer to each single question is indicated on a Likert scale from 1 to 5 where:
• 1 is: No, not at all
• 3 is: Acceptable
• 5 is: Yes, absolutely
In this example of the item analysis, the purpose is to assess to what extent the first eight questions
are relevant for the evaluation of the subject and to what extent questions can be left out.
Furthermore it is examined whether it is relevant to make so-called negations on the eight questions.
11.3 Implementation of the analysis
In SPSS terminology, item analysis is called Reliability Analysis. Note that it is not possible to look up the word Item in the online help function.
The following choices are made under Analyze:
This results in the following dialogue box:
In this dialogue box the following variables (questions) are chosen in the window to the left and
moved to the right-hand side window by clicking on the arrow:
Relevant, svrt, tid_udb, spn_int, lit_egn, lit_insp, pensum, udbytte
Two things must now be specified. First a Model must be chosen, and subsequently it is possible to define the output under Statistics.
11.3.1 Model
Model gives the following possibilities of reliability models:
• Alpha (Cronbach): This is a model of internal consistency, based on the average inter-item
correlation.
• Split-half: This model splits the scale into two parts and examines the correlation between
the parts.
• Guttman: This model computes Guttman’s lower bounds for true reliability.
• Parallel: This model assumes that all items have equal variances and equal error variances
across replications.
• Strict parallel: This model makes the assumptions of the parallel model and also assumes
equal means across items.
In this example Alpha has been chosen as model.
11.3.2 Statistics
By clicking the Statistics button, the following dialogue box appears. The boxes that are activated
indicate the chosen outputs:
Descriptives for provides the opportunity to display the mean and the standard deviation for each item and for the scale. In this case Item and Scale have been chosen.
Scale if item deleted indicates how much the scale (Cronbach's α) changes if an item is removed from the analysis.
Inter-item provides the opportunity to display the correlation matrix and/or the covariance matrix
between all items.
Summaries:
• Means: Summary statistics for item means. The smallest, largest, and average item means,
the range and variance of item means, and the ratio of the largest to the smallest item means
are displayed.
• Variances: The same as for means, only for the variance instead.
• Covariances: The same as for means, only for the covariances between items.
• Correlations: The same as for means, only for the correlations instead.
ANOVA Table:
Here it is possible to carry out a repeated-measures analysis of variance with an F-test. The Friedman chi-square test is appropriate for rank data. The Cochran chi-square is appropriate for data that is dichotomous.
The following possibilities are also available:
• Hotelling’s T-square: A multivariate test of the null hypothesis that all items on the scale
have the same mean.
• Tukey’s test of additivity: Estimates the power to which a scale must be raised to achieve
additivity. Tests the assumption that there is no multiplicative interaction among the items.
• Intraclass correlation coefficient: Produces single and average measure intraclass
correlation coefficients, along with a confidence interval, F statistic, and significance value
for each.
• Model: Here, the model for the computation of the intraclass correlation coefficients is specified. Select two-way mixed when people effects are random and item effects are fixed, two-way random when both people effects and item effects are random, and one-way random when people effects are random.
• Type: Here the definition of the intraclass correlation coefficient is chosen.
• Confidence interval: Here the desired confidence interval is indicated.
• Test value: The value with which an estimate of the intraclass correlation coefficient is
compared. The value should be between 0 and 1.
11.4 Output
Below, the relevant output is chosen and commented on. The first output is (1) Correlation matrix,
which shows the correlation between the responses on the eight questions.
Inter-Item Correlation Matrix

           relevant    svrt  tid_udb  spn_int  lit_egn  lit_insp  pensum  udbytte
relevant     1,000    -,155     ,399     ,427     ,305      ,333   -,187     ,447
svrt         -,155    1,000    -,276    -,267    -,139     -,144    ,313    -,248
tid_udb       ,399    -,276    1,000     ,404     ,318      ,232   -,232     ,474
spn_int       ,427    -,267     ,404    1,000     ,346      ,438   -,057     ,443
lit_egn       ,305    -,139     ,318     ,346    1,000      ,547   -,214     ,447
lit_insp      ,333    -,144     ,232     ,438     ,547     1,000   -,086     ,332
pensum       -,187     ,313    -,232    -,057    -,214     -,086   1,000    -,193
udbytte       ,447    -,248     ,474     ,443     ,447      ,332   -,193    1,000
This output shows that svrt (the degree of difficulty) and pensum (the extent of the curriculum) correlate negatively with the remainder of the variables. For this reason a so-called negation is made, where the scale is turned around. This is done by creating two new variables (choose Transform – Compute):
• NY_SVRT = 6 – SVRT
• NY_PENSUM = 6 – PENSUM
The negation takes place by calculating the new variables as N + 1 − the former value, where N is the number of possible answers. After the new variables have been calculated, the analysis is run once more with the two new variables but without the variables svrt and pensum. The method is the same as previously described and will not be repeated here. The new (2) Correlation Matrix looks as follows:
Inter-Item Correlation Matrix

            relevant  tid_udb  spn_int  lit_egn  lit_insp  udbytte  ny_svrt  Ny_PENSUM
relevant      1,000     ,399     ,427     ,305      ,333     ,447     ,155       ,187
tid_udb        ,399    1,000     ,404     ,318      ,232     ,474     ,276       ,232
spn_int        ,427     ,404    1,000     ,346      ,438     ,443     ,267       ,057
lit_egn        ,305     ,318     ,346    1,000      ,547     ,447     ,139       ,214
lit_insp       ,333     ,232     ,438     ,547     1,000     ,332     ,144       ,086
udbytte        ,447     ,474     ,443     ,447      ,332    1,000     ,248       ,193
ny_svrt        ,155     ,276     ,267     ,139      ,144     ,248    1,000       ,313
Ny_PENSUM      ,187     ,232     ,057     ,214      ,086     ,193     ,313      1,000
The negative correlations are now positive.
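The negation rule, N + 1 − former value, can be sketched like this (Python for illustration, on the 1-to-5 Likert scale used here):

```python
def negate(value, n_categories=5):
    """Negation: N + 1 - former value, where N is the number of possible answers."""
    return n_categories + 1 - value

# On a 1-to-5 Likert scale, 1 becomes 5, 3 stays 3, and 5 becomes 1
print([negate(v) for v in [1, 2, 3, 4, 5]])  # [5, 4, 3, 2, 1]
```

With N = 5 this is exactly the 6 − SVRT and 6 − PENSUM computations defined above.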
Then follows (3) Item-total statistics. This output shows what Cronbach's alpha would be if each of the items/questions were removed.
Item-Total Statistics

             Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
             Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
relevant        19,7429         12,020               ,511                ,311                ,741
tid_udb         20,0810         12,792               ,534                ,326                ,740
spn_int         19,9667         11,879               ,546                ,377                ,735
lit_egn         20,1048         12,104               ,526                ,399                ,739
lit_insp        20,6429         12,652               ,488                ,380                ,746
udbytte         20,1095         12,404               ,600                ,398                ,729
ny_svrt         21,2905         13,403               ,334                ,183                ,771
Ny_PENSUM       20,5286         13,772               ,275                ,167                ,780
Output (3) must be compared to Cronbach’s alpha for the analysis, which is mentioned in the output
below (4) Reliability Coefficients.
Reliability Statistics

Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
     ,773                            ,774                             8
By comparing outputs (3) and (4) it can be seen that Cronbach's alpha climbs from 0,773 to 0,780 if Ny_PENSUM is removed from the analysis. Hence, a decision should be made whether this question contributes to the analysis. The rule of thumb is that an item should be removed if Cronbach's alpha increases when it is deleted. Furthermore, Cronbach's alpha is in this case higher than 0.7, which it should be. It should also be noted that Cronbach's alpha tends to increase with the number of items.
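For reference, Cronbach's alpha is computed as α = k/(k−1) · (1 − Σ item variances / variance of the total score). A small sketch on made-up answers (Python for illustration; this is not the lecture-evaluation data):

```python
def cronbach_alpha(items):
    """items: one list of answers per item, all of equal length."""
    k = len(items)

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(col) for col in zip(*items)]  # each respondent's total score
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Made-up answers from 5 respondents on 3 items (1-to-5 scale)
items = [[3, 4, 5, 2, 4],
         [2, 4, 5, 3, 4],
         [3, 5, 4, 2, 5]]
print(round(cronbach_alpha(items), 3))  # 0.886
```

Dropping an item and recomputing alpha on the remaining items reproduces the logic behind the "Cronbach's Alpha if Item Deleted" column.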
12. Introduction to Matrix algebra
The purpose of this introduction to matrix algebra in SPSS is to provide the initial insight needed to become familiar with the programming language. The matrix programming language contains many twists and turns, so you will not become an expert by following this introduction alone. The introduction primarily covers matrix algebraic operations and is strongly inspired by the Cand.merc. subjects "Applied Econometric Methods" and "Applied Quantitative Methods".
The complexities encountered when using interactive matrix programming are NOT associated with identifying and understanding the applied syntax commands, since this is relatively simple. On the other hand, it can be quite difficult to gain an overview of the vast possibilities that the program provides access to. It is often easy to determine, through theoretical considerations, which matrices one wishes to compute, but it can be quite problematic to determine which algebraic operators and functions are needed to achieve the desired computation. Even so, all these difficulties diminish with repeated use of the program.
12.1 Initialization
In order to work with matrix algebra in SPSS it is necessary to work in the syntax environment.
This is done by opening a syntax file under the menu FILE – NEW – SYNTAX. It is then possible to initialize the recognition of matrix programming code by writing the following syntax command:
MATRIX.
It is important to remember that matrix programming code must be executed with the following
syntax statement:
END MATRIX.
In other words the matrix algebra must be written between these 2 commands.
12.2 Importing variables into matrices
Often it will be necessary for you to import an existing dataset into matrices. For this purpose you
can use the following syntax:
GET X
/VARIABLES = Variable1, Variable2
/NAMES = VARNAMES .
Above, SPSS reads in all the observations from the open dataset into matrix X. The variable
selection is carried out by listing the variable names after the equal sign, separating the variable
names by commas. The selected variables will hence be imported as column vectors into the matrix
X.
12.3 Matrix computation
The calculations in the matrix language can be carried out on common scalar values, or you can exploit matrix operators to save time and space. The matrix language treats scalar values as matrices of dimension 1x1.
The establishment of scalar values is carried out by using the following command:
Compute A = 2.
Establishing a row vector:
Compute B = {1,2, 3}.
Establishing a column vector:
Compute C = {4;5;6}.
Finally the establishment of matrices:
Compute D = {1,2,3; 4,5,6; 7,8,9}.
Or for an improved overview:
Compute E ={4,2,3;
6,8,6;
1,3,9}.
If you are interested in seeing the matrices that have been computed you can use the following
code:
Print E.
This prints the matrix E to the output. One must be aware that it is not possible to list several matrices in the Print command unless the matrices have an equal number of columns or rows AND you use curly brackets to embrace the matrices, separated by commas, as shown here:
Print {B,E}.
This prints a matrix composed of the two temporarily merged matrices. To summarize, the following code and corresponding output exemplify the above-mentioned syntax commands:
Example 1 Full syntax file.
MATRIX .
Compute A = 2.
Compute B = {1,2, 3}.
Compute C ={4;5;6}.
Compute D = {1,2,3; 4,5,6; 7,8,9}.
Compute E = {4,2,3;
6,8,6;
1,3,9}.
Print A.
Print B.
Print C.
Print {A,B}.
Print {B;D}.
END MATRIX .
Example 1 Output
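For readers who want to verify results outside SPSS, the matrices from Example 1 can be set up in Python/NumPy (an aid for checking only, not part of the SPSS syntax):

```python
import numpy as np

A = np.array([[2]])            # a scalar is a 1x1 matrix
B = np.array([[1, 2, 3]])      # row vector
C = np.array([[4], [5], [6]])  # column vector
D = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
E = np.array([[4, 2, 3],
              [6, 8, 6],
              [1, 3, 9]])

# Print {A,B}: horizontal merge (equal number of rows)
print(np.hstack([A, B]))
# Print {B;D}: vertical merge (equal number of columns)
print(np.vstack([B, D]))
```

The curly-bracket notation {.,.} and {.;.} in SPSS corresponds to horizontal and vertical stacking, respectively.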
12.4 General Matrix operations
The following table provides an overview of the different operators that are applied in order to carry
out matrix operations.
Symbol Operator
−
Changes the sign of the matrix. A minus sign alone in front of a matrix
will invert the sign of each element in the matrix.
+ Matrix addition.
− Matrix subtraction.
* Multiplication.
/ Division.
**
Matrix exponentiation. The matrix is multiplied by itself, as many times
as the number it is raised to the power of.
&* Elementwise multiplication.
&/ Elementwise division.
&** Elementwise exponentiation.
Scalar values and matrices can hence be combined based on conventional arithmetic operations.
This can be seen by adding the following syntax to the before mentioned matrix syntax (remember
that the code must be between the MATRIX. and END MATRIX. statements):
Compute A = B+4.
Compute F = B-4.
Compute G= C*4.
Compute H= D/4.
Print A.
Print F.
Print G.
Print H.
By running (RUN - ALL) the new syntax code it is possible to get the following output:
A
  5  6  7

F
  -3  -2  -1

G
  16
  20
  24

H
   ,250000000   ,500000000   ,750000000
  1,000000000  1,250000000  1,500000000
  1,750000000  2,000000000  2,250000000
Note that the matrix A no longer has the value 2.
Two matrices can only be added if they are of the same order:
Compute I = D+E.
Print I.
By adding this code you will add the following to your new output document:
The same rules apply to subtraction. As seen in the matrix operator table it is possible to change the sign of the elements in a matrix by using a minus sign, as seen here:
Multiplication can only be carried out as long as the matrices conform.
Compute K = B*D.
Print K.
In this case the dimensions of B are 1x3 while D is 3x3, meaning that K becomes a 1x3 matrix:
The joining of two matrices can be done either horizontally or vertically, but once again it is
required that the matrices conform to each other.
Compute L={D,C}. /* Horizontal: 3x3 matrix with a 3x1 column vector. */
Print L.
I
   5   4   6
  10  13  12
   8  11  18

K
  30  36  42
Compute M={D;B}. /* Vertical: 3x3 matrix with a 1x3 row vector. */
Print M.
The results of these mergers are:

L
  1  2  3  4
  4  5  6  5
  7  8  9  6

M
  1  2  3
  4  5  6
  7  8  9
  1  2  3
It is also possible to compute matrix values by using the index operator (:). It works as follows:
Compute N={9:4}.
Print N.
The result is:
N 9 8 7 6 5 4
It is also possible to generate arithmetic sequences by using the following type of syntax:
Compute O = {2:14:3}.
Print O.
In this case a matrix O is computed. Its first column value is 2, after which each following column's value increases by 3 until the last column's value reaches 14.

O
  2  5  8  11  14
Finally, it is also possible to transpose a matrix – or a vector – by using the following syntax:
Compute P=T(K).
Print P.
In this case a new matrix, P, is computed which corresponds to the transposed K matrix. The output resulting from adding the above syntax is:

P
  30
  36
  42
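The index operator and the transpose have direct counterparts outside SPSS; here a NumPy sketch for checking the results above (an illustration only, not SPSS syntax):

```python
import numpy as np

N = np.arange(9, 3, -1)       # {9:4}: counts down from 9 to 4
O = np.arange(2, 15, 3)       # {2:14:3}: from 2 in steps of 3 up to 14

K = np.array([[30, 36, 42]])  # the 1x3 result of B*D
P = K.T                       # T(K): a 3x1 column vector

print(N)        # [9 8 7 6 5 4]
print(O)        # [ 2  5  8 11 14]
print(P.shape)  # (3, 1)
```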
The use of different matrix operations will now be exemplified through the use of a more
comprehensive example.
EXAMPLE 2
The Simpson family has decided to keep account of the expenses used on their meals, so they can
track the daily expenses for each person. The family is composed of 4 people: Homer, Marge, Bart
and Lisa. The expenses are registered into three categories: Breakfast, Lunch and Dinner.
The accounts for the past week can be seen in the following table:
ACCOUNTS BREAKFAST LUNCH DINNER
Homer 50.50 45.00 67.25
Marge 45.00 40.25 88.75
Bart 30.00 30.25 30.75
Lisa 20.00 40.00 80.50
The Simpson Family has chosen to keep track of their expenses using SPSS matrix algebra.
1.part
MATRIX.
COMPUTE Accounts = {50.5, 45, 67.25;
45, 40.25, 88.75;
30, 30.25, 30.75;
20, 40, 80.5}.
COMPUTE names = {"Homer";"Marge";"Bart";"Lisa"}.
COMPUTE meal = {"Breakfast","Lunch","Dinner"}.
PRINT Accounts /TITLE = "The Simpson Family" /FORMAT = F12.2 /RNAMES =
names /CNAMES = meal.
END MATRIX.
This provides the following output:
The Simpson Family
         Breakfast   Lunch   Dinner
Homer      50,50     45,00    67,25
Marge      45,00     40,25    88,75
Bart       30,00     30,25    30,75
Lisa       20,00     40,00    80,50
The family agrees to compute their expenses without VAT, and therefore adds the following syntax within the matrix syntax:
2.part
COMPUTE vat = 0.22.
COMPUTE Accounts = Accounts/(1+vat).
PRINT Accounts /TITLE "The Simpson Family, Without VAT" /FORMAT = F12.2
/RNAMES = names /CNAMES = meal.
This results in the following output:
The Simpson Family, Without VAT
         Breakfast   Lunch   Dinner
Homer      41,39     36,89    55,12
Marge      36,89     32,99    72,75
Bart       24,59     24,80    25,20
Lisa       16,39     32,79    65,98
When the week is over the accounts are totaled for each person.
3.part
COMPUTE unit={1;1;1}.
COMPUTE pers_tot = Accounts*unit.
PRINT pers_tot /TITLE "Total each person" /FORMAT = F12.2 /RNAMES = names.
This results in the following table:
Total each person
Homer    133,40
Marge    142,62
Bart      74,59
Lisa     115,16
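The per-person totals come from multiplying the VAT-adjusted accounts matrix by a column of ones; a NumPy sketch reproducing the table above (for checking only, not SPSS syntax):

```python
import numpy as np

# Weekly accounts from the table above (rows: Homer, Marge, Bart, Lisa)
accounts = np.array([[50.50, 45.00, 67.25],
                     [45.00, 40.25, 88.75],
                     [30.00, 30.25, 30.75],
                     [20.00, 40.00, 80.50]])

vat = 0.22
accounts = accounts / (1 + vat)   # remove VAT, as in 2.part

unit = np.ones((3, 1))            # column vector of ones
pers_tot = accounts @ unit        # row totals via a matrix product
# Homer 133.40, Marge 142.62, Bart 74.59, Lisa 115.16
print(np.round(pers_tot.ravel(), 2))
```

Multiplying a 4x3 matrix by a 3x1 vector of ones sums each row, which is exactly what `Accounts*unit` does in the SPSS syntax.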
In the meantime the Simpson family has received a guest for the holidays: Milhouse, a poor student, who will also be entered into the family accounts. Milhouse is exempt from VAT!
4.part
COMPUTE guest = {"Milhouse"}.
COMPUTE newnames = {names; guest}.
COMPUTE mexpense = {80, 120, 150}.
COMPUTE holiday = {Accounts; mexpense}.
PRINT holiday /TITLE "Holiday Accounts" /FORMAT = F12.2 /RNAMES = newnames
/CNAMES = meal.
Below it is possible to observe the output that results from adding the above syntax.
Holiday Accounts
           Breakfast   Lunch    Dinner
Homer        41,39     36,89     55,12
Marge        36,89     32,99     72,75
Bart         24,59     24,80     25,20
Lisa         16,39     32,79     65,98
Milhouse     80,00    120,00    150,00
12.5 Matrix algebraic functions
Function Use
DET(squarematrix) Returns the determinant of a square matrix.
MDIAG(argument) Returns a diagonal matrix. The argument is a vector or a square matrix.
EVAL(symmetricmatrix) Returns a column vector containing the eigenvalues in descending order.
IDENT(dimension) Computes an identity matrix.
INV(matrix) Computes the inverse of a matrix.
MAKE(n,m,value) Computes an n x m matrix of identical values.
MSSQ(matrix) Computes the sum of squares of all elements.
CSUM(matrix) Computes the sums of the column elements.
TRACE(matrix) Returns the sum of the diagonal elements.
NCOL(matrix) Returns a scalar containing the number of columns in a matrix.
NROW(matrix) Returns a scalar containing the number of rows in a matrix.
SOLVE(Matrix1,Matrix2) Solves systems of linear equations. If M1*X=M2, then X = SOLVE(M1,M2).
DIAG(matrix) Computes a column vector containing the elements of the main diagonal.
Below, the use of some of the above functions is demonstrated in a comprehensive example. The example carries out a regression analysis. The theory used is, where deemed necessary, included as comments in the syntax. The example can be run by entering the code into a syntax window (FILE-NEW-SYNTAX) and then using the menu point "Run All" (RUN-ALL).
EXAMPLE 3 Regression analysis:
Use the dataset which can be found here: \\okf-filesrv1\exemp\Spss\Ha manual\het-km-alder.sav .
MATRIX. /*Start recognition of matrix commands in SPSS syntax*/
GET y /* Import the dependent variable from the open dataset */
/VARIABLES = km /* The variable's name, in this case km */
/NAMES = varnamey . /* Row vector with the variable name */
GET x /* Import the independent variables from the open dataset */
/VARIABLES = alder /* The variable names, in this case alder; with more than one
variable: separate variable names with commas */
/NAMES = varnamex. /* Row vector with variable names */
Compute k = ncol(x). /*Problem dimension established*/
Compute n = nrow(x).
Compute varnames = t({"Constant", varnameX}). /* Create a name column vector*/
Compute con = make(n,1,1). /* Create a column vector to compute the constant*/
Compute x={con,x}. /* Merge the constant to the independent variables*/
Compute df2 = n-k-1. /* Degrees of freedom established*/
Compute invxtx = inv(t(x)*x). /* OLS-estimation*/
Compute b = invxtx*(t(x)*y). /* b = (x'x)^(-1) x'y */
Compute j = nrow(b). /* Number of estimated parameters */
Compute yhat = x*b. /*Estimated value of Y, called yhat*/
Compute resid = (y-(yhat)). /*Hence calculate residuals*/
Compute sse = cssq(resid). /* The sum of squared errors (sum of ei^2) is computed
to gain an estimate of the unexplained variation*/
Compute mse = sse/df2. /* Mean-Square Error*/
Compute tss = cssq((y-(csum(y)/n))). /* Total variation, SAKy or TSS*/
Compute r2 = (tss-sse)/tss. /* Explanatory strength, R2 */
Compute covb = (invxtx)&*(mse). /* Variance-covariance matrix*/
Compute stdeb = sqrt(diag(covb)). /*Column vector with std error of parameters */
Compute tobs = b/stdeb. /* Calculate t- values, t=(b-β)/σβ */
Compute f = ((tss-sse)/k)/mse.
Compute pf = 1-fcdf(f,k,df2). /* Lookup in the F-distribution */
Compute s = sqrt(mse). /* Regression Standard Error */
Compute pf = {r2,f,k,df2,pf}. /* Result row vector */
Compute p = 2*(1-tcdf(abs(tobs), n-j)). /*p-values for parameter estimates */
Compute oput = {b, stdeb, tobs, p}. /* Collecting estimates in output matrix */
Print varnameY /title = "Dependent Variable" /format A8. /* Print the name of the
dependent variable */
Print sse /TITLE = "SSE" /FORMAT = F12.3 . /* Print measure for unexplained part*/
Print s /TITLE = "Regression Standard Error (RES)" /FORMAT = F12.3 .
Print pf /title = "Model Fit:" /clabels = "R-sq" "F" "df1" "df2" "p"/format F10.4 .
Print oput /title = 'Regression Results' /clabels = "B(OLS)" "SE(OLS)" "t" "P>|t| " /rnames =
varnames /format f10.4.
END MATRIX. /* Stop recognition of the matrix language in SPSS syntax */
EXECUTE.
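The OLS computation carried out by the syntax above can be cross-checked in NumPy. The sketch below is an assumption-laden illustration, not the manual's method: the data are made up (the file het-km-alder.sav is not reproduced here), and the p-value lookups (FCDF, TCDF) are omitted because they require a distribution function outside NumPy.

```python
import numpy as np

# Hypothetical data standing in for km (y) and alder (x); NOT the manual's dataset.
y = np.array([10.0, 12.0, 15.0, 18.0, 21.0]).reshape(-1, 1)
x = np.array([ 1.0,  2.0,  3.0,  4.0,  5.0]).reshape(-1, 1)

n, k = x.shape                               # problem dimensions
x1 = np.hstack([np.ones((n, 1)), x])         # merge a constant column onto x
invxtx = np.linalg.inv(x1.T @ x1)            # (x'x)^(-1)
b = invxtx @ (x1.T @ y)                      # b = (x'x)^(-1) x'y
yhat = x1 @ b                                # estimated values of y
resid = y - yhat                             # residuals
sse = float(np.sum(resid ** 2))              # unexplained variation (SSE)
df2 = n - k - 1                              # degrees of freedom
mse = sse / df2                              # mean square error
tss = float(np.sum((y - y.mean()) ** 2))     # total variation (TSS)
r2 = (tss - sse) / tss                       # explanatory strength, R^2
covb = invxtx * mse                          # variance-covariance matrix
stdeb = np.sqrt(np.diag(covb)).reshape(-1, 1)  # std errors of the parameters
tobs = b / stdeb                             # t-values, t = (b - 0) / se(b)
f = ((tss - sse) / k) / mse                  # F-statistic for model fit
```

Running both versions on the same data and comparing b, SSE, R² and the t-values line by line is a useful way to verify that the matrix syntax does what the comments claim.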
Try to carry out an interpretation of the results of this regression analysis. It is especially interesting
to gain a greater understanding of the dimensions of the different matrices while the regression
analysis is being carried out. It is possible to find more matrix-algebraic functions by using the
SPSS help function. It can be found as follows:
Choose HELP-TOPICS and search for ”Matrix functions” under Index:
APPENDIX A [Choice of Statistical Methods]
[Figure: decision tree for choosing a statistical method. The starting questions are: Is the analysis concerning variables or observations? Is any of the variables dependent on other variables? How many dependent variables are there (one or several, interval scaled)? On what scale (nominal, ordinal, interval) are the dependent and independent variables measured? Depending on the answers, the tree leads to one of: Cluster Analysis, Multidimensional Scaling, Correspondence Analysis, Factor Analysis, Canonical Analysis, Logit Analysis, Discriminant Analysis, Regression Analysis, Regression with dummy variables, Analysis of Variance, Conjoint Analysis, MANOVA/GLM, or Log-linear Analysis.]
Source: Niels J. Blunch: Analyse af markedsdata, 2. rev. udg., 2000, Systime.
Literature List
Blunch, Niels J. (2000): ”Analyse af markedsdata”, 2nd rev. edition, Systime
_____________ (2000): ”Videregående data-analyse med SPSS og AMOS”, 1st edition, Systime
_____________ (1993): ”Nogle teknikker til analyse af kvalitative data”, internal teaching
        material E no. 19, Handelshøjskolen i Århus
Greenacre, Michael J. (1984): “Theory and Applications of Correspondence Analysis”, Academic
        Press Inc.
Hair, Anderson et al. (1998): “Multivariate Data Analysis”, 5th edition, Prentice Hall
Johnson, Richard A.; Wichern, Dean W. (1998): “Applied Multivariate Statistical Analysis”, 4th
        edition, Prentice Hall
Juhl, Hans Jørn (1992): ”Multidimensional skalering og conjoint analyse – et
        produktudviklingsværktøj”, internal teaching material H no. 151, Institut for
        informationsbehandling, Handelshøjskolen i Århus
Kappelhoff, Peter (2001): “Multidimensionale Skalierung – Beispieldatei zur Datenanalyse”,
        Lehrstuhl für empirische Wirtschafts- und Sozialforschung, Fachbereich
        Wirtschaftswissenschaft, Bergische Universität Gesamthochschule (BUGH)
        Wuppertal
Malhotra, Naresh K. (1999): “Marketing Research – An Applied Orientation”, 3rd edition, Prentice
        Hall
Meulman, Jacqueline J.; Heiser, Willem J. (1999): “SPSS Categories® 10.0”, SPSS Inc.
___________________________________ (1999): “SPSS® Advanced Models™ 10.0”, SPSS Inc.
___________________________________ (1999): “SPSS® Base 10.0 Applications Guide”, SPSS Inc.
___________________________________ (1999): “SPSS® Base 10.0 User’s Guide”, SPSS Inc.