1 introduction to spss data types and spss data entry and analysis
TRANSCRIPT
1
Introduction to SPSS
Data types and SPSS data entry and analysis
2
In this session
What does SPSS look like?
Types of data (revision)
Data entry in SPSS
Simple charts in SPSS
Summary statistics
Contingency tables and crosstabulations
Scatterplots and correlations
Tests of differences of means
3
SPSS/PASW
4
Aspects of SPSS
Menus - esp. Analyse and Charts
Spreadsheet view of data
Rows are cases (people, respondents etc.)
Columns are variables
Variable view of data
Shows detail of each variable type
5
Questionnaire Data Coding
6
In SPSS
We change ticks etc. on a questionnaire into numbers
One number for each variable for each case
How we do this depends on the type of variable/data
7
Types of data
Nominal
Ranked
Scales/measures
Mixed types
Text answers (open-ended questions)
8
Nominal (categorical)
Order is arbitrary, e.g. sex, country of birth, personality type, yes or no.
Use numeric in SPSS and give value labels.
(e.g. 1=Female, 2=Male, 99=Missing)
(e.g. 1=Yes, 2=No, 99=Missing)
(e.g. 1=UK, 2=Ireland, 3=Pakistan, 4=India, 5=other, 99=Missing)
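The coding step can be sketched in a few lines of Python. This is only an illustration, not SPSS itself: the label table copies the sex example above, while the function name `code_answer` and the sample answers are made up.

```python
# Value labels as on the slide: 1=Female, 2=Male, 99=Missing.
SEX_LABELS = {1: "Female", 2: "Male", 99: "Missing"}
MISSING_CODE = 99


def code_answer(answer, labels=SEX_LABELS, missing=MISSING_CODE):
    """Turn a ticked answer (text) into its numeric code.

    Blank or unrecognised answers are given the missing-value code.
    """
    lookup = {label.lower(): code for code, label in labels.items()}
    return lookup.get(answer.strip().lower(), missing)


# Hypothetical questionnaire answers, one per case (respondent).
raw_answers = ["Female", "male", "", "Female"]
coded = [code_answer(a) for a in raw_answers]
print(coded)
```

Each case ends up with exactly one number for the variable, which is what the SPSS data view expects.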
9
Ranks or Ordinal
In order, 1st, 2nd, 3rd etc., e.g. status, social class
Use numeric in SPSS with value labels
E.g. 1=Working class, 2=Middle class, 3=Upper class
E.g. Class of degree, 1=First, 2=Upper second, 3=Lower second, 4=Third, 5=Ordinary, 99=Missing
10
Measures, scales
1. Interval - equal units e.g. IQ
2. Ratio - equal units, zero on scale e.g. height, income, family size, age
Makes sense to say one value is twice another
Use numeric (or comma, dot or scientific) in SPSS
E.g. family size, 1, 2, 3, 4 etc.
E.g. income per year, 25000, 14500, 18650 etc.
11
Mixed type
Categorised data
Actually ranked, but used to identify categories or groups
e.g. age groups = ratio data put into groups
Use numeric in SPSS and use value labels.
E.g. Age group, 1=‘Under 18’, 2=‘18-24’, 3=‘25-34’, 4=‘35-44’, 5=‘45-54’, 6=‘55 or greater’
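As a rough sketch of how ratio data (age in years) is put into the coded groups listed above, the Python below uses the slide's six age bands. The function name `age_group` and the sample ages are assumptions for illustration only.

```python
# Value labels taken from the slide's age-group example.
AGE_GROUP_LABELS = {
    1: "Under 18", 2: "18-24", 3: "25-34",
    4: "35-44", 5: "45-54", 6: "55 or greater",
}

# (upper age limit, code) pairs for every band except the open-ended last one.
UPPER_LIMITS = [(17, 1), (24, 2), (34, 3), (44, 4), (54, 5)]


def age_group(age):
    """Return the group code for an age in whole years."""
    for upper, code in UPPER_LIMITS:
        if age <= upper:
            return code
    return 6  # 55 or greater


ages = [17, 18, 24, 25, 40, 54, 55, 70]  # made-up ages
codes = [age_group(a) for a in ages]
for a, c in zip(ages, codes):
    print(a, c, AGE_GROUP_LABELS[c])
```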
12
Text answers
E.g. answers to open-ended questions
Either enter text as given (use String in SPSS)
Or code or classify answers into one of a small number of types (use numeric/nominal in SPSS)
13
Data Entry in SPSS
Video by Andy Field
14
Frequency counts
Used with categorical and ranked variables
e.g. gender of students taking Health and Illness option
15
e.g. Number of GCSEs passed by students taking Health and Illness option
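A frequency count simply tallies how many cases have each code. The sketch below uses Python's `collections.Counter` on made-up coded gender data; it shows the idea behind a frequency table rather than the SPSS procedure itself.

```python
from collections import Counter

GENDER_LABELS = {1: "Female", 2: "Male"}
genders = [1, 1, 2, 1, 2, 1, 1, 2]  # made-up coded responses

counts = Counter(genders)                 # frequency of each code
total = sum(counts.values())
percentages = {code: 100 * n / total for code, n in counts.items()}

# Print a small frequency table: label, count, percent.
for code in sorted(counts):
    print(f"{GENDER_LABELS[code]:<8}{counts[code]:>3}{percentages[code]:>7.1f}%")
```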
16
Central Tendency
Mean = average value
Sum of all the values divided by the number of values
Mode = the most frequent value in a distribution
(N.B. it is possible to have 2 or more modes, e.g. bimodal distribution)
Median = the half-way value, or the value that divides the ordered distribution in the middle
The middle score when scores are ordered
N.B. need to put values into order first
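The three measures can be checked with Python's standard `statistics` module. The scores here are made up; `multimode` is used because, as noted above, a distribution can have more than one mode.

```python
import statistics

scores = [1, 2, 3, 3, 4]  # made-up scores

mean = statistics.mean(scores)        # sum of values / number of values
median = statistics.median(scores)    # middle value once sorted
modes = statistics.multimode(scores)  # list of the most frequent value(s)

# A bimodal example: two values tie for most frequent.
bimodal_modes = statistics.multimode([1, 1, 2, 3, 3])

print(mean, median, modes, bimodal_modes)
```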
17
Dispersion and variability
Quartiles
The three values that split the sorted data into four equal parts.
Second quartile = median.
Lower quartile = median of lower half of the data
Upper quartile = median of upper half of the data
Need to order the individuals first
One quarter of the individuals fall in each of the four parts, so the inter-quartile range (lower to upper quartile) contains the middle half
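The "median of each half" rule can be sketched as below. The ages are invented, and the function `quartiles` is a hypothetical helper; statistical packages, SPSS included, may use interpolation rules that give slightly different quartile values for the same data.

```python
import statistics


def quartiles(values):
    """Return (lower quartile, median, upper quartile).

    Uses the median-of-each-half rule; for an odd number of values
    the middle value is left out of both halves.
    """
    ordered = sorted(values)          # must order the values first
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return (statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half))


ages = [19, 18, 25, 21, 30, 20, 22, 45]  # made-up ages
q1, q2, q3 = quartiles(ages)
iqr = q3 - q1                            # inter-quartile range
print(q1, q2, q3, iqr)
```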
18
Used on Box Plot
Upper quartile
Lower quartile
Median
Age of Health and Illness students
19
Variance
Average of the squared deviations from the mean
Score   Mean   Deviation   Squared Deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total                      5.20
5.20 is the Sum of Squares
This depends on number of individuals so we divide by n (5)
Gives 1.04 which is the variance
20
Standard Deviation
The variance has one problem: it is measured in units squared.
This isn’t a very meaningful metric so we take the square root value.
This is the Standard Deviation
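The worked example (scores 1, 2, 3, 3, 4) can be reproduced step by step. The sketch divides by n as the slide does; note that statistical packages, SPSS among them, normally report the sample version that divides by n - 1, so their output for the same data will be a little larger.

```python
import math

scores = [1, 2, 3, 3, 4]          # the five scores from the worked example
n = len(scores)

mean = sum(scores) / n                                   # 2.6
deviations = [x - mean for x in scores]                  # score minus mean
sum_of_squares = sum(d ** 2 for d in deviations)         # 5.20
variance = sum_of_squares / n                            # divide by n, as on the slide
standard_deviation = math.sqrt(variance)                 # back in the original units

# Sample version (divide by n - 1), shown for comparison only.
sample_variance = sum_of_squares / (n - 1)

print(round(sum_of_squares, 2), round(variance, 2), round(standard_deviation, 3))
```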
21
Using SPSS
‘Analyze > Descriptive Statistics > Explore’ menu.
Gives mean, median, SD, variance, min, max, range, skew and kurtosis.
Can also produce stem and leaf, and histogram.
22
Charts in SPSS
Use ‘Chart Builder’ from the ‘Graphs’ menu, or the Legacy Dialogs submenu
And/or double click chart to edit it.
E.g. double click to edit bars (e.g. to change from colour to fill pattern).
Do this in SPSS first before cut and paste to Word
Label the chart (in SPSS or in Word)
23
Stem and leaf plots
e.g. age of students taking Health and Illness option
Good at showing
distribution of data
outliers
range
24
Stem and leaf plots e.g.
25
Box Plot
26
Box Plot
Fill colour changed. N.B. numbers refer to case numbers.
27
Histograms and bar charts
Length/height of bar indicates frequency
28
Histogram
Fill pattern suitable for black and white printing
29
Changing the bin size
Bin size made smaller to show more bars
30
Pie chart
angle of segment indicates proportion of the whole
Pie Chart
Shadow and one slice moved out for emphasis
Analysing relationships
Contingency tables or crosstabulations
Compares nominal/categorical variables
But can include ordinal variables
N.B. table contains counts (= frequency data)
One variable on horizontal axis
One variable on vertical axis
Row and column total counts known as marginals
Example
In the Health and Illness class, are women more likely to be under 21 than men?
Crosstabulations
e.g.
Use column and row percentages to look for relationships
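To make the idea of counts, marginals and row percentages concrete, here is a small Python sketch. The 15 cases are invented and are not the Health and Illness class data.

```python
from collections import Counter

# Made-up cases: (gender, age group) for each respondent.
cases = (
    [("Female", "Under 21")] * 6 + [("Female", "21 and over")] * 4
    + [("Male", "Under 21")] * 2 + [("Male", "21 and over")] * 3
)

rows = ["Female", "Male"]
cols = ["Under 21", "21 and over"]

counts = Counter(cases)  # cell counts of the contingency table

# Marginals: row and column totals.
row_totals = {r: sum(counts[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}

# Row percentages: each cell as a percentage of its row total.
row_percent = {(r, c): 100 * counts[(r, c)] / row_totals[r]
               for r in rows for c in cols}

for r in rows:
    cells = "  ".join(f"{c}: {counts[(r, c)]} ({row_percent[(r, c)]:.0f}%)"
                      for c in cols)
    print(f"{r:<7}{cells}  total {row_totals[r]}")
```

Comparing the row percentages (here 60% of women against 40% of men under 21) is what suggests whether a relationship might exist; the counts alone can mislead when the rows have different totals.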
SPSS output
Chi-square (χ²)
Crosstabulations and Chi-square can be used to look for a relationship between two variables
when the variables are categorical, so the data are nominal (or frequency).
For example, if we wanted to look at the relationship between gender and age.
There are several different types of Chi-square (χ²); we will be using the 2 x 2 Chi-square.
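As a hedged illustration of what the 2 x 2 Chi-square computes, the sketch below works out expected counts and the Pearson χ² statistic by hand for an invented table. It does not compute a p-value or the continuity correction; for 1 degree of freedom the statistic is compared with the 0.05 critical value of 3.84.

```python
# Made-up 2 x 2 table of observed counts (rows = one variable, columns = the other).
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for a cell = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # 1 for a 2 x 2 table
smallest_expected = min(e for row in expected for e in row)
significant = chi_square > 3.84  # critical value for df = 1, alpha = 0.05

print(round(chi_square, 2), df, smallest_expected, significant)
```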
2x2 Chi-square results in SPSS
Another example
The Bank employees data
Bank Employees: Chi-Square tests
Chi-Square analysis on SPSS
http://www.youtube.com/watch?v=Ahs8jS5mJKk 4m15s
http://www.youtube.com/watch?v=IRCzOD27NQU
From 6m:30s to 9m:50s
http://www.youtube.com/watch?v=532QXt1PM-Q&feature=plcp&context=C3ba91a4UDOEgsToPDskJ-ABupdp-Yfvuf4j4fJGzV 12m30s
Low values in cells
Get SPSS to output expected values
Look where these are <5
Consider recoding to combine columns or rows
Tabulating questionnaire responses
Categorical survey data often “collapsed” for purposes of data analysis
Original category   Frequency   Collapsed category   Frequency
White British       284         White                304
White Irish         7
Other White         13
Indian              40          South Asian          105
Pakistani           32
Bangladeshi         33
Chinese             16          Chinese              16
Black British       30          Black                44
Afro-Caribbean      12
African             2
An analysis on a sample of 2 (e.g. Black African) would not have been very meaningful!
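Collapsing is just a recode from each original category to a broader one, followed by re-counting. The sketch below uses the frequencies from the table above; the dictionary names are made up for illustration.

```python
# Frequencies of the original categories, from the table.
original_counts = {
    "White British": 284, "White Irish": 7, "Other White": 13,
    "Indian": 40, "Pakistani": 32, "Bangladeshi": 33,
    "Chinese": 16,
    "Black British": 30, "Afro-Caribbean": 12, "African": 2,
}

# Recode map: original category -> collapsed category.
collapse_map = {
    "White British": "White", "White Irish": "White", "Other White": "White",
    "Indian": "South Asian", "Pakistani": "South Asian", "Bangladeshi": "South Asian",
    "Chinese": "Chinese",
    "Black British": "Black", "Afro-Caribbean": "Black", "African": "Black",
}

# Add each original frequency to its collapsed category.
collapsed_counts = {}
for category, count in original_counts.items():
    group = collapse_map[category]
    collapsed_counts[group] = collapsed_counts.get(group, 0) + count

print(collapsed_counts)
```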
Recoding variables
http://www.youtube.com/watch?v=uzQ_522F2SM&feature=related
Ignore t-test for now 6m11s
http://www.youtube.com/watch?v=FUoYZ_f6Lxc
Uses old version of SPSS, no submenu now. 6m
Scatterplots and correlations
Looks for association between variables, e.g.
Population size and GDP
Crime and unemployment rates
Height and weight
Both variables must be rank, interval or ratio (scale or ordinal in SPSS).
Thus cannot use variables like gender, ethnicity, town of birth, occupation.
44
45
Scatterplots
e.g. age (in years) versus Number of GCSEs
Interpretation
As Y increases X increases
Called correlation
Regression line model in red
46
Correlation measures association not causation
The older the child the better s/he is at reading
The less your income the greater the risk of schizophrenia
Height correlates with weight
But weight does not cause height
Height is one of the causes of weight (also body shape, diet, fitness level etc.)
Number of ice creams sold is correlated with the rate of drowning
Ice creams do not cause drowning (nor vice versa)
Third variable involved: people swim more and buy more ice creams when it’s warm
47
Scatterplot in SPSS
Use Graph menu
http://www.youtube.com/watch?v=74BjgPQvIEg 8m34s
http://www.youtube.com/watch?v=blfflA-34pQ&feature=related 4m04s
http://www.youtube.com/watch?v=UVylQoG4hZM 1m50s, ignore polynomial regression
48
Modifying the Scatterplot
http://www.youtube.com/watch?v=803YCYA2AoQ&feature=related 4m04s
http://www.youtube.com/watch?v=vPzvuMuVXk8&feature=related 3m40s
49
If mixed data sets
Change point icon and/or colour to see different subsets.
Overall data may have no relationship but subsets might.
E.g. show male and female respondents. Use Chart builder
50
51
Correlation
Correlation coefficient = measure of strength of relationship, e.g. Pearson’s r
Size varies from 0 to 1, with a plus or minus sign for direction (so r runs from -1 to +1)
52
Positive correlation
as x increases, y increases
r = 0.7
53
Negative correlation
as x increases, y decreases
r = -0.7
54
Strong correlation (i.e. close to 1)
r = 0.9
55
Weak correlation (i.e. close to 0)
r = 0.2
Interpretation cont.
r² is a measure of the proportion of variation in one variable accounted for by variation in the other.
E.g. If r=0.7 then r²=0.49 i.e. just under half the variation is accounted for (rest accounted for by other factors).
If r=0.3 then r²=0.09 so 91% of the variation is explained by other things.
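Pearson's r and r² can be computed by hand as a check on the interpretation above. The heights and weights below are invented, and `pearson_r` is a hypothetical helper written for this sketch; it gives no significance test.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their own variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


heights = [150, 160, 165, 170, 180]  # made-up heights (cm)
weights = [52, 58, 66, 64, 80]       # made-up weights (kg)

r = pearson_r(heights, weights)
r_squared = r ** 2                   # proportion of variation accounted for
print(round(r, 3), round(r_squared, 3))
```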
56
Significance of r
SPSS reports if r is significant at α=0.05
N.B. this is dependent on sample size to a large extent.
Other things being equal, larger samples are more likely to be significant.
Usually, size of r is more important than its significance
57
Pearson’s r in SPSS
http://www.youtube.com/watch?v=loFLqZmvfzU 6m57s
58
Parametric and non-parametric
Some statistics rely on the variables being investigated following a normal distribution. – Called Parametric statistics
Others can be used if variables are not distributed normally – called Non-parametric statistics.
Pearson’s r is a parametric statistic
Kendall’s tau and Spearman’s rho (rank correlation) are non-parametric.
59
Assessing normality
Produce histogram and normal plot
60
Use statistical test
SPSS provides two formal tests for normality: Kolmogorov-Smirnov (K-S) and Shapiro-Wilk (S-W)
But, there is debate about K-S
Extremely sensitive to departure from normality
May erroneously imply parametric test not suitable, especially in small samples
So, always use a histogram as well.
61
Often can use parametric tests
Parametric tests (e.g. Pearson’s r) are robust to departures from normality
Small, non-normal samples OK
But use non-parametric if
Data are skewed (questionnaire data often are)
Data are bimodal
62
Spearman’s rho
http://www.youtube.com/watch?v=r_WQe2c-ISU From 4.14 to 4.56
http://www.youtube.com/watch?v=POkFi5vKvI8&feature=fvwrel 6m16s
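Spearman's rho is a rank correlation: one common way to compute it is to replace each value by its rank (ties share the average rank) and then take Pearson's r of the ranks. The sketch below does that on invented data; the helper names are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def spearman_rho(xs, ys):
    """Pearson's r applied to the ranks of xs and ys."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


x = [10, 20, 30, 40, 50]  # made-up scores
y = [1, 3, 2, 5, 4]
rho = spearman_rho(x, y)
print(round(rho, 3))
```

Because only the ordering matters, a curved but steadily rising relationship (for example y = x squared on positive x) still gives rho = 1.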
63
So far…
Looked at relationships between nominal variables
Gender vs age group
Looked at relationships between scale variables
Height vs. Weight
Now combine the two
Groups vs a scale variable
E.g. Gender vs income
64
Reminder – IV vs DV
IV = independent variable
What makes a difference, causes effects, is responsible for differences.
DV = dependent variable
What is affected by things, what is changed by the IV.
Gender vs income. Gender = IV, income = DV
So we investigate the effect of gender on income
65
Example 1: Age group vs. no. of GCSEs
Using the Health and Illness class data
Age group defines 2 groups
Under 21
21 and over
Just two groups
Can use independent samples t-test
Independent because the two groups consist of different people.
t-test compares the means of the 2 groups.
66
67
Difference of means
Do under 21s have more or fewer GCSEs than 21 and overs?
Means are different (6.44 & 4.28) but is that significant?
68
Levene’s test: no significant difference in variances, therefore assume equal variances
t-test: means are statistically significantly different
Parametric vs non-parametric
Just as in the case of correlations, there are both kinds of tests.
Need to check if DV is normally distributed.
Do this visually
Also use statistical tests
69
Tests for normality
Kolmogorov-Smirnov and Shapiro-Wilk
If n>50 use KS
If n≤50 use SW
Null hypothesis is ‘data are normally distributed’.
So if p<0.05 then data are significantly different from a normal distribution: use non-parametric tests
If p≥0.05 then no significant difference: use parametric tests
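The decision rule above can be written out as two small functions. This sketch only encodes the rule of thumb from the slide; it does not run either normality test, and the function names and the example p-value are made up.

```python
def choose_normality_test(n):
    """Slide's rule of thumb: K-S for n > 50, Shapiro-Wilk for n <= 50."""
    return "Kolmogorov-Smirnov" if n > 50 else "Shapiro-Wilk"


def choose_test_family(p_value, alpha=0.05):
    """Null hypothesis: data are normally distributed.

    p < alpha  -> significantly different from normal -> non-parametric tests
    p >= alpha -> no significant difference           -> parametric tests
    """
    return "non-parametric" if p_value < alpha else "parametric"


# Example: 30 cases, and a hypothetical p-value of 0.20 read from the output.
print(choose_normality_test(30), choose_test_family(0.20))
```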
70
Checking normality
Produce histogram of DV
Tick box to undertake statistical test
Interpret results.
71
t-test
Identify your two groups.
Determine what values in the data indicate those two groups (e.g. 1=female, 2=male)
Select Analyze: Compare Means: Independent samples t-test
http://www.youtube.com/watch?v=_KHI3ScO8sc 9m40s
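For readers who want to see what the independent samples t-test calculates, here is a minimal hand computation assuming equal variances (the pooled-variance form). The GCSE counts are invented, and the sketch stops at t and the degrees of freedom; it does not produce the p-value that SPSS reports.

```python
import math
import statistics


def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_variance = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


under_21 = [7, 6, 8, 5, 6]     # made-up numbers of GCSEs
over_21 = [4, 5, 3, 5, 4]

t, df = independent_t(under_21, over_21)
# For df = 8 the two-tailed 0.05 critical value of t is about 2.306.
print(round(t, 3), df, abs(t) > 2.306)
```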
72
Mann-Whitney U test
Use this when comparing two groups and the DV is not normally distributed
http://www.youtube.com/watch?v=7iTvv3m9d_g 3m45s
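The Mann-Whitney U test works on ranks rather than raw values. A rough sketch of the U statistic, using invented scores and hypothetical helper names, is given below; it does not compute the significance level.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def mann_whitney_u(group_a, group_b):
    """Return U, the smaller of the two groups' U values."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))  # rank all scores together
    rank_sum_a = sum(ranks[:n_a])                         # ranks belonging to group A
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


group_a = [3, 5, 8, 9, 12]  # made-up scores
group_b = [1, 2, 4, 6, 7]
u = mann_whitney_u(group_a, group_b)
print(u)
```

A U of 0 means the two groups do not overlap at all; larger values mean more overlap.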
73
Comparing 3 or more groups
ANOVA = Analysis of Variance
Analyze: Compare Means: One-way ANOVA
http://www.youtube.com/watch?v=wFq1b3QjI1U 4m04s
Useful to get table of means (descriptives) and means plots from ANOVA options.
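The F value in a one-way ANOVA compares variation between the group means with variation within the groups. The sketch below computes the group means and F by hand for three invented groups; the function name is an assumption and no p-value is calculated.

```python
def one_way_anova(groups):
    """Return (group means, F, df between, df within) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Between-groups: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-groups: how far each score is from its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return means, f_value, df_between, df_within


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up scores for 3 groups
means, f_value, df_between, df_within = one_way_anova(groups)
print(means, round(f_value, 2), df_between, df_within)
```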
74
ANOVA Means and F value
75
ANOVA Means Plot
76