Post on 04-Feb-2021
Getting started with SPSS Maths and Statistics Help Centre
community project encouraging academics to share statistics support resources
All stcp resources are released under a Creative Commons licence
MASH SPSS sessions
Getting started with SPSS
Data sets used in this booklet
Statistical Analysis Cycle
Introduction to data
Data types
What is SPSS?
Opening an Excel file in SPSS
Titanic data
Exercise 1: Were wealthy people more likely to survive on the Titanic?
Labelling values
Summarising categorical data
Output in SPSS
Exercise 2: Who are the most dangerous drivers?
Research question 1: Were wealthy people more likely to survive on the Titanic?
Bar Charts
Tidying up a bar chart
Adjusting variables
Reducing the number of categories
Changing continuous to categorical variables
Exercise 3
Summary statistics and graphs: Continuous data
Averages
Measures of spread
Which summary statistics should be used
Ex 4: Comparison of continuous data by group
Exercise 5
Research question 2: Which of three diets was best?
Calculations using variables
Summary statistics for groups in tables
Scatterplots
Summary of descriptive and graphical statistics
Research question 3: Which variables are strongly related to birthweight?
Exercise 6
Exercise 7
Getting SPSS on your home computer
MASH contact details
Solutions to exercises
Data sets used in this booklet
All the data needed for this booklet is contained in the Excel file ‘all_data_for_MASH_workshops’. You will need to download this file from the MASH workshops web page and save it on your computer in order to use it. Once saved, close the file.
Save the file somewhere:
Datasets:
Dataset | Description
Titanic | List of 1309 passengers on board the Titanic when it sank, with details about them such as gender, class and whether they survived.
Diet | 78 people were put on one of three diets, with the goal of determining which diet was best.
Birthweight | Details for a number of babies and their parents, such as weight and length of the babies at birth and weight and height of the mother.
www.sheffield.ac.uk/mash/workshops
Statistical Analysis Cycle
Introduction to data
SECONDARY data is data collected by someone else, e.g. data from the National Student Survey.
PRIMARY data is data collected by the researcher, e.g. by producing a questionnaire. If you are producing a questionnaire, think very carefully about the questions.
QUANTITATIVE DATA is numeric and a variety of statistical techniques can be used to summarise and analyse the data.
QUALITATIVE data is collected using open ended questions such as ‘What do you like best about your course?’.
For all types of quantitative data, it is likely that it will end up in a spreadsheet with individuals/ subjects on rows and each column representing a variable e.g. answer to Q1 from a questionnaire or heart beat after running for 5 mins.
A variable is just a measurement which varies between subjects e.g. height or the answer to a question.
One variable per column
One subject per row
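As a rough illustration outside SPSS, the same layout can be pictured in Python as a list of records, where each record is one subject and each key is one variable. The variable names and values below are invented for the sketch and do not come from the workshop data.

```python
# Hypothetical example data: each dict is one subject (one row),
# and each key is one variable (one column). All values are invented.
subjects = [
    {"id": 1, "height_cm": 172.5, "q1_answer": 4},
    {"id": 2, "height_cm": 160.0, "q1_answer": 2},
    {"id": 3, "height_cm": 181.2, "q1_answer": 5},
]

# A variable is read "down a column": one value per subject.
heights = [row["height_cm"] for row in subjects]

print(len(subjects), "subjects; heights:", heights)
```

Keeping one subject per row means every variable has exactly one value per person, which is what the later summaries and charts rely on.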
Data types
In order to choose suitable summary statistics and analysis for the data, it is also important to distinguish between continuous (numerical) measurements and categorical variables. The choice of variable necessary to answer the main research questions should be considered at the planning rather than the analysis stage.
NOMINAL data is categorical data with no order. The labels just name the category.
Examples: Department; Marital status; What is your favourite animal? (Dog, Cat, Horse, Hamster, Fish, Other)
ORDINAL data has a recognisable order, e.g. 1st, 2nd, 3rd.
Likert scales are ordinal, e.g. Strongly disagree to Strongly agree.
The categories can be numbered, but the numbers are no different to names.
The gap between 1st and 2nd may be different to the gap between 2nd and 3rd.
DISCRETE data can only take whole numbers.
Examples: number of children; how many times you have been on holiday this year.
CONTINUOUS data can take any value on a scale.
Examples: height, or anything that can have decimals.
Discrete data is usually treated as continuous in analysis.
In most situations, the key distinction is between continuous/scale/ measurement data and categorical variables. Different summary statistics, charts and statistical tests are needed for the two types of variables. If discrete variables have a fairly large range of numbers, they can be treated as continuous for analysis purposes.
Data variables:
Measurements/scale: appear as meaningful numbers
  Continuous: takes any value, e.g. height
  Discrete/count: takes whole numbers, e.g. number of children in a family
Categorical: appear as categories
  Ordinal: meaningfully ordered, e.g. ‘agree strongly’ to ‘disagree strongly’ questions
  Nominal: no meaningful order, e.g. eye colour
What is SPSS?
SPSS is similar to Excel but it’s easier to produce charts and carry out analysis. To open SPSS, select IBM
SPSS statistics from ‘All programs’. Before opening, an additional screen appears. You can open a dataset
from this screen but it’s easiest to just select ‘Type in data’ every time. Data can be opened after SPSS is
opened.
Version 21 and below:
In version 22, select ‘New Dataset’ and ‘OK’.
Example of data sheet in SPSS
Opening an Excel file in SPSS
Important note: There must be only one row with headings in for SPSS to open an Excel file correctly.
If SPSS is not open, open SPSS. When prompted to open a file, select type in data.
Variable headings can only appear at the top in the blue boxes.
Unlike Excel, you can only have one dataset on each page of SPSS. A new file must be created for each individual data set.
To open any file in SPSS, select File → Open → Data. Here we are opening the ‘Titanic’ data, which is currently in Excel. Note: The Excel file must not be open on your computer.
Select ‘Excel’ as ‘Type of file’.
SPSS only opens one sheet of data at a time, so select the required sheet containing the Titanic data.
Once the data is in SPSS, save the SPSS data file using File → Save As. Save again after making changes to the data.
Titanic data
The ship ‘The Titanic’ sank in 1912 along with most of its passengers and crew. The data set that we have contains information on 1309 passengers.
Exercise 1: Were wealthy people more likely to survive on the Titanic?
Once the data set is open on your computer, give the following variables suitable labels, label the values
for categorical variables and select the correct data type.
Variable name | Variable label | Value label | Data type
pclass | Class | 1 = 1st, 2 = 2nd, 3 = 3rd |
survived | | 0 = Died, 1 = Survived |
Residence | Country of Residence | 0 = American, 1 = British, 2 = Other |
age | | |
sibsp | Number of siblings/spouses | |
parch | Number of parents/children on board | |
fare | Price of ticket | |
Gender | Gender | 0 = male, 1 = female |
a) Which variables would you use to investigate the research question ‘Were wealthy people more
likely to survive the sinking of the Titanic’?
There are two sheets for each dataset. The ‘Data View’ sheet is where the numbers are entered and the
‘Variable View’ sheet is where the variables are named and defined. The option to choose between Data
and Variable View is in the bottom left hand corner. For data in categories, type numbers in the Data View
sheet and then label the numbers in ‘Variable View’.
Select Variable View to label the variables/values.
There should be one row per person, not one row per group.
Variable view: Label the variables
The variable name has restrictions: it cannot contain spaces or certain characters. Use the ‘Label’ column to give sensible variable descriptions, which will appear in all output. If the label is blank, the variable name will appear in output.
For example, sibsp is ‘Number of siblings/spouses on board’, parch is ‘Number of parents/children on board’ and fare is ‘Price of ticket’.
Labelling values
It is best to have your categories coded as numbers for analysis in SPSS but for your output, people need to
know what the numbers mean. Go to the ‘Values’ column in Variable View, let the mouse hover until you
see a blue square. Clicking the square gives the ‘Value labels’ box. In the value box, put the number and
the label for that number in the label box. Click on ‘Add’ after each label and ‘Ok’ when finished.
Also, when using secondary data, watch for odd values, such as -99 indicating a missing value. These can be identified in the ‘Missing’ column of Variable View so that they are not taken into account in any analysis.
Label the categories by selecting the blue box.
0 = Died and 1 = Survived. Click on ‘Add’ after each one.
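The idea of value labels and missing-value codes can be sketched outside SPSS in a few lines of Python. The 0/1 labels come from the table above; the raw values and the helper name `label_value` are invented for illustration, and -99 is assumed to be the missing-value code as in the example in the text.

```python
# Codes and labels for 'survived', as given in the booklet's table.
survived_labels = {0: "Died", 1: "Survived"}

# Assumption for illustration: -99 marks a missing value.
MISSING_CODES = {-99}

def label_value(code, labels, missing_codes=MISSING_CODES):
    """Return the label for a code, or None if the code marks a missing value."""
    if code in missing_codes:
        return None
    return labels[code]

raw = [1, 0, -99, 1]  # invented raw codes
labelled = [label_value(code, survived_labels) for code in raw]
print(labelled)
```

Storing the numbers and looking up the labels only for display mirrors what SPSS does: the analysis uses the codes, while the output shows the labels.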
Note: There are two variables for gender. ‘Sex’ is a string variable (words) whereas ‘Gender’ has 0 for males and 1 for females so should be used during analysis.
Variable Type: SPSS only analyses Numeric variables. String means it’s a word. The width is the number of numbers/letters allowed for that variable.
Decimals: When typing in data, the default number of decimals is 2. Change this to 0 for categorical and discrete data.
The Measure column is where the data type is entered. Continuous/ discrete are called Scale in SPSS. SPSS won’t allow certain analysis for the wrong type of variable.
Summarising categorical data
The simplest way to summarise a single categorical variable is by using frequencies or percentages.
Analyse → Descriptive statistics → Frequencies
Output in SPSS
Charts, tables and analysis appear in a separate ‘output’ window in SPSS. The output window is brought to the front of the screen when analysis/ charts etc are requested. The left hand column shows all of the output produced in that session. The output file has to be saved separately to the data file.
To go back to the data file, select it on the bottom toolbar.
Use the Valid Percent column as it does not include missing values.
Move the variables to be summarised from the list on the left hand side to the right using the arrow in the middle.
Move the variable for the number of parents/children on board and survival to the right hand side and click ‘OK’ to run the analysis.
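To make the difference between ‘Percent’ and ‘Valid Percent’ concrete, here is a minimal Python sketch of a frequency table. It is not what SPSS runs internally; the function name `frequency_table` and the five values are invented, and missing values are represented by `None`.

```python
from collections import Counter

def frequency_table(values):
    """Count each category. 'percent' uses all rows; 'valid_percent' skips missing (None)."""
    counts = Counter(v for v in values if v is not None)
    n_total = len(values)
    n_valid = sum(counts.values())
    table = {}
    for category, count in sorted(counts.items()):
        table[category] = {
            "count": count,
            "percent": 100 * count / n_total,
            "valid_percent": 100 * count / n_valid,
        }
    return table

values = ["Died", "Survived", "Died", None, "Died"]  # invented values, one missing
table = frequency_table(values)
for category, row in table.items():
    print(category, row)
```

With one missing value out of five, ‘Died’ is 60% of all rows but 75% of the valid rows, which is why the Valid Percent column is the one to report.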
As SPSS produces a lot of output for analysis and you may produce several charts before you decide which one is best, copying the output you require for your project and pasting into a Word document is preferable.
Quick question: What percentage of people survived the sinking of the Titanic?
Exercise 2: Who are the most dangerous drivers?
Often we are interested in looking at the relationship between two variables. We start by investigating how age and gender relate to the number of car accidents in the UK. Stacked or multiple bar charts can summarise this type of information. The following multiple bar chart is taken from an article in the Guardian.
http://www.theguardian.com/politics/reality-check/2013/oct/11/dangerous-drivers-how-old-uk-age-18
a) Which gender is most likely to have an accident?
b) Which age group is most likely to have an accident?
c) The point of the chart should have been to look at how likely people were to have an accident by age and gender. What is wrong with the chart regarding addressing this research question?
Research question 1: Were wealthy people more likely to survive on the Titanic?
In general, using percentages to summarise categorical data is preferable although in the case of small numbers, percentages can be misleading e.g. ‘100% of people agree that mascara A is better than mascara B’ when only 2 people have been asked!
Suitable charts for categorical data are bar charts and pie charts.
A contingency table is a way of summarising two categorical variables. However, care needs to be taken
with comparing groups of different sizes.
If class had an effect on survival, a higher percentage of people in one class would have survived. If class
had no effect roughly the same percentage would have survived in each class.
To break down survival by class, a crosstabulation or contingency table is needed. Percentages are usually
preferable to frequencies but remember to include counts for small sample sizes. Choose either row or
column percentages carefully.
Analyse → Descriptive statistics → Crosstabs
1) Select the variable class and move it to the ‘Row’ box. Move survival to the ‘Column’ box.
2) Move selected variables using the arrow.
3) Select ‘Cells’ to get the % options. Choose row %’s.
4) Select ‘OK’ when finished and the table appears in the output window.
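The row-percentage calculation behind a crosstabulation can be sketched in Python as follows. This is only an illustration of the arithmetic, not the SPSS procedure; the function name `crosstab_row_percent` and the eight passengers are invented and are not the Titanic data.

```python
from collections import Counter

def crosstab_row_percent(rows, cols):
    """Return {row_level: {col_level: row percentage}} for two categorical variables."""
    counts = Counter(zip(rows, cols))   # count of each (row, column) pair
    row_totals = Counter(rows)          # number of subjects in each row level
    col_levels = sorted(set(cols))
    table = {}
    for r in sorted(set(rows)):
        table[r] = {c: 100 * counts[(r, c)] / row_totals[r] for c in col_levels}
    return table

# Invented example: class (1 or 3) and survival (0 = Died, 1 = Survived).
pclass = [1, 1, 1, 1, 3, 3, 3, 3]
survived = [1, 1, 1, 0, 0, 0, 0, 1]

table = crosstab_row_percent(pclass, survived)
print(table)
```

Because each row is divided by its own total, the percentages within a class add up to 100, so classes of different sizes can be compared directly.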
Bar Charts
Plotting graphs in SPSS is much easier than in Excel. All graphs can be accessed through Graphs → Legacy Dialogs. There is a chart builder option but the legacy dialogs options are more user friendly.
To display the information from the cross-tabulation graphically, use either a stacked or clustered bar chart. Both of these can be accessed through Graphs → Legacy Dialogs → Bar.
Tidying up a bar chart
Double click on the chart to open an editing window.
Selecting this turns the bars into 100% for each class.
Variable across the x-axis
Variable to split the bars
The font in graphs is usually small so adjust the axes titles etc. Select each axis and change the font size to 12. The axis titles and percentages displayed on the bars can also be changed in this way.
Select this to add labels.
% is more useful so move it to the displayed box and remove count. Use Number Format to reduce to 0 decimal places.
Finally, give the chart a title and change the label on the y axis from ‘Count’ to ‘Percentage’.
When finished, close the chart editor to return to the main output window. Right click on the chart in the output window, copy it and paste it into Word. Sometimes you may need to select ‘Copy Special’ to move charts.
Pasting as a picture enables easy resizing of graphs/ output in Word.
It is clear from the bar chart that the percentage of those dying increased as class lowered. 38% of passengers in 1st class died compared to 74% in 3rd class. Is this a significant difference? To answer this, hypothesis testing is needed.
Adjusting variables
Reducing the number of categories
Sometimes categories can be merged if not all the information is needed. For example, a common summary is to calculate the percentage who agreed from a Likert scale i.e. % agree or strongly agree compared to everything else.
Use ‘Recode into different variables’ rather than ‘Recode into same variables’ so that the re-coding can be checked.
If there are numerous variables to be recoded in the same way, transfer several variables at the same time. Each variable needs an individual name though. Click ‘Change’ after each new name.
Here a new variable is created where 0 = 3rd class and 1 = 1st or 2nd class.
Transform → Recode into different variables
Select ‘Continue’ and then ‘OK’ to produce the new variable. Then label 0 = 3rd class and 1 = 1st or 2nd class in the value label box in Variable View. Finally do a cross-tabulation of the old and new variables to check the re-coding is correct.
All 1st and 2nd class passengers have been correctly recoded as ‘1st or 2nd class’.
Screenshot annotations:
Move ‘class’ across.
Give the new variable a name, then click ‘Change’.
Enter the old value and the new value. You must click ‘Add’ after each change to add it to the ‘Old → New’ box.
Dialog labels: Old variable, New variable.
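The recode-then-check pattern described above can be sketched in Python. The mapping (3rd class to 0, 1st or 2nd class to 1) is taken from the text; the passenger values and the name `class_binary` are invented for illustration.

```python
from collections import Counter

# Recode rule from the text: 3rd class -> 0, 1st or 2nd class -> 1.
RECODE = {1: 1, 2: 1, 3: 0}

pclass = [1, 2, 3, 3, 2, 1, 3]  # invented values

# Create a NEW variable and leave the old one untouched,
# so the recoding can be checked afterwards.
class_binary = [RECODE[c] for c in pclass]

# Cross-tabulate old against new: every (old, new) pair should match the rule.
check = Counter(zip(pclass, class_binary))
for (old, new), count in sorted(check.items()):
    print(f"old={old} new={new} count={count}")
```

Keeping the original variable is what makes the check possible; recoding in place would leave nothing to compare against.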
Changing continuous to categorical variables
Although it is not recommended as information is lost, continuous (scale) variables can be categorised. Here we will create a new variable identifying children of 12 and under within the Titanic data set.
Go to variable view and label 0 as ‘Adult’ and 1 as ‘Child’.
Use ‘Crosstabs’ for the old and new variable to check the re-coding is correct i.e. age vs Child to see all those of 12 and under are classified as a child.
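As a sketch of the same categorisation outside SPSS, the Python function below turns an age into the 0 = Adult, 1 = Child coding with a cut-off of 12 and under. The function name `is_child` and the sample ages are invented; leaving missing ages as missing is an assumption made for the sketch.

```python
def is_child(age, cutoff=12):
    """Return 1 (Child) if age <= cutoff, 0 (Adult) otherwise; missing ages stay missing."""
    if age is None:
        return None
    return 1 if age <= cutoff else 0

ages = [0.75, 12, 12.5, 30, None]  # invented ages, one missing
child = [is_child(a) for a in ages]

# Check the recoding by listing old and new values side by side.
for age, flag in zip(ages, child):
    print(age, "->", flag)
```

The boundary case matters: an age of exactly 12 is coded as a child, while 12.5 is coded as an adult.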
Exercise 3
Were Americans more likely to survive than the British? Produce suitable summary statistics/ charts to
investigate this.
Steps shown on the recode screenshot for age:
1. Move ‘age’ across.
2. Give the new variable a name, then click ‘Change’.
3. Old values of age up to 12 are now going to be 1.
4. New value.
5. You must click ‘Add’ after each change to add to the ‘Old → New’ box.
Summary statistics and graphs: Continuous data
Continuous variables can be summarised using statistics such as the mean, median, standard deviation, minimum and maximum values. For continuous data, plotting a histogram gives an idea of the shape and spread of the distribution as well as assessing whether the variable is normally distributed. Box-plots can also be used and are particularly useful when comparing groups. The minimum and maximum help check for outliers and possible data entry errors.
Averages
Mode: The value which occurs most often
Mean: Sum of the values / number of values
Median: The middle value of ordered data
Measures of spread
Example data (ordered): 7 7 8 8 9 10 13 13 13 14 30
Range = maximum value - minimum value = 30 - 7 = 23
Quartiles: These divide the data into 4 parts. 25% of values are below the lower quartile and 25% are above the upper quartile. The median is the 2nd quartile.
Interquartile range = Upper quartile - lower quartile = 13 - 8 = 5
Quick question: 2 out of 3 people earn less than the average income
1. True
2. False
Figure labels: Lower quartile, Median, Upper quartile. 50% of subjects are below the median; 25% of subjects are above the upper quartile.
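The averages and measures of spread for the eleven example values can be checked with Python's standard `statistics` module. This is a sketch rather than the SPSS calculation; quartile conventions differ between packages, and the default method used here happens to give 8 and 13 for this data set.

```python
import statistics

data = [7, 7, 8, 8, 9, 10, 13, 13, 13, 14, 30]  # the example values from the text

mean = statistics.mean(data)            # sum of the values / number of values
median = statistics.median(data)        # middle value of the ordered data
modes = statistics.multimode(data)      # value(s) occurring most often

value_range = max(data) - min(data)     # maximum - minimum

# quantiles(..., n=4) returns the three cut points: lower quartile, median, upper quartile.
lower_q, _, upper_q = statistics.quantiles(data, n=4)
iqr = upper_q - lower_q

print("mean", mean, "median", median, "mode", modes)
print("range", value_range, "quartiles", lower_q, upper_q, "IQR", iqr)
```

The mean (12) sits above the median (10) because the single large value of 30 pulls it upwards, while the median and interquartile range are barely affected by it.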
Variance: Average of the squared deviations from the mean. A deviation is the difference between a single value and the mean.

\[ \text{Standard deviation} = \sqrt{\frac{\text{sum of squared differences}}{\text{no. observations} - 1}} \]

Calculating means and standard deviation
Example of calculating the mean and standard deviation:
X = exam score
Worked example table columns: Subject ID, Mean, Deviations from the mean.

\[ \text{Mean} = \frac{\text{sum of scores}}{\text{number of observations}} = \frac{132}{11} = 12 \]

\[ \text{SD} = \sqrt{\frac{\text{sum of squared deviations from the mean}}{\text{no. observations} - 1}} = \sqrt{\frac{426}{10}} = \sqrt{42.6} = 6.5 \]

The outlier contributes most of the deviation.
Both histograms on the left show approximately the same mean but the second has a much smaller standard deviation as it is less spread out.
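The standard deviation calculation can be traced step by step in Python. The scores below are assumed to be the eleven values from the ‘Measures of spread’ example; that assumption is made because they reproduce the totals in the worked example (sum of scores 132, sum of squared deviations 426).

```python
import math

# Assumption: the exam scores are the eleven example values used earlier.
scores = [7, 7, 8, 8, 9, 10, 13, 13, 13, 14, 30]

n = len(scores)
mean = sum(scores) / n                        # 132 / 11

# Deviation = value - mean; square each one so negatives do not cancel out.
squared_devs = [(x - mean) ** 2 for x in scores]
sum_sq = sum(squared_devs)                    # sum of squared deviations

variance = sum_sq / (n - 1)                   # divide by number of observations - 1
sd = math.sqrt(variance)

print("mean", mean, "sum of squares", sum_sq, "SD", round(sd, 1))
print("largest single squared deviation:", max(squared_devs))
```

The score of 30 alone contributes 324 of the 426 squared-deviation total, which is why a single outlier can inflate the standard deviation so much.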
Which summary statistics should be used
Means and standard deviations are commonly used to summarise continuous data although for skewed data, the median and quartiles are more appropriate. Skewed data can be assessed by plotting a histogram of continuous data. For large samples, we would expect a histogram to peak roughly in the middle. If the histogram peaks at one end or the other, the data is skewed. The histogram below shows male height which is normally distributed. This means that most people are in the middle and the spread is fairly symmetrical about the mean. For normally distributed data, the mean and the median are similar.
Positively skewed distribution: Mean > median
Negatively skewed distribution: Mean < median
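The mean-versus-median comparison can be written as a small Python helper. It is only a rule of thumb and not a formal test of skewness; a histogram should still be inspected. The function name `skew_hint` is invented, and the first data set is the eleven example values from the ‘Measures of spread’ section.

```python
import statistics

def skew_hint(values):
    """Rule-of-thumb label based only on comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "none"

data = [7, 7, 8, 8, 9, 10, 13, 13, 13, 14, 30]  # mean 12, median 10
print(skew_hint(data))
print(skew_hint([1, 9, 10, 10, 11]))            # invented values with a low outlier
```

A few very large values drag the mean above the median, and a few very small values drag it below.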
Quick question solution:
TRUE if you assume average is the mean: Two thirds of people earn less than the MEAN wage. As the chart below shows, the data is very skewed. There are a lot of people earning a low wage and a few very high earners pulling the mean up. In this situation, the median better represents the population as a whole.
Chart from ‘How does your wage compare with an MP’s’ http://news.bbc.co.uk/1/hi/8072031.stm
Chart labels: ‘2 out of 3 people’, ‘Mean’, ‘Median’. Histogram caption: ‘Normally distributed data’.
Ex 4: Comparison of continuous data by group
Did the cost of a ticket affect chances of survival?
a) Is there a big difference in average ticket price by group?
b) Which group has data which is more spread out?
c) Is the data skewed?
d) Is the mean or median a better summary measure?
Cost of ticket | Died | Survived
Mean | 23.35 | 49.36
Median | 10.50 | 26.00
Standard Deviation | 34.15 | 68.65
Interquartile range | 18.15 | 46.56
Minimum | 0.00 | 0.00
Maximum | 263.00 | 512.33
DATA: The data set ‘diet’ contains information on 78 people who undertook 1 of three diets. There is background information such as age and gender as well as weights before and after the diet.
Open the data set from Excel. Go into the Variable View and make sure that each variable is correctly categorised e.g. nominal. Note: continuous is called ‘Scale’ in SPSS. It is important that variables are correctly categorised as SPSS will only carry out some analysis on certain variable types.
There are several ways to produce summary statistics and charts. This option uses ‘Explore’ which contains
the most summary statistics to compare weight before the diet for males and females.
Analyse → Descriptive statistics → Explore
Exercise 5:
a) Fill in the following table using the summary statistics table in the output.
Statistic | Female = 0 | Male = 1
Minimum | -70 |
Maximum | 82 |
Mean | 64 |
Median | 66 |
Standard Deviation | 21.6 |
b) Interpret the summary statistics by gender. Which group has the higher mean and which group is more spread out?
Put ‘Pre-weight’ as the dependent variable and ‘Gender’ in the factor list. The summary statistics will be produced for each gender separately.
A box-plot shows the spread of a distribution of values. The box contains the middle 50% of values.
c) How could the chart be improved and is there anything odd?
Box-plot labels: Median = central line; Upper quartile; Lower quartile; Outlier.
Research question 2: Which of three diets was best?
Before the next section, change the error of -70 to 70. Outliers should not normally be changed unless they are clearly data entry errors as in this case.
Give the variables sensible labels and label gender with 0 = Female and 1 = Male.
Re-run explore to see how the change has affected the summary statistics. Which summary statistics have changed the most?
Statistic | Female with outlier | Female after changing outlier
Minimum | -70 |
Maximum | 82 |
Mean | 64 |
Median | 66 |
Standard Deviation | 21.6 |
Change -70 to 70 kg.
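To see why a single entry error matters, the following Python sketch compares summary statistics before and after correcting one value. The nine weights are invented for illustration and are not the diet data, so the numbers will not match the table; only the pattern is the point.

```python
import statistics

def summarise(values):
    """Return the summary statistics used in the exercise table."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
    }

# Invented weights in kg; -70 stands in for a data entry error.
weights = [58, 60, 62, 64, 66, 68, 70, 72, -70]
corrected = [70 if w == -70 else w for w in weights]

before = summarise(weights)
after = summarise(corrected)
for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this made-up example the mean and standard deviation change a great deal while the median moves only slightly, which is the pattern to look for when re-running Explore.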
Calculations using variables
Producing the charts for gender and weight before the diet was useful for demonstrating SPSS but the main question of interest is ‘Which diet led to greater weight loss?’. How could this be assessed? To answer this, a new variable ‘weight lost’ (weight before – weight after) would be useful. As spaces are
not allowed in variable names, use weightLOST as a name and give a better name in the label section in
variable view.
To do this use Transform → Compute Variable.
After putting the calculation into the ‘Numeric Expression’ box, select ‘OK’ and the new variable will appear last in the Data and variable view sheets. Before carrying out the official test of a difference, use summary statistics and charts to look at the differences.
Move ‘Preweight’ into the box, select ‘-’ and then move ‘Weight6week’ across.
Selecting ‘All’ gives you a lot of options for calculations, e.g. mean of several variables.
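The computed variable is simply weight before minus weight after, calculated for every person. A minimal Python sketch of that calculation follows; the variable names Preweight, Weight6week and weightLOST are taken from the text, while the three records are invented.

```python
# Invented records; the keys follow the variable names used in the booklet.
people = [
    {"Preweight": 60, "Weight6week": 57.5},
    {"Preweight": 82, "Weight6week": 80.0},
    {"Preweight": 71, "Weight6week": 72.0},
]

# New variable: weight lost = weight before - weight after.
for person in people:
    person["weightLOST"] = person["Preweight"] - person["Weight6week"]

print([person["weightLOST"] for person in people])
```

A negative value, as for the third person, means weight was gained rather than lost.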
Summary statistics for groups in tables
SPSS has a table function which can produce more complicated tables although it is a little temperamental and frustrating at times!
To open the table window: Analyse → Tables → Custom Tables.
Drag variables to either the row or column bars to include them in the table. If you want to create sub categories, drag the categorical variable to the front of the variable already in the table. By default, SPSS will choose means to summarise continuous (scale) variables and counts to summarise categorical variables. It is vital that variables are correctly defined as scale or categorical.
1) Move ‘weightLOST’ to the Rows section and ‘Diet’ to the Columns section.
2) Select the summary statistics you require.
3) Choose ‘Columns’ in the ‘Position’ options for a better display.
Which diet seems the best and which diet has the most variation in weight loss?
Selecting the ‘Summary Statistics’ button opens a window where options for statistics displayed can be chosen.
The summary statistics button will only highlight when a variable is selected in the main window. Here, make sure weightLOST is highlighted in yellow in the central window.
To change the summary statistics to appear down the side, select rows instead of columns from the position box.
Select Standard deviation and count from the options and click ‘Apply to all’.
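A table of count, mean and standard deviation by group can be sketched in Python as below. This only illustrates what the custom table contains; the function name `summarise_by_group` and the six (Diet, weightLOST) pairs are invented and are not the workshop data.

```python
import statistics
from collections import defaultdict

def summarise_by_group(records):
    """records: iterable of (group, value). Returns count, mean and SD for each group."""
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {
        group: {
            "count": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),  # sample SD; needs at least 2 values per group
        }
        for group, values in sorted(groups.items())
    }

# Invented (Diet, weightLOST) pairs.
records = [(1, 2.5), (1, 3.5), (2, 1.0), (2, 3.0), (3, 5.0), (3, 6.0)]
summary = summarise_by_group(records)
for diet, stats in summary.items():
    print("Diet", diet, stats)
```

Reading across the groups, the mean shows which diet had the largest average loss and the standard deviation shows which had the most variation.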
Scatterplots:
A scatterplot helps assess a relationship between two continuous (scale) variables by plotting a different point for each individual based on their scores on two variables. The closer the points fit a diagonal line, the stronger the relationship.
The scatterplot below shows a negative relationship between a person’s weight and the number of kilometres they run per week, i.e. the more they run, the lighter they generally are. There is one clear outlier who runs a lot but also weighs a lot.
Things to look for in a scatterplot:
How strong is the relationship? The closer the points form a line, the stronger the relationship.
Is there a negative or positive relationship?
Is the relationship linear? Do the points form a straight line?
Are there any outliers that could be data entry errors?
Chart annotations: Outlier; General linear trend downwards.
A scatterplot can be colour coded by a third categorical variable using the ‘Set marker by’ option within the Graphs → Legacy Dialogs → Scatter/Dot menu.
Here, we will look at the relationship between weight before and weight after the diet with different shapes for males and females.
Double click on the chart to open the edit window. To change the shape of the scatter, click on the scatter, then again on just one of the females to open the properties window. Change the marker type and size.
It is clear from the scatterplot that there is a strong positive relationship between a person’s weight before and after the diet. A positive relationship (uphill scatter) means that as the x (horizontal) variable (weight before diet) increases so does the y (vertical) variable. In a negative relationship, y decreases as x increases.
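The strength and direction seen in a scatterplot are usually summarised with a correlation coefficient. The Python sketch below computes Pearson's r by hand; it is an illustration rather than the SPSS output, the function name `pearson_r` is introduced here, and the five before/after weights are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented weights (kg) before and after a diet.
before = [60, 65, 70, 75, 80]
after = [58, 62, 68, 71, 77]

r = pearson_r(before, after)
print(round(r, 3))
```

Values close to +1 correspond to a tight uphill scatter, values close to -1 to a tight downhill scatter, and values near 0 to no linear relationship.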
Summary of descriptive and graphical statistics
Chart | Variable type | Purpose | Summary statistics
Pie chart or bar chart | One categorical variable | Shows frequencies/proportions/percentages | Class percentages
Stacked/multiple bar chart | Two categorical variables | Compares proportions within groups | Compare percentages within groups
Histogram | One continuous variable | Shows distribution of results | Mean and standard deviation
Scatter graph | Two continuous variables | Shows relationship between two variables and helps detect outliers | Correlation coefficient
Line chart | Continuous over time; continuous by group | Displays changes over time; comparison of group means | Frequencies; means
Confidence interval plot | Continuous dependent/categorical independent | Comparison of group means | Means and confidence intervals
Research question 3: Which variables are strongly related to birthweight?
Exercise 6:
a) Open the data set ‘birthweight’ from Excel. Label the variables with the labels in the table below.
b) What is the average birthweight? Is birthweight normally distributed?
c) Recode the variable mnocig (cigarettes smoked by the mother per day) into the following four
categories: 1 = Non-smoker, 2 = Light smoker (smokes 1 – 10 a day), 3 = Moderate smoker (11 – 20 a
day) and 4 = Heavy smoker (21+ a day).
d) Summarise birthweight by smoking category using suitable statistics and a graph.
e) Produce a scatterplot of birthweight and gestational age by smoking category. What is the
relationship between the variables?
Variable | Label | Variable type
id | Baby ID |
headcir | Head circumference (cm) |
leng | Length of baby (inches) |
weight | Baby's weight |
gest | Gestational age |
mage | Maternal age |
mnocig | No. cigarettes smoked per day by mother |
mheight | Maternal height |
mppwt | Mother's pre-pregnancy weight |
fage | Father's age |
fedyrs | Years father was in education |
fnocig | No. cigarettes smoked per day by father |
fheight | Father's height |
lowbwt | Low birth weight baby (1 = under 5lbs) |
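The recoding asked for in part c) comes down to three cut-offs at 0, 10 and 20 cigarettes a day. The sketch below shows that boundary logic in Python rather than SPSS. The function name `smoking_category` and the sample `mnocig` values are hypothetical; they are not taken from the birthweight data set.

```python
def smoking_category(cigs_per_day):
    """Map cigarettes smoked per day to the exercise's codes 1 to 4."""
    if cigs_per_day == 0:
        return 1  # non-smoker
    if cigs_per_day <= 10:
        return 2  # light smoker (1 to 10 a day)
    if cigs_per_day <= 20:
        return 3  # moderate smoker (11 to 20 a day)
    return 4      # heavy smoker (21+ a day)

VALUE_LABELS = {
    1: "Non-smoker",
    2: "Light smoker",
    3: "Moderate smoker",
    4: "Heavy smoker",
}

# Hypothetical mnocig values chosen to sit on each boundary.
mnocig = [0, 0, 7, 10, 11, 20, 21, 35]
codes = [smoking_category(c) for c in mnocig]

for cigs, code in zip(mnocig, codes):
    print(cigs, code, VALUE_LABELS[code])
```

Testing the values on each boundary (0, 10, 11, 20, 21) is a quick way to confirm that no one falls into the wrong category or is left uncoded.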
Exercise 7: Enter the following data into SPSS:
Housework = housework (hrs per week); Hours = hours worked per week

Women                                       Men
Age  Housework  Marital status  Hours       Age  Housework  Marital status  Hours
46   6          Married         35          55   10         Married         28
62   8          Married         7           61   0          Married         39
42   30         Married         7           39   2          Married         49
36   25         Married         18          38   3          Married         40
58   30         Married         23          58   4          Married         40
36   21         Married         22          31   6          Married         41
32   10         Married         24          54   7          Married         42
35   14         Married         32          33   4          Separated       45
33   3          Married         36          62   6          Divorced        38
41   12         Married         36          62   6          Widowed         37
31   14         Separated       22          31   2          Never married   35
50   25         Divorced        10          32   18         Never married   25
31   15         Widowed         15          42   20         Never married   35
a) Investigate the relationship between the amount of housework someone carries out per week and
each of the other variables using suitable charts. For scatterplots, have different markers for males
and females.
b) Create a new binary variable from ‘Hours worked per week’ to indicate whether someone is full
time or part time. Classify part time as under 30 hours.
c) Summarise the amount of housework carried out per week by working full/ part time using a table
and a plot and interpret.
Getting SPSS on your home computer
Go to the software download page and enter your university login and password:
https://cics.dept.shef.ac.uk/software/
To download software or renew license codes click the SPSS Statistics 19-22 button on the page that comes
up. You will receive an email containing a download link, a license code, installation instructions and legal
information. The download sometimes takes a long time!
MASH contact details
Book an appointment or access help sheets via our webpage: https://www.shef.ac.uk/mash
Statistics appointments are 10am – 1pm every day in term time with an additional session 4-7pm Wednesdays. For appointments outside of term time see our website or email mash@sheffield.ac.uk.
Solutions to exercises
Exercise 1: Identify the type of variables and key questions of interest for the Titanic dataset
Variable name | Variable label | Value label | Data type
pclass | Class | 1 = 1st, 2 = 2nd, 3 = 3rd | Ordinal
survived | | 0 = Died, 1 = Survived | Nominal
Residence | Country of Residence | 0 = American, 1 = British, 2 = Other | Nominal
age | | | Scale
sibsp | Number of siblings/spouses | | Scale
parch | Number of parents/children on board | | Scale
fare | Price of ticket | | Scale
Gender | Gender | 0 = Male, 1 = Female | Nominal (binary)
Were wealthy people more likely to survive? Which variables would you use to investigate this question?
Survival is the outcome. Wealthy could be measured using either class or price of ticket.
Exercise 2: Who are the most dangerous drivers?
Males and middle-aged people have more accidents. This may be because there are more male and middle-aged drivers on the road.
Percentages are better than frequencies. Given there are different numbers of drivers in each category and the categories are different widths, the best way to summarise is to compare the proportions within each category having accidents. It is clear that male drivers consistently have more accidents and that younger drivers are more likely to have accidents.
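The difference between comparing frequencies and comparing proportions can be sketched with a few lines of Python. All counts below are hypothetical and invented for illustration; they are not the figures from the drivers data set, and the age bands are assumptions.

```python
# Hypothetical counts per age band: (drivers on the road, drivers having an accident)
groups = {
    "17-24": (1000, 150),
    "25-59": (6000, 480),
    "60+": (2000, 120),
}

# Percentage of drivers within each band who had an accident.
rates = {band: 100 * accidents / drivers
         for band, (drivers, accidents) in groups.items()}

most_accidents = max(groups, key=lambda band: groups[band][1])
highest_rate = max(rates, key=rates.get)

for band, (drivers, accidents) in groups.items():
    print(f"{band}: {accidents} accidents, {rates[band]:.1f}% of drivers")
print("Most accidents:", most_accidents)
print("Highest accident rate:", highest_rate)
```

In this made-up example the middle band has the most accidents simply because it contains the most drivers, while the youngest band has the highest accident rate. That is the same reasoning used in the solution above.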
Exercise 3: Investigate whether nationality and survival were related
56% of Americans survived compared to 32% of British passengers and 32% of other nationalities.
Exercise 4: Comparison of continuous data by group
Did the cost of a ticket affect chances of survival?
a) Is there a big difference in average ticket price by group? Yes. The mean and median ticket prices are much higher in the group who survived.
b) Which group has data which is more spread out? The standard deviation is double in the group who survived, so there is much more variation in that group.
c) Is the data skewed? Yes – it’s very positively skewed. There are a lot of people with cheap tickets and not so many with expensive tickets.
d) Is the mean or median a better summary measure? The median, as the data is very skewed.
Cost of ticket by survival status (Survived?)
                      Died      Survived
Mean                  23.35     49.36
Median                10.50     26.00
Standard Deviation    34.15     68.65
Interquartile range   18.15     46.56
Minimum               0.00      0.00
Maximum               263.00    512.33
Exercise 5:
a) Fill in the following table using the summary statistics table in the output.
                      Female = 0    Male = 1
Minimum               -70           71
Maximum               82            88
Mean                  64            79
Median                66            79
Standard Deviation    21.6          5
b) Interpret the summary statistics by gender. Which group has the higher mean and which group is more spread out?
Standard deviation: The standard deviation for men (5) is much smaller than the standard deviation for women (21.6), so the weights for women are more spread out. However, the data entry error needs to be removed and the statistics run again.
Averages: Females had a mean weight of 64kg and a median of 66kg before the diet. There’s quite a difference between the two measures, suggesting that the data may be skewed. Males had a mean and median pre-diet weight of 79kg, suggesting that the data is roughly symmetrical, which is consistent with a normal distribution.
Minimum/maximum: Are there any extreme outliers? Someone weighed -70kg before the diet, which is clearly an error. Outliers cannot always be removed or changed, but here the real weight is clearly 70kg, so make that adjustment and re-run the analysis. What effect has this had on the summary statistics?
c) How could the chart be improved and is there anything odd? Better labelling of variables. Someone weighed -70kg, which is clearly wrong.
Before the next section, change the error of -70 to 70. Outliers should not normally be changed unless
they are clearly data entry errors as in this case. Give the variables sensible labels and label gender with 0
= Female and 1 = Male.
Re-run explore to see how the change has affected the summary statistics. Which summary statistics have changed the most?
Female with outlier Female after changing outlier
Minimum -70 58
Maximum 82 82
Mean 64 67
Median 66 67
Standard Deviation 21.6 5.6
The mean, standard deviation, minimum and maximum are more influenced by outliers than the median
and interquartile range.
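The same effect can be reproduced on a small made-up sample. The sketch below uses Python's standard `statistics` module; the weights are hypothetical (they are not the diet data set) and include one value mistyped as -70, and the `summarise` helper is an illustrative name.

```python
from statistics import mean, median, stdev, quantiles

def summarise(values):
    """Return mean, median, sample SD and interquartile range."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical weights (kg), invented for illustration; one was typed as -70.
with_error = [58, 60, 62, 64, 66, 67, 68, 70, 72, 82, -70]
corrected = [70 if w == -70 else w for w in with_error]

before = summarise(with_error)
after = summarise(corrected)
for name in before:
    print(f"{name:>6}: {before[name]:7.2f} -> {after[name]:7.2f}")
```

With these values the mean and standard deviation shift by far more than the median and interquartile range once the error is corrected. Note that `quantiles` uses one of several quartile definitions, so the interquartile range may differ slightly from the one SPSS reports.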
Exercise 6:
a) Open the data set ‘birthweight’ from Excel. Label the variables with the labels in the table below.
All the variables are continuous/discrete apart from ‘Low birth weight’, which is binary.
b) What is the average birthweight? Is birthweight normally distributed?
The smallest baby in the data set was 3.3 pounds and the largest 11.4 pounds. The mean birthweight is
7.52 pounds and the median 7.6 pounds. The histogram shows that birthweight is approximately normally distributed.
c) Recode the variable mnocig (cigarettes smoked by the mother per day) into the following four
categories: 1 = Non-smoker, 2 = Light smoker (smokes 1 – 10 per day), 3 = Moderate smoker (11 – 20
per day) and 4 = Heavy smoker (21+ per day).
d) Summarise birthweight by smoking category using suitable statistics and a graph.
The means of the groups are similar ranging from 6.97 for moderate smokers to 7.73 pounds for
non-smokers. The standard deviations are similar suggesting similar spread of birthweights within
each category.
For the plots, either a confidence interval plot or a boxplot would be useful representations of the
differences between the groups.
The boxplots show that the medians for the four
groups are fairly similar and the interquartile range
(middle 50% of the values) is of a similar width. Each
boxplot is fairly symmetrical about the median
suggesting the values are normally distributed within
each group.
Produce a scatterplot of birthweight and gestational
age. What is the relationship between the two?
There is a moderate positive relationship between
gestational age and birthweight but no clear
relationship between smoking and either weight or
gestational age. This means that as gestational age
increases, birthweight tends to increase. There is one
oddity though. A standard pregnancy is 40 weeks.
Most women are induced by 42 weeks but there seem
to be quite a few above 42 weeks. It’s likely that this is
old data perhaps from a time when gestational age
estimation was less accurate.
Exercise 7: Enter the following data into SPSS:
The data should have been entered like this and the categorical numbers labelled.
a) Investigate the relationship between the amount of housework someone carries out per week and
each of the other variables using suitable charts. For scatterplots, have different markers for males
and females.
The graph suggests a strong negative relationship between weekly hours of work and hours of housework. This means that the more hours someone works, the less housework they do. For males, both the amount of housework they do and the hours they work are less spread out.
There doesn’t appear to be a relationship between
age and the amount of housework someone does.
The highest medians are for those never married and those who are divorced. The data for those never married is very skewed. However, the sample size is small so not much can be concluded. How many are in each category?
The summary statistics show that there are only 2 or 3 people in most of the categories so using summary statistics could be misleading. Merging suitable categories would be advisable.
Produce a plot comparing those working full/ part time for hours of housework and interpret.
Hours per week on housework, by working status
                      Part time    Full time
Mean                  18.73        6.33
Median                18.00        6.00
Standard Deviation    8.01         5.27
There is clearly a difference in the amount of
housework carried out per week between those
working full and part time. Those working part time
carry out 19 hours of housework a week on average
compared to 6 hours a week by those working full
time. The amount of housework is more spread out for part time people (SD = 8 compared to SD = 5 for
full time workers).
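The table above can be checked outside SPSS. The sketch below is one possible way to do it in Python, using the 26 pairs of housework hours and hours worked from the Exercise 7 data and the exercise's cut-off of under 30 hours for part time. The variable names are illustrative.

```python
from statistics import mean, median, stdev

# (housework hrs per week, hours worked per week) for all 26 people in Exercise 7
people = [
    (6, 35), (10, 28), (8, 7), (0, 39), (30, 7), (2, 49), (25, 18),
    (3, 40), (30, 23), (4, 40), (21, 22), (6, 41), (10, 24), (7, 42),
    (14, 32), (4, 45), (3, 36), (6, 38), (12, 36), (6, 37), (14, 22),
    (2, 35), (25, 10), (18, 25), (15, 15), (20, 35),
]

PART_TIME_LIMIT = 30  # under 30 hours a week counts as part time

# Build the binary working-status variable and split housework hours by it.
groups = {"Part time": [], "Full time": []}
for housework, hours_worked in people:
    status = "Part time" if hours_worked < PART_TIME_LIMIT else "Full time"
    groups[status].append(housework)

for status, values in groups.items():
    print(
        f"{status}: n = {len(values)}, "
        f"mean = {mean(values):.2f}, "
        f"median = {median(values)}, "
        f"SD = {stdev(values):.2f}"
    )
```

This gives 11 part time and 15 full time workers, with means, medians and standard deviations that agree with the table. `stdev` is the sample standard deviation (dividing by n - 1), which is the version reported in the table.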