Distributions & Descriptive Statistics

Dr William Simpson
Psychology, University of Plymouth


Defining and measuring variables


Independent & dependent variables

• Independent variable (IV): something we manipulate in an experiment
• Dependent variable (DV): something we measure
• By manipulating the IV, we expect to produce a change in the DV

Scales of measurement

• Variables are classified according to type of scale
– The type of analysis depends on the type of scale
• From least to most informative: nominal, ordinal, interval, ratio

Nominal

• Nominal data: assign categorical labels to observations
• Not really measurement
• E.g. male/female; married/single/widowed/divorced
• Numbers on football jerseys

Ordinal

• Ordinal data: values can be ranked (ordered). Categorical but rankable
• E.g. small, medium, large; movie rating 1-5; Likert scale
• Can only be ranked. A rating scale is not like cm
• The difference between one pair of adjacent ratings is not necessarily the same as the difference between another pair

• Averaging a response of "strongly agree" (5) with two responses of "disagree" (2) would give us a mean of 3, but what is the meaning of that number?

Interval

• Interval data: ordinary measurement, e.g. temperature
• Unlike ordinal data, we can say the difference between 1 & 2 deg C is the same as the difference between 4 & 5 deg C

Ratio

• Ordinary measurements, but with an absolute, non-arbitrary zero point
• E.g. weight, length: any scale must start at zero
• deg C: not ratio, because 0 is arbitrarily set at the freezing point of water

Discrete & continuous variables

• Variables measured on interval & ratio scales are further identified as either:
– discrete: integers, no intermediate values. E.g. number of Smarties in a box
– continuous: measurable to any level of accuracy. E.g. weight of the contents of a box of Smarties

Frequency distributions

• We have a pile of scores
• Not all scores are equally likely
• How were the scores distributed?

• Subjects were timed (in sec) while completing a problem-solving task:
• 7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2

Stem & leaf

• Two components: the stem and the leaf
• In the problem-solving example, stem = ones, leaf = tenths
• Stems range between 5 and 9

• 7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2
  5|98
  6|821
  7|6347
  8|1182
  9|2
• Key: 9|2 means 9.2
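The display above can be rebuilt in a few lines of R. This is only a sketch: the names `times`, `stems`, `leaves` and `display` are illustrative and not from the slides, and it assumes every score has exactly one decimal place.

```r
# Problem-solving times (sec) from the slide
times <- c(7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2)

stems  <- floor(times)              # ones digit becomes the stem
leaves <- round(times * 10) %% 10   # tenths digit becomes the leaf

# Group the leaves by stem, keeping the original data order within each stem
display <- sapply(split(leaves, stems), paste, collapse = "")

# Print one "stem|leaves" row per stem
cat(paste0(names(display), "|", display), sep = "\n")
```

R's built-in `stem(times)` draws a similar display, although it sorts the leaves and may choose a different stem width.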

• Heights in cm: 154, 143, 148, 139, 143, 147, 153, 162, 136, 147, 144, 143, 139, 142, 143, 156, 151, 164, 157, 149, 146
• Put 2 digits in the stem; split each stem into leaves 0-4 and leaves 5-9
  13|969
  14|334323
  14|87796
  15|431
  15|67
  16|24
• Key: 13|6 means 136

• GSR values: 23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09
• Round each value to one decimal place
  23|3
  24|188
  25|0369
  26|33
  27|1
• Key: 23|3 means 23.3

Histogram

• Alternative way to look at a distribution
• It is like a version of stem-and-leaf turned 90 deg

Example

• Time to complete task (min):
• 8 2 6 12 9 14 1 7 7 9 11 8 12 10 5 7 10 9 10 11 4 8 2 11 10 11 13 13 14 11 13 10 12 13 5 16 11 17 10 6 13 11 5 9 12 14 8 2 12 4

• Sort scores into about 10 or so bins (similar to the stems in stem-and-leaf)
• Decide on sensible bins
• Count the number of observations in each bin (like the number of leaves on each stem in stem-and-leaf)
• This number in each bin is called the frequency

time     frequency
0-1      1
2-3      3
4-5      5
6-7      5
8-9      8
10-11    13
12-13    10
14-15    3
16-17    2
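As a rough check, the frequency table can be reproduced in R by assigning each whole-minute score to a two-minute bin. The names `bin`, `bin_names` and `freq` are illustrative, and the integer-division trick assumes whole-number scores.

```r
# Time to complete task (min), as listed on the slide
x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12, 10, 5, 7, 10, 9, 10, 11,
       4, 8, 2, 11, 10, 11, 13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17,
       10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)

# Integer division by 2 maps 0-1 to bin 0, 2-3 to bin 1, ..., 16-17 to bin 8
bin <- x %/% 2
bin_names <- paste(seq(0, 16, by = 2), seq(1, 17, by = 2), sep = "-")

# Count the observations in each bin, keeping empty bins if there are any
freq <- table(factor(bin, levels = 0:8, labels = bin_names))
print(freq)
```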

• This table is then used to make the histogram
• Histogram is a bar chart with frequency on the y axis and score on the x axis
• Sometimes done other ways, e.g. connect the dots (frequency distribution polygon)

[Figure: histogram of the task times; x axis: Time (min), y axis: Frequency]

in R

• x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12, 10, 5, 7, 10, 9, 10, 11, 4, 8, 2, 11, 10, 11, 13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17, 10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)
• hist(x)     # histogram
• stem(x)     # stem-and-leaf display
• boxplot(x)  # box plot

Probability distributions

• Histogram is an estimate of the true probability distribution
• Many theoretical probability distributions exist
• They are the basis of the statistical models used to make inferences about a population

Binomial distribution

• Binomial distribution is a discrete distribution
• The binomial distribution applies when:
– there is a series of n trials (e.g., 10 coin tosses)
– only 2 possible outcomes per trial
– outcomes are mutually exclusive (head or tail)
– outcome of each trial is independent of the others

• The binomial distribution gives the chance of getting each total number of ‘successes’ after doing all the (binary) trials of the experiment
• E.g. it gives the chance of getting 1, 2, or 3 girls after giving birth to 6 children
• p = p(success) = p(girl) = 0.5 each trial
• q = p(failure) = p(boy) = 1-p = 0.5
• n = number of trials = 6

• The probability distribution where n = 6 and the probability of each outcome is 0.5 on each trial looks like:

[Figure: binomial distribution for n = 6, p = 0.5; x axis: number of girls, y axis: probability]

• For any probability distribution, the y-axis is given by a formula
• For the binomial, it looks like this:

$$P(k) = \binom{n}{k} p^k q^{n-k}$$

• k successes in n trials; $\binom{n}{k}$ is the binomial coefficient
• You don’t need to know it
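For the girls example (n = 6, p = 0.5), the probabilities can be computed directly from the standard binomial formula, k successes in n trials with the binomial coefficient, and compared with R's built-in `dbinom()`. This is a sketch; the variable names are illustrative.

```r
n <- 6        # number of trials (children)
p <- 0.5      # p(success) = p(girl)
q <- 1 - p    # p(failure) = p(boy)
k <- 0:n      # every possible number of girls

# Binomial coefficient times p^k times q^(n - k)
prob_formula <- choose(n, k) * p^k * q^(n - k)

# The same probabilities from R's binomial density function
prob_builtin <- dbinom(k, size = n, prob = p)

print(data.frame(girls = k, probability = prob_formula))
```

The two vectors should agree, and the probabilities over all values of k sum to 1.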

Normal distribution

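• Continuous probability distribution
• Every probability distribution’s y-axis is given by a formula
• For the normal distribution, the y-axis (probability density) is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

• $\mu$ is the mean and $\sigma$ is the standard deviation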
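As an illustration, the standard normal density formula can be written as a small R function and checked against the built-in `dnorm()`. The function name `normal_density` and the grid of x values are assumptions made for this sketch.

```r
# Normal probability density with mean mu and standard deviation sigma
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x_grid <- seq(-3, 3, by = 0.5)

# Compare the hand-written formula with R's built-in density
print(all.equal(normal_density(x_grid), dnorm(x_grid)))
```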

Descriptive statistics

• We have a pile of scores
• Have made stem-and-leaf, histogram
• Want to summarise further: descriptive statistics

1. Centre (location)

• What is the ‘typical’ score? If you were to make a prediction for a new score, what would it be?

a) Mean (average)

• Mean = sum(x)/n

Mean as balance point

• Imagine that each observation is a toy block
• Place the blocks on a ruler; the position (1, 2, etc. inches) represents the value
• The balance point is the mean
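The balance-point idea can be checked numerically: distances to the left of the mean exactly offset distances to the right. A small sketch, using the scores 1 2 2 3 with illustrative variable names:

```r
blocks <- c(1, 2, 2, 3)          # positions of the blocks on the ruler

balance_point <- mean(blocks)    # sum(x)/n
deviations <- blocks - balance_point

print(balance_point)
print(sum(deviations))           # left and right distances cancel
```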

• 1 2 2 3: mean = 2
• 1 2 2 5: mean = 2.5
• 1 2 2 9: mean = 3.5
• Mean is pulled towards the extreme observation (outlier)

b) Median

• Median is the middle score; the 50th percentile
• Useful when extreme scores (outliers) lie in one tail of the distribution (skewed)

Calculate the median

• Sort scores
• If odd n, median is the middle value
• If even n, median is the mean of the 2 middle values
• 25 13 9 18 1 -> 1 9 13 18 25; median = 13
• 25 13 9 18 -> 9 13 18 25; median = (13+18)/2 = 15.5
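The rule above can be sketched as a short R function. The name `median_by_hand` is hypothetical; in practice R's built-in `median()` does the same job.

```r
median_by_hand <- function(scores) {
  s <- sort(scores)                  # step 1: sort the scores
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd n: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even n: mean of the 2 middle values
  }
}

print(median_by_hand(c(25, 13, 9, 18, 1)))  # odd n example from the slide
print(median_by_hand(c(25, 13, 9, 18)))     # even n example from the slide
```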

Median and outliers

• 1 2 2 3
• 1 2 2 5
• 1 2 2 9
• Median = 2 in all cases
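A quick comparison in R shows the contrast between the two measures on these three sets of scores. The list name `sets` is illustrative.

```r
# The three sets of scores; only the largest value changes
sets <- list(c(1, 2, 2, 3), c(1, 2, 2, 5), c(1, 2, 2, 9))

means   <- sapply(sets, mean)     # pulled upwards as the outlier grows
medians <- sapply(sets, median)   # unaffected by the outlier

print(rbind(mean = means, median = medians))
```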

c) Mode

• Mode is the most frequently occurring score
• Mean should really be used only for interval/ratio data. Mode is good otherwise
• E.g. mean movie rating: not really sensible. Mode is sensible
• Sometimes no unique mode exists (e.g. bimodal)
• Bimodality can be due to a mixture of two different populations (e.g. male and female)

• Time to complete task (min): Mean = 9.36, Median = 10, Mode = 11

[Figure: histogram of the task times; x axis: Time (min), y axis: Frequency]

• mean(x)
• median(x)
• Mode <- function(x) {
    ux <- unique(x)                          # the distinct scores
    ux[which.max(tabulate(match(x, ux)))]    # the score with the highest count
  }
• Mode(x)

Likert scale

• E.g. Brief Psychiatric Rating Scale (BPRS)
• Interview + observations of patient's behaviour over preceding 2–3 days
• Each item scored 0-7

• Suppose we have a new treatment
• Does it reduce anxiety?
• Define “anxiety” as score on Q2

• We use BPRS on lots of patients
• Compare treatment and placebo
• How? Find mean(treatment) vs mean(placebo)?

NO

• The numbers 0-7 are not really numbers!
• They have only rank (order) info
• Ordinal
• The “numbers” are really ordered labels: “normal”, “a bit anxious”, … , “extremely anxious”
• They lack a quantitative distance between them; calculating a mean level of anxiety for the group is not really appropriate
• It makes sense to find the mode: the most frequently occurring anxiety score
• It makes sense to find the median: the person in the middle of the group in terms of anxiety, with half the responses below and the other half above
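A minimal sketch of summarising ordinal ratings with the mode and median. The ratings below are invented purely for illustration and are not data from the BPRS or from any study; the variable names are also assumptions.

```r
# Hypothetical anxiety ratings on a 0-7 item (made-up values, for illustration only)
ratings <- c(2, 3, 3, 1, 5, 3, 4, 2, 6, 3, 4)

# Mode: tabulate every possible rating and pick the most frequent one
counts <- table(factor(ratings, levels = 0:7))
mode_rating <- as.integer(names(counts)[which.max(counts)])

# Median: the rating of the person in the middle of the group
median_rating <- median(ratings)

print(counts)
print(c(mode = mode_rating, median = median_rating))
```

Both summaries use only the order or the counts of the labels, so neither relies on the distances between ratings being equal.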

Example

• Family-Focused Treatment Versus Individual Treatment for Bipolar Disorder: Results of a Randomized Clinical Trial
• J. Consulting & Clinical Psychology, 2003, 71, 482–492

• “The psychiatrist made ratings of compliance on a 7-point Likert scale ranging from full compliance (1) to discontinued medication against medical advice (7)” p.486
• “On the whole, the participants were quite compliant with their medication, with at least 78% of the patients scoring within the compliant range at each assessment point” p.489
• The earlier description must contain a mistake: 1 is bad, 7 is good compliance

• For each 3-month follow-up period, participants were placed in one of the following clinical outcome categories:

(a) relapse, defined as a rating of 6 or 7 on the BPRS/SADS-C core symptoms of depression (depressed mood, loss of interest), mania (hostility, elevated mood, grandiosity), or psychosis (unusual thought content, suspiciousness, hallucinations, conceptual disorganization) and at least two ancillary symptoms (suicidality, guilt, sleep disturbance, appetite disturbance, lack of energy, negative evaluation, discouragement, increased energy activity), or

(b) nonrelapse, defined as a score of 5 or below on all relevant BPRS/SADS-C core symptoms during the 3-month interval


2. Spread (dispersion)

• Measure of centre (e.g. mean) tells what value we expect
• Measure of spread tells how close a value will typically be to the centre

a) Interquartile range

• Interquartile range (IQR) is the distance between Q1, the cutoff for the bottom 25% of scores, and Q3, the cutoff for the top 25% of scores

Quartiles

• Quartiles divide the data into quarters
• The median (Q2) divides the data into 2 piles (50% above, 50% below)
• Q1 is the cutoff below which fall the bottom 25% of scores
• Q3 is the cutoff below which fall the bottom 75%
– Q1 has 25% of scores below it, Q2 has 50% (i.e. it is the median), and Q3 has 75% of scores below it (25% above)

Finding quartiles

1. Sort the data
2. Find the median = Q2 = value that splits the data into two equal piles, half below it and half above
3. Q1 = median of lower half
4. Q3 = median of upper half
5. IQR = Q3 – Q1

• x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12, 10, 5, 7, 10, 9, 10, 11, 4, 8, 2, 11, 10, 11, 13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17, 10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)
• x <- sort(x); x
• 1 2 2 2 4 4 5 5 5 6 6 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 10 10 11 11 11 11 11 11 11 12 12 12 12 12 13 13 13 13 13 14 14 14 16 17

• n = 50
• Q2 = (x[25]+x[26])/2 = (10+10)/2 = 10
• Q1 = x[13] = 7
• Q3 = x[38] = 12
• IQR = Q3-Q1 = 12-7 = 5
• We expect scores near 10; the middle 50% of scores span 5 points (from 7 to 12)
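The hand calculation above can be sketched in R as follows. The names `lower`, `upper`, `q1`, `q2`, `q3` and `iqr` are illustrative. For an odd n this version leaves the middle score out of both halves, which is one common convention and an assumption here, since the slides only show an even n.

```r
x <- sort(c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12, 10, 5, 7, 10, 9, 10, 11,
            4, 8, 2, 11, 10, 11, 13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17,
            10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4))   # step 1: sort the data
n <- length(x)

q2 <- median(x)                     # step 2: median splits the data in two

lower <- x[1:(n %/% 2)]             # lower half of the sorted scores
upper <- x[(n - n %/% 2 + 1):n]     # upper half of the sorted scores

q1 <- median(lower)                 # step 3: median of lower half
q3 <- median(upper)                 # step 4: median of upper half
iqr <- q3 - q1                      # step 5: IQR = Q3 - Q1

print(c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr))
```

Statistical software uses several slightly different quartile definitions, so results from this method and from built-in functions can differ a little on other data sets.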

in R

• fivenum(x)
• 1 7 10 12 17
• = min, Q1, Q2, Q3, max
• IQR(x)
• 5
• boxplot(x)

b) Standard deviation

• Each point is some distance away from the mean
• Each distance from the mean is a deviation
• Deviation = score - mean

• Each deviation contributes to the spread of the data about the mean
• Is the total spread just the sum of the deviations, then?
• No. Mean is a balance point, so positive and negative deviations cancel out
• Can find a “sort of” average or “typical” deviation if we get rid of the signs

“Average” deviation

• Average deviation actually is zero because the signs cancel. Need to get rid of the signs
• Idea: square each deviation, average, then take the (positive) square root [RMS, root mean square]
• That is the standard deviation!

Calculating the SD

• Find the deviations
• Square them
• Find the average
• Take the square root to undo the squaring
• In symbols:

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

• Divide by N for a population, or by n-1 for a sample
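The steps can be followed one at a time in R, here using the task-time data from earlier in the lecture. The variable names are illustrative. R's built-in `sd()` divides by n-1, so it matches the sample version rather than the N version.

```r
x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12, 10, 5, 7, 10, 9, 10, 11,
       4, 8, 2, 11, 10, 11, 13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17,
       10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)
n <- length(x)

deviations <- x - mean(x)     # find the deviations (score - mean)
squared <- deviations^2       # square them to get rid of the signs

sd_N      <- sqrt(sum(squared) / n)         # average over N, then square root
sd_sample <- sqrt(sum(squared) / (n - 1))   # same, dividing by n-1

print(c(sd_N = sd_N, sd_sample = sd_sample, builtin = sd(x)))
```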

c) Variance

• Variance = SD squared
• Useful for ANOVA (ANalysis Of VAriance)

Likert scale

• These “numbers” are not really numbers
• Therefore cannot do operations like subtraction, division, sqrt
• Use IQR

Statistical Inference

• Usually we are interested in more than describing or summarising the numbers we have on hand
• E.g. have a sample, calculate the mean. What is the mean of the larger population?
• E.g. have done an experiment, means differ. Is this a fluke or “real”?

• The data we have on hand are samples from some (real or theoretical) population
• We want to make inferences about the population

Summary

• IV, DV
• Nominal, ordinal, interval, ratio
• Continuous, discrete
• Stem & leaf, histogram
• Probability distribution
• Mean, median, mode
• IQR, SD, variance
