Australian Council for Educational Research

Analysing and Understanding Learning Assessment for Evidence-based Policy Making

Introduction to statistics

Dr Alvin Vista, ACER

Bangkok, 14-18 Sept. 2015

Structure of workshop

• Lecture/presentation – focus on concepts, brief review
• Practical exercises – using the most common and accessible software (Excel); hands-on, to maximise transfer of knowledge and develop skill
• Discussion and interpretation of sample studies
• Collaborative setting

• If you don’t know how to do something, seek help.
• If you know how to do something, provide help.
• If you’re not sure, interact.

Drawing knowledge from reality

[Diagram: Measurement Theory, Data, Statistical Theory]

Statistics allows us to draw knowledge or conclusions from the data.
Measurement theory allows us to draw meaningful data from reality.

What is Measurement?

A formal definition:
‘Measurement may be regarded as the construction of homomorphisms (scales) from empirical relational structures of interest into numerical relational structures that are useful.’ (Krantz et al., 1971, p. 9)

In other words:
Measurement is a process where a variable (or construct) can be converted into a number in a consistent manner.

Presentation notes: Without proper measurement, our data would be meaningless and results from statistical analysis conducted on them would be invalid.

Data and Models

DATA
– Observations
– Measurements
– Sensory perceptions

MODELS
– Theories
– Interpretations
– Generalisations

Data and Models

DATA
– What you see (observed)

MODELS
– What Google Maps says

Data and Models

If there is a mismatch between data and model, which is more likely to be wrong?

Data cannot be changed but the methods to collect data can be improved to increase the quality of data.

Drawing knowledge from reality

[Diagram: Measurement Theory, Data, Statistical Theory]

We can do statistical analysis ONLY AFTER we’re confident that our data is reliable

Better data = better fitting models

Better fitting models = better understanding of reality

Fundamentals: Statistics

• Statistics is the study of data.
• It is concerned with the:
– Collection
– Analysis
– Presentation
– Interpretation
of data.

Presentation notes: What you see on this page are not strict definitions, but an attempt to discuss some very important fundamental notions. As a field of study, statistics is concerned with...

Data in education research

• In an educational context:
– records are usually students, schools, or parents
– variables are usually
  • responses of the students to the test items, or
  • responses of students, school principals, or parents to the questionnaire items

Data: values for variables & records

• In educational data:
– responses to a particular item from all respondents form the values for the corresponding variable
  • a column of values in our imaginary table
– responses from a particular respondent to all items form the values for the corresponding record
  • a row of values in our imaginary table

Levels of measurement

• Nominal: Denote a category; statistics include counts such as mode and frequency distributions

• Ordinal: Rank order is described but successive categories do not denote equal differences of the measured attribute; statistics include median

• Interval: Where the measurement is presumed to denote equal intervals between scores. Both the base point and unit of measurement are arbitrary

• Ratio: Note that ratio scales have a natural base value that cannot be changed (i.e., a zero in one unit means the same in all other units). Only the unit of measurement is arbitrary.

Non-metric --categorical measures which describe differences in type or kind; arithmetical operations are not applicable

Metric -- continuous measures which reflect differences in amount or degree

In a nutshell

Level of measurement has direct implications for how relationships within and between variables can be contained and identified

Presentation notes: This is why you need to think about your constructs, variables, and levels of measurement as part of the design of your study. For example, if you want to compare average performance on a particular ability across two groups, you will need to be able to calculate averages. That means that the level of measurement of your data will have to lend itself to that type of manipulation.

Levels of measurement and measures of distribution characteristics

Level | Central tendency | Spread
Nominal | Mode | Percent distribution
Ordinal | Median, Mode | Minimum/Maximum, Range, Percentiles, Percent distribution
Interval/Ratio | Mean, Median, Mode | Variance, Standard deviation, Minimum/Maximum, Range, Percentiles, Percent distribution

Measures of central tendency

Measure | Definition | Example
Mode | The attribute of a variable that occurs most often in the data set | Variable = Nationality; Mode = Indonesian
Median | The value of the middle case when the cases have been placed in order from low to high | Variable = Rank (1st, 2nd, 3rd, … 7th); Median = 5th
Mean | The arithmetic mean or average, computed by summing the values of all valid cases and dividing by the number of valid cases | Variable = Age; Mean = 24.35

Presentation notes: The mode is an attribute, not a frequency or a percent. A common mistake when identifying the mode from a frequency table is to report the largest frequency or the largest percent as the mode. The correct mode for the variable “belief in life after death” is “yes.” If a person reported the mode as 78.2% or as 230, she would be wrong! You only locate the largest valid percent or the largest frequency so that you can identify the attribute that is the mode.

Levels of measurement determine the possible statistical analyses

Nominal
– Cross-tabulations
– Chi-square
– Frequencies

INCOME by MARITAL STATUS cross-tabulation: count, with expected count in parentheses

INCOME | SINGLE | MARRIED | Total
1.00 | 7 (16) | 64 (55) | 71
2.00 | 3 (9) | 36 (30) | 39
3.00 | 5 (18) | 74 (61) | 79
4.00 | 13 (19) | 71 (65) | 84
5.00 | 7 (10) | 39 (36) | 46
6.00 | 11 (8) | 25 (28) | 36
7.00 | 17 (8) | 19 (28) | 36
8.00 | 12 (4) | 7 (15) | 19
9.00 | 12 (3) | 2 (11) | 14
10.00 | 12 (3) | 3 (12) | 15
Total | 99 (99) | 340 (340) | 439
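As a hedged illustration of where the expected counts and the chi-square statistic for the INCOME by MARITAL STATUS cross-tabulation on this slide come from, the following Python sketch recomputes them from the observed counts. The function name `chi_square` is an assumption made for illustration and is not part of the workshop materials.

```python
# Observed counts per INCOME category (1.00 to 10.00): (SINGLE, MARRIED).
observed = [
    (7, 64), (3, 36), (5, 74), (13, 71), (7, 39),
    (11, 25), (17, 19), (12, 7), (12, 2), (12, 3),
]

def chi_square(table):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, dof

expected, statistic, dof = chi_square(observed)
print([round(e) for e in expected[0]])  # expected counts for INCOME = 1.00
print(round(statistic, 2), dof)
```

The larger the gaps between observed and expected counts, the larger the statistic, which is then compared against a chi-square distribution with the printed degrees of freedom.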

Levels of measurement determine the possible statistical analyses

Ordinal
– Spearman correlations
– Non-parametric analyses

Interval and ratio
– Pearson correlations
– Parametric analyses

Presentation notes: Correlation is a measure of the relation between two or more variables. The measurement scales used for Pearson should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.

Spearman R (Siegel & Castellan, 1988) assumes that the variables under consideration are measured on at least an ordinal (rank order) scale, that is, that the individual observations can be ranked into two ordered series. Spearman R can be thought of as the regular Pearson product moment correlation coefficient, that is, in terms of proportion of variability accounted for, except that Spearman R is computed from ranks.
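To make the point that Spearman R is a Pearson correlation computed on ranks more concrete, here is a minimal Python sketch. The function names and the sample data are hypothetical and chosen only for illustration; ties are handled by assigning average ranks, which is one common convention.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked data."""
    return pearson(ranks(x), ranks(y))

hours = [1, 2, 3, 4, 5]        # hypothetical data
scores = [1, 4, 9, 16, 25]     # increases with hours, but not linearly
print(round(pearson(hours, scores), 3))
print(round(spearman(hours, scores), 3))
```

Because the relationship in the sample data is monotonic but curved, the rank-based coefficient is higher than the Pearson coefficient.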

Measures of central tendency

• Mode [=MODE(target range)]
• Median [=MEDIAN(target range)]
• Mean [=AVERAGE(target range)]

• Let’s try that using the sample data!
– TIMSS Country_X Grade 8 data.xlsx

Presentation notes: TIMSS country data: mode = nominal, median = rank. Which variables here are appropriate for these measures?
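For readers working outside Excel, the three measures of central tendency can be sketched with Python's standard `statistics` module. The variables and values below are hypothetical and only illustrate which measure suits which level of measurement.

```python
import statistics

# Hypothetical data, one variable per level of measurement.
nationality = ["Indonesian", "Thai", "Indonesian", "Lao", "Indonesian"]  # nominal
rank = [1, 2, 3, 4, 5, 6, 7]                                             # ordinal
age = [22, 24, 25, 23, 27]                                               # ratio

mode_value = statistics.mode(nationality)   # most frequent attribute
median_value = statistics.median(rank)      # middle case after ordering
mean_value = statistics.mean(age)           # sum divided by number of cases

print(mode_value, median_value, mean_value)
```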

Measures of spread

• Standard deviation [=STDEV(target range)]
• Min [=MIN(target range)]
• Max [=MAX(target range)]
• Percentiles [=PERCENTILE(target range, kth percentile)], where k ranges from 0.00 to 1.00

Presentation notes: What measure of central tendency is the equivalent of percentile .50?
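The measures of spread can be sketched in Python as follows. The `percentile` helper is a hypothetical function that interpolates linearly between sorted values, which is assumed here to mirror the behaviour of Excel's PERCENTILE; the scores are made-up data.

```python
import statistics

def percentile(values, k):
    """k-th percentile (k from 0.0 to 1.0), interpolating linearly between sorted values."""
    ordered = sorted(values)
    position = k * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

scores = [35, 42, 50, 58, 61, 67, 73, 80]   # hypothetical data

sd = statistics.stdev(scores)               # sample SD (n - 1 in the denominator)
lowest, highest = min(scores), max(scores)
value_range = highest - lowest

print(round(sd, 2), lowest, highest, value_range)
print(percentile(scores, 0.25), percentile(scores, 0.50), percentile(scores, 0.75))
```

In this sketch the 0.50 percentile coincides with the median of the same data.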

Characteristics of a distribution

Skewness: a measure of the asymmetry of a distribution. [=SKEW(target range)]

The normal distribution is symmetric and has a skewness value of zero.

– Positive skewness: a long right tail.
– Negative skewness: a long left tail.

Presentation notes: The relationship between mean, median, mode, and skew is only valid for unimodal and non-discrete distributions.

Characteristics of a distribution

Kurtosis: a measure of the extent to which observations cluster around a central point.
• For a normal distribution, the value of the kurtosis statistic is zero.
• [=KURT(target range)]

– Leptokurtic: data values are more peaked (positive kurtosis)
– Platykurtic: data values are flatter and more dispersed along the X axis (negative kurtosis)
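As a rough sketch of how skewness and kurtosis statistics of this kind can be computed, the Python functions below use the usual sample-adjusted formulas, which are assumed here to correspond to Excel's SKEW and KURT (the kurtosis version is the "excess" form, so a normal distribution scores zero). The function names and data are hypothetical.

```python
import statistics

def skew(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum of cubed standardised deviations."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    total = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

def kurt(values):
    """Sample excess kurtosis; zero for a normal distribution."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    total = sum(((x - mean) / sd) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * total - correction

symmetric = [1, 2, 3, 4, 5]           # hypothetical, evenly spread
right_tailed = [1, 1, 2, 2, 3, 10]    # hypothetical, one large value

print(round(skew(symmetric), 3), round(kurt(symmetric), 3))
print(round(skew(right_tailed), 3))
```

The symmetric data yield zero skewness and negative (platykurtic) kurtosis, while the data with one large value yield positive skewness, i.e. a long right tail.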

Measures of spread

• Frequencies [=FREQUENCY(target range, bins)], where each bin value defines a group containing the values above the previous bin value and up to and including that bin value
– Bins = 10, 20, 30 will result in 4 groups (number of bins + 1)
– Group 1 = less than or equal to 10
– Group 2 = 11 to 20
– Group 3 = 21 to 30
– Group 4 = more than 30
– Enter as an array formula: write the formula in the first cell of the output range, select an output range equal to the number of groups, press F2, then CTRL+SHIFT+ENTER
• Percent distribution can be computed by dividing frequencies by the total number of cases [=COUNTIF(target range, value)]

• Let’s try that using the sample data!
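The grouping rule described for frequencies (one group per bin value, plus one group for everything above the last bin) can be sketched in Python as below. The `frequency` function and the scores are hypothetical illustrations, not the workshop's data.

```python
from bisect import bisect_left

def frequency(values, bins):
    """Count values per group: one group per bin (value <= bin) plus one overflow group."""
    ordered_bins = sorted(bins)
    counts = [0] * (len(ordered_bins) + 1)
    for value in values:
        # bisect_left finds the first bin that is >= value;
        # values above every bin land in the final overflow group.
        counts[bisect_left(ordered_bins, value)] += 1
    return counts

scores = [5, 10, 11, 20, 21, 30, 31, 45]    # hypothetical data
counts = frequency(scores, [10, 20, 30])

# Percent distribution: each frequency divided by the total number of cases.
percents = [100 * c / len(scores) for c in counts]
print(counts, percents)
```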

Practical exercise!

• TIMSS Country_X grade 8.xlsx

• Complete the Descriptive statistics for Boys, Girls, and the whole sample

• Save your results as we will use them in later sessions.

Inferential statistics and Hypothesis testing

Bangkok, 14-18, Sept. 2015

Confidence intervals

Standard error of the mean
The standard error is an indicator of how precise the statistic is, and how close it is ‘probabilistically’ to the parameter (e.g., the true mean). Confidence intervals are based on the SE.

$$SE_{\bar{X}} = \frac{s}{\sqrt{n}}$$

$$CI_{\text{lower}} = \bar{X} - Z\,(SE_{\bar{X}})$$

$$CI_{\text{upper}} = \bar{X} + Z\,(SE_{\bar{X}})$$

Z = 1.96 corresponds to a 95% CI.

Presentation notes: Which one is more important to avoid? It depends on the situation. In criminal justice, a false negative is probably preferable: somebody said, “it’s better for ten guilty men to go free than for one innocent man to be convicted.” In fire alarms, a false positive is definitely preferable. Better to be inconvenienced than burned to death.

Confidence intervals

Confidence intervals

• Let’s try computing SEs and CIs with data!

• Means and SDs for boys and girls on Math achievement
• Standard error of the means
• Confidence intervals
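The standard error and confidence interval calculations can be sketched in Python as follows, using the standard error of the mean (sample SD divided by the square root of n) and Z = 1.96 for a 95% interval. The function name and the scores for boys and girls are hypothetical, not the TIMSS sample data.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Return (mean, standard error, (lower, upper)) for a Z-based confidence interval."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))   # SE = s / sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)

boys = [480, 510, 495, 530, 470, 505]     # hypothetical Math scores
girls = [500, 520, 515, 540, 490, 525]    # hypothetical Math scores

for label, group in (("Boys", boys), ("Girls", girls)):
    mean, se, (lower, upper) = mean_confidence_interval(group)
    print(label, round(mean, 1), round(se, 2), round(lower, 1), round(upper, 1))
```

With samples this small a t-based multiplier would normally be wider than 1.96; the sketch keeps Z only to match the formula on the slide.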

Inferential Statistics

Estimating population parameters
Inferential statistics can show how closely the sample statistics approximate parameters of the overall population.

• The sample is randomly chosen and representative of the total population.
• The means we might obtain from an infinite number of samples form a normal distribution.

Source: Johnson, B. & Christensen, L. (2012). Educational Research: Quantitative, Qualitative, and Mixed Approaches. Thousand Oaks, CA: Sage.

Inferential Statistics

What can we say if we have a sample, and its confidence interval does not overlap with the confidence interval of another sample?

Inferential Statistics

Testing Hypotheses (1)

Research hypothesis vs. statistical hypothesis

Statistical hypothesis testing: comparing the distribution of data collected by a researcher with an ideal, or hypothetical, distribution

- significance level/alpha (α): e.g., .05, .01; must be set before testing!
- “statistically significant” means there is sufficient evidence to reject the null hypothesis.
- it does NOT mean that the alternative hypothesis is true.

Presentation notes: A statistical hypothesis is a hypothesis that can be answered in only two ways: statistically significant or not. But rejecting the null hypothesis gives indirect evidence that the alternative hypothesis is tenable.

Statistical hypothesis testing

Testing Hypotheses (2)

Making errors in hypothesis testing

- Type I error: alpha error, or false positive (rejecting a null hypothesis that is true)

- Type II error: beta error, or false negative (failing to reject a null hypothesis that is false)

Relationship between α and β

Presentation notes: We can set the alpha, but it has a direct impact on the beta. The point where we set our alpha is the point where we conclude the data must come from the other distribution (i.e., there is a difference), but there is a small chance that it comes from the H0 distribution, hence a false positive. If we decrease the alpha on two fixed samples, we increase the beta. It is better to make our sample statistics more reliable (smaller standard errors).
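The trade-off between α and β can be illustrated numerically. The sketch below assumes a simplified one-sided z-test in which the true mean lies a fixed number of standard errors above the null-hypothesis mean; the function name and the effect sizes are hypothetical choices for illustration only.

```python
from statistics import NormalDist

def beta_for_alpha(alpha, effect_in_se):
    """Type II error rate for a one-sided z-test when the true mean lies
    `effect_in_se` standard errors above the null-hypothesis mean."""
    critical = NormalDist().inv_cdf(1 - alpha)          # cut-off under H0
    return NormalDist(mu=effect_in_se).cdf(critical)    # P(fail to reject | H1 true)

# Lowering alpha with the same samples raises beta.
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(beta_for_alpha(alpha, effect_in_se=2.0), 3))

# A smaller standard error makes the same difference worth more SE units,
# which lowers beta at a fixed alpha.
print(round(beta_for_alpha(0.05, effect_in_se=3.0), 3))
```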

Statistical hypothesis testing

Reality \ Our conclusions | Not guilty | Guilty
Not guilty | Correct conclusion | Type I error (false positive)
Guilty | Type II error (false negative) | Correct conclusion

Sometimes, false positive is worse

“It’s better for ten guilty men to go free than for one innocent man to be executed” -- anonymous wise lawyer

Reality \ Our conclusions | Not guilty | Guilty
Not guilty | Correct conclusion | Condemned innocent
Guilty | Freed guilty | Correct conclusion

In fire alarms, false positive is better

A false negative would REALLY suck. -- anonymous wise homeowner

Reality \ Our conclusions | No fire | Fire
No fire | Correct conclusion | Unnecessary panic
Fire | Burned to death | Correct conclusion

Presentation notes: Now, think of other situations where avoiding a Type I error is more important. What about situations where avoiding a Type II error is more important?

Statistical hypothesis testing

Testing for significance

• Remember: statistical significance is NOT the same as practical significance!

• In inferential statistics, we are only concerned with statistical significance. Practical significance is a judgment call to be made by the researcher and audience.

Statistical hypothesis testing

Making errors in hypothesis testing

Increase the power of a statistical test
• Use as large a sample size as is reasonably possible.
• Maximize the validity and reliability of your measures.
• Use parametric rather than non-parametric statistics whenever possible.

• Whenever we test more than one statistical hypothesis, we increase the probability of making at least one Type I error.
• For multiple hypotheses, a correction (e.g., Bonferroni) needs to be applied to our statistical tests.
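A short Python sketch shows why multiple tests inflate the chance of at least one Type I error and what a Bonferroni correction does about it. It assumes the tests are independent, and the function names are hypothetical.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / num_tests

alpha, num_tests = 0.05, 10
print(round(familywise_error_rate(alpha, num_tests), 3))   # uncorrected
corrected = bonferroni_alpha(alpha, num_tests)
print(corrected, round(familywise_error_rate(corrected, num_tests), 3))
```

With ten tests at .05 each, the chance of at least one false positive is about 40%; testing each hypothesis at .005 instead keeps the overall rate just under .05.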

Statistical hypothesis testing

Steps in Hypothesis Testing
1. State the null and alternative hypotheses.
2. Set the significance level before the research study. (Most educational research uses .05 as the significance level. Note that the significance level is also called the alpha level or, more simply, alpha.)
3. Obtain the probability value using a computer program such as SPSS.
4. Compare the probability value to the significance level and make the statistical decision.
   Rule 1: If the probability value is less than alpha, reject the null hypothesis. Conclude that the finding is significant.
   Rule 2: If the probability value is greater than alpha, fail to reject the null hypothesis. Conclude that the finding is not significant.
5. Interpret the results. That is, make a substantive real-world decision and determine practical significance.

Source: Johnson, B. & Christensen, L. (2012). Educational Research: Quantitative, Qualitative, and Mixed Approaches. Thousand Oaks, CA: Sage.
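The decision rule in step 4 can be written as a small Python function. This is only a sketch: the function name is hypothetical, and treating a probability value exactly equal to alpha as "fail to reject" is an assumption, since the two rules do not cover that boundary case.

```python
def statistical_decision(p_value, alpha=0.05):
    """Apply Rule 1 / Rule 2: compare the probability value with alpha."""
    if p_value < alpha:
        return "reject the null hypothesis (finding is significant)"
    # Assumption: p_value == alpha is grouped with Rule 2.
    return "fail to reject the null hypothesis (finding is not significant)"

print(statistical_decision(0.03))              # alpha set to .05 beforehand
print(statistical_decision(0.03, alpha=0.01))  # stricter alpha, same data
```

The same probability value can lead to different decisions depending on alpha, which is why the significance level has to be set before the study.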

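Statistical hypothesis testing

Exercise 1

Let’s compare two sets of data and test the hypothesis that their means are equal or different.

Simple t-test for independent means [=T.TEST(data range 1, data range 2, tails, type)], where tails = 2 gives a two-tailed test and type = 2 or 3 selects an independent-samples test assuming equal or unequal variances, respectively

Data: data check.xlsx
Sets to compare: Science scores for sets A vs B, and B vs D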
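To show what an independent-means t-test computes, the Python sketch below calculates the t statistic and degrees of freedom for the unequal-variance (Welch) version, which corresponds to type 3 in Excel's T.TEST. The function name and the two data sets are hypothetical stand-ins, not the contents of data check.xlsx, and the sketch stops short of the probability value, which requires the t distribution (available in Excel, SPSS, or a statistics library).

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for an unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean: sample variance / n.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof

set_a = [1, 2, 3, 4, 5]    # hypothetical scores
set_b = [2, 3, 4, 5, 6]    # hypothetical scores
t, dof = welch_t(set_a, set_b)
print(round(t, 3), round(dof, 1))
```

The further the t statistic is from zero relative to its degrees of freedom, the smaller the probability value that is then compared with alpha.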

Statistical hypothesis testing

Exercise 2

Let’s compare the science achievement of students whose parents completed only lower secondary (PHEL=4) vs those whose parents have a university degree (PHEL=1), and test the hypothesis that their means are equal or different.

Simple t-test for independent means [=T.TEST(data range 1, data range 2, tails, type)], where tails = 2 gives a two-tailed test and type = 2 or 3 selects an independent-samples test assuming equal or unequal variances, respectively

Data: data check.xlsx
Sets to compare: Science scores for sets PHEL=4 and PHEL=1
