MEASUREMENT CHARACTERISTICS
Error & Confidence
Reliability, Validity, & Usability
ERROR & CONFIDENCE
Reducing error
All assessment scores contain some error
The goal is to minimize error so scores are accurate
Protocols & periodic staff training/retraining help reduce error

Increasing confidence
Confidence that results lead to correct placement
Comes from assessments that produce valid, reliable, & usable results
ASSESSMENT RESULTS
Norm-referenced
Individual’s score is compared to others in their peer/norm group
Example: school tests reported against a norm group (e.g., the 95th percentile)
The norm group needs to be representative of the test takers the test was designed for
ASSESSMENT RESULTS
Criterion-referenced
Individual’s score is compared to a preset standard or criterion
The standard doesn’t change based on the individual or group
Example: A = 250-295 points
VALIDITY
Describes how well the assessment results match their intended purpose
Are you measuring what you think you are measuring?
Relationship between program & assessment content
An assessment does not have validity for all purposes, populations, or times
VALIDITY
Depends on different types of evidence
Is a matter of degree (no tool is perfect)
Is a unitary concept (a change from past practice)
The former “types” of validity are now considered forms of evidence, e.g., content validity is now content-related evidence
FACE VALIDITY
Not listed in the text
Do the items appear, on the surface, to fit what is being measured?
CONTENT VALIDITY (Content-related evidence)
How well does the assessment measure the subject or content?
Representativeness
Completeness: covers all major areas
Nonstatistical
Established through review of the literature or expert opinion
A blueprint of the major components
Per Austin (1991), the minimum requirement for any assessment
CRITERION-RELATED VALIDITY (Criterion-related evidence)
Comparison of results with an outside criterion
Statistical
Reported as a validity or correlation coefficient
Ranges from +1 to -1 (±1 is a perfect relationship)
0 = no relationship
r = .73 is better than r = .52
r = ±.40 to ±.70 is the acceptable range
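A short Python sketch of how a criterion-related validity coefficient can be computed as the Pearson correlation between assessment scores and an outside criterion. The scores are invented, and statistics.correlation requires Python 3.10+.

```python
# Illustrative sketch only: validity coefficient as a Pearson correlation.
from statistics import correlation  # Python 3.10+

assessment_scores = [42, 55, 61, 48, 70, 66, 52, 59]
criterion_scores  = [40, 58, 65, 45, 72, 60, 50, 62]  # e.g., later outcome ratings

r = correlation(assessment_scores, criterion_scores)
print(f"validity coefficient r = {r:.2f}")
# Interpretation per the slide: ±.40 to ±.70 is the acceptable range,
# and r = .73 indicates a stronger relationship than r = .52.
```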
CRITERION-RELATED VALIDITY (Criterion-related evidence)
Coefficients of .30 to .40 may be used if statistically significant
If validity is reported, it is generally criterion-related validity
Two types: predictive & concurrent
PREDICTIVE VALIDITY
The ability of an assessment to predict future behaviors or outcomes
Measures are taken at different times
Examples: ACT or SAT scores & success in college; a leisure satisfaction score predicting outcomes at discharge
CONCURRENT VALIDITY
More than one instrument measures the same content
The desire is to predict one set of scores from another set taken at the same (or nearly the same) time and measuring the same variable
CONSTRUCT VALIDITY (Construct-related evidence)
Theoretical/conceptual
Content & criterion-related validity contribute to construct validity
Research on the conceptual framework on which the assessment is based also contributes to construct validity
Not demonstrated in a single project or statistical measure
Few TR assessments have it: their focus is behavior, not a construct
CONSTRUCT VALIDITY (Construct-related evidence)
Factor analysis
Convergent validity (evidence of what it measures)
Divergent validity (evidence of what it does not measure)
Expert panels are used here too
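A hypothetical Python sketch of convergent vs. divergent evidence using item correlations: items meant to tap the same construct should correlate highly with each other and only weakly with items measuring something else. All item data are invented, and statistics.correlation requires Python 3.10+.

```python
# Items A1/A2 are intended to measure the same construct; item B1 measures a different one.
from statistics import correlation

a1 = [3, 4, 5, 2, 4, 5, 3, 4]
a2 = [3, 5, 5, 2, 4, 4, 3, 5]   # should correlate highly with a1 (convergent)
b1 = [2, 5, 3, 3, 2, 4, 5, 4]   # should correlate weakly with a1 (divergent)

print(f"convergent evidence: r(A1, A2) = {correlation(a1, a2):.2f}")  # expect high
print(f"divergent evidence:  r(A1, B1) = {correlation(a1, b1):.2f}")  # expect low
```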
THREATS TO VALIDITY
The assessment should be valid for its intended use (e.g., research instruments)
Unclear directions
Unclear or ambiguous terms
Items at an inappropriate level for the subjects
Items not related to the construct being measured
THREATS TO VALIDITY
Too few items
Too many items
Items with an identifiable pattern of response
Method of administration
Testing conditions
Subjects’ health, reluctance, & attitudes
See Stumbo, 2002, pp. 41-42
VALIDITY
You can’t get valid results without reliable results, but you can get reliable results without valid results
Reliability is a necessary but not sufficient condition for validity
See Stumbo, 2002, p. 54
RELIABILITY
Accuracy or consistency of a measurement
Reproducible results
Statistical in nature
r = between 0 & 1 (with 1 being perfect)
Should not be lower than .80
Tells what portion of the variance is non-error variance
Increases with the length of the test & the spread of scores
STABILITY (Test-retest)
How stable is the assessment over time?
Results should not be overly influenced by the passage of time
The same group is assessed twice with the same instrument & the results of the two testings are correlated
Are the two sets of scores alike?
Consider time effects (longer or shorter intervals between testings)
EQUIVALENCY (Equivalent forms)
Also known as parallel-form or alternative-form reliability
How closely correlated are two or more forms of the same assessment?
Two forms have been developed & demonstrated to measure the same construct
The forms have similar, but not identical, items (e.g., the NCTRC exam)
Short & long forms are not equivalent forms
INTERNAL CONSISTENCY
How closely are the items on the assessment related to one another?
Split-half methods: 1st half vs. 2nd half, odd/even items, matched random subsets
If the test can’t be divided: Cronbach’s alpha, Kuder-Richardson, Spearman-Brown’s formula (see the sketch below)
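A minimal Python sketch of two internal-consistency estimates named on this slide, using made-up item-level data (rows are respondents, columns are items scored 1-5); statistics.correlation requires Python 3.10+.

```python
from statistics import variance, correlation

items = [
    [4, 4, 5, 3, 4],
    [2, 3, 2, 2, 3],
    [5, 4, 5, 4, 5],
    [3, 3, 4, 3, 3],
    [4, 5, 4, 4, 4],
    [1, 2, 2, 1, 2],
]

k = len(items[0])                                   # number of items
totals = [sum(row) for row in items]                # total score per respondent
item_vars = [variance([row[i] for row in items]) for i in range(k)]

# Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))
alpha = k / (k - 1) * (1 - sum(item_vars) / variance(totals))
print(f"Cronbach's alpha = {alpha:.2f}")

# Split-half (odd vs. even items), corrected to full length with Spearman-Brown
odd  = [sum(row[0::2]) for row in items]
even = [sum(row[1::2]) for row in items]
r_half = correlation(odd, even)
r_full = 2 * r_half / (1 + r_half)                  # Spearman-Brown correction
print(f"split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```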
INTERRATER RELIABILITY
Percentage of agreements relative to the number of observations
Note the difference between agreement & accuracy
Raters are compared to each other
80% agreement is the usual benchmark
INTERRATER RELIABILITY
Simple agreement: based on the number of agreements & disagreements
Point-to-point agreement: takes each data point into consideration
Percentage of agreement for the occurrence of the target behavior
Kappa index (corrects agreement for chance; see the sketch below)
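An illustrative Python sketch of simple percent agreement and Cohen’s kappa for two raters coding the same 10 observations as “occurred” (1) or “did not occur” (0). The ratings are invented.

```python
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
print(f"simple agreement = {observed:.0%}")      # compare against the 80% benchmark

# Cohen's kappa corrects observed agreement for agreement expected by chance.
p_a1 = sum(rater_a) / n                          # proportion of "occurred" codes, rater A
p_b1 = sum(rater_b) / n                          # proportion of "occurred" codes, rater B
expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
kappa = (observed - expected) / (1 - expected)
print(f"kappa = {kappa:.2f}")
```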
INTRARATER RELIABILITY
Not in the text
The rater’s ratings are compared with their own ratings
RELIABILITY
Assessment manuals often give this information
High reliability doesn’t indicate validity
Generally, a longer test has higher reliability because it lessens the influence of chance or guessing (see the sketch below)
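A brief Python sketch of why length matters, using the Spearman-Brown prophecy formula to project reliability when a test is lengthened or shortened by a factor k. The starting reliability of .70 is an invented example value.

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Projected reliability of a test lengthened (or shortened) by factor k."""
    return k * reliability / (1 + (k - 1) * reliability)

r_original = 0.70
for k in (0.5, 1, 2, 3):      # half, same, double, triple the number of items
    print(f"length x{k}: projected reliability = {spearman_brown(r_original, k):.2f}")
```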
FAIRNESS
Reduction or elimination of undue bias related to language, ethnic or racial background, & gender
Items should be free of stereotypes & biases
Beginning to be a concern for TR
USABILITY & PRACTICALITY
Nonstatistical
Is this tool better than any other tool on the market, or one I could design myself?
Consider time, cost, staff qualifications, ease of administration, scoring, etc.