
DEVELOPMENT AND EVALUATION OF SCALES/INSTRUMENTS IN

PSYCHIATRY

DR. PAWAN SHARMA

[email protected]

• INTRODUCTION

• STEPS FOR DEVELOPMENT OF SCALES

• WHOQOL-100: AN EXAMPLE


INTRODUCTION

• The term 'rating scale' was originally used to define a series of items which quantified, or placed in rank order, the manifestations of a single variable, e.g. aggressiveness

Hamilton, 1976

• Psychological testing requires a person to perform a behavior in order to measure some personal attribute, trait, or characteristic, or to predict an outcome

INTRODUCTION

Intelligence test

• Binet-Simon scale

• Ability in global areas like verbal comprehension, performance, reasoning etc.

Aptitude

• Capability in specific task or skill

• Supplement global intelligence tests

• Vocational test

• General aptitude battery

Achievement

• Degree of learning, success

• Civil services

• Resemble intelligence tests more closely

• SLD batteries

Personality test:

• Trait, quality or behavior

• Self report inventories, check lists

• Performance or situational test

• Projective technique

INTRODUCTION

• In terms of user

– Self rating

– Observer rating

• Form of items

– Graded items- degree of severity

– Checklist- present or absent

– Forced choice items - choose between 2 alternatives, whichever is applicable

• Content

– Behavior

– Social adjustment

– Functional capacity etc

INTRODUCTION

The most important scale types, by function in the clinical setting

1. Intensity scales – severity and response to treatment : BPRS, PANSS

2. Prognostic scales- prediction of response to treatment: Strauss and Carpenter Prognostic Scale

3. Scales for selection of treatment by means of differential indicators: Conners' Rating Scale for ADHD

4. Scales for diagnosis and classification: IQ scales

Hamilton, 1976

• INTRODUCTION

• STEPS FOR DEVELOPMENT OF SCALES

• WHOQOL-100: AN EXAMPLE

SCALE DEVELOPMENT

Defining the test

Selecting a scaling method

Constructing the items

Testing the items

Revising the test

Publishing the test

DEFINING THE TEST

• Over the years, thousands of scales and tests have been developed

• Clear idea of what the test or scale is supposed to measure (purpose)

• How is the new test different from existing ones, and what contribution does it make to the field?

SELECTING A SCALING METHOD

Representative scaling methods

1. Expert ranking: Glasgow Coma Scale for scaling the depth of coma

– A panel of neurologists listed patient behaviors associated with different levels of consciousness

2. Method of equal-appearing intervals (Thurstone scaling approach)

– For an attitude scale, collect as many true/false statements as possible

– Known judges or experts rate the favorability of each statement, usually into one of 11 categories (1 to 11), from extremely favorable to extremely unfavorable

– The mean favorability rating and standard deviation for each item are determined

– Items with a large standard deviation are dropped, as sketched below
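A minimal sketch of the Thurstone item-selection step; the judge ratings and the SD cut-off are both invented for illustration:

```python
import statistics

# Hypothetical favorability ratings (1-11) given by five judges to three items.
judge_ratings = {
    "Item A": [9, 10, 9, 8, 10],   # judges agree: high scale value
    "Item B": [2, 3, 2, 1, 2],     # judges agree: low scale value
    "Item C": [1, 6, 11, 3, 9],    # judges disagree: ambiguous item
}

SD_CUTOFF = 2.0  # illustrative threshold for a "large" standard deviation

for item, ratings in judge_ratings.items():
    scale_value = statistics.mean(ratings)   # the item's scale value
    spread = statistics.stdev(ratings)       # disagreement among judges
    verdict = "drop (ambiguous)" if spread > SD_CUTOFF else "retain"
    print(f"{item}: scale value = {scale_value:.1f}, SD = {spread:.2f} -> {verdict}")
```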

SELECTING A SCALING METHOD

Method of absolute scaling:

• Used in aptitude testing and group achievement testing

• Measure of absolute item difficulty based upon results of different age groups of test takers

• Administration of common set of questions to two or more age groups

• The relative difficulty of these items between any two age groups serves as a basis for making a series of interlocking comparisons

• One age group as an anchor

• Item difficulty is measured in common units, namely SD units of ability for the anchor group

SELECTING A SCALING METHOD

Likert scaling

• Proposed by Likert (1932)

• Presents the examinee with 5 responses ordered on an agree/disagree or approve/disapprove continuum

• A score of 5 is assigned to one extreme response and 1 to the opposite extreme

• The total scale score is obtained by adding the scores of the individual items, hence the name summative scale (a scoring sketch follows)
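A minimal sketch of summative Likert scoring; the items and responses are hypothetical, and the reverse-keyed item reflects common practice rather than anything stated on the slide:

```python
# Summative (Likert) scoring for a hypothetical 4-item attitude scale.
# Responses are coded 1-5; reverse-keyed items are flipped before summing.
REVERSED_ITEMS = {2}  # hypothetical: the third item is worded in the opposite direction

def likert_total(responses, n_points=5):
    total = 0
    for i, r in enumerate(responses):
        total += (n_points + 1 - r) if i in REVERSED_ITEMS else r
    return total

# A consistently favorable respondent: note the 1 on the reversed item.
print(likert_total([5, 4, 1, 5]))  # -> 19
```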

SELECTING A SCALING METHOD

Guttman scale:

• Respondents who endorse one statement also agree with the milder statement pertinent to the same underlying continuum

• Produced by selecting items that fall into an ordered sequence of examinee endorsement

• Perfect Guttman scale is seldom achieved – errors of measurement

• Example : Beck Depression Inventory

• I occasionally feel sad or blue

• I often feel sad or blue

• I feel sad or blue most of the time

• I always feel sad and I can't stand it

A client who endorses the last statement almost certainly agrees with the milder ones (see the sketch below)
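How closely real responses fit this cumulative pattern is conventionally summarized by the coefficient of reproducibility; a minimal sketch with hypothetical response patterns (the ≥ .90 convention is a standard rule of thumb, not from the slide):

```python
# Coefficient of reproducibility for a hypothetical 4-item Guttman scale.
# Items are ordered from mildest to most severe; a "perfect" pattern endorses
# every item up to some point and none beyond it.
patterns = [
    [1, 1, 1, 0],  # perfect cumulative pattern
    [1, 1, 0, 0],  # perfect
    [1, 0, 1, 0],  # one error: endorses a severe item but not a milder one
]

def guttman_errors(pattern):
    # Minimum number of deviations from any perfect cumulative pattern.
    return min(
        sum(resp != (1 if i < k else 0) for i, resp in enumerate(pattern))
        for k in range(len(pattern) + 1)
    )

total_errors = sum(guttman_errors(p) for p in patterns)
n_responses = len(patterns) * len(patterns[0])
cr = 1 - total_errors / n_responses
print(f"coefficient of reproducibility = {cr:.2f}")  # >= .90 is conventionally acceptable
```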

SELECTING A SCALING METHOD

Method of Empirical Keying:

• Test items are selected based on how well they distinguish a criterion group from a normative sample

A pool of persons experiencing Major Depression is gathered to answer a pool of true/false questions

The endorsement frequency of the depression group is compared with that of the normative group

Items showing a large difference in endorsement frequency are selected for the depression scale and keyed in the direction favored by the depressed subjects

The raw score for depression is the number of items answered in the keyed direction (see the sketch below)
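A minimal sketch of empirical keying; the items, endorsement rates, and the 0.25 cut-off are all hypothetical:

```python
# Empirical keying sketch: endorsement ("true") rates for three hypothetical
# true/false items in a depression group vs. a normative sample.
endorsement = {                     # item: (depression rate, normative rate)
    "I sleep poorly":       (0.80, 0.30),
    "I like sunny days":    (0.55, 0.60),
    "I enjoy loud parties": (0.20, 0.65),
}

THRESHOLD = 0.25  # illustrative cut-off for a "large" difference

for item, (dep, norm) in endorsement.items():
    diff = dep - norm
    if abs(diff) >= THRESHOLD:
        key = "True" if diff > 0 else "False"  # keyed toward the depression group
        print(f"select '{item}', keyed {key} (diff = {diff:+.2f})")
    else:
        print(f"reject '{item}' (diff = {diff:+.2f})")
```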

SELECTING A SCALING METHOD

Rational Scale construction (Internal consistency):

• Popular method for self-report personality inventories

• Rationale: all scale items should correlate positively with each other and with the total score

• Start with a review of the literature, e.g. to scale a characteristic like leadership

• Write true/false statements

• Administer them to a large sample of individuals similar to the target sample

• Items with weak or negative correlations with the total score are discarded

• Item-total correlations are then recalculated to verify the homogeneity of the remaining items (see the sketch below)
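A minimal sketch of the item-total step, using hypothetical true/false data; the "corrected" correlation (excluding each item from its own total) is a standard refinement not spelled out on the slide:

```python
import statistics

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical responses: rows are persons, columns are four true/false items (0/1).
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

for j in range(len(data[0])):
    item = [row[j] for row in data]
    # Corrected item-total correlation: exclude the item from its own total.
    rest = [sum(row) - row[j] for row in data]
    r = pearson_r(item, rest)
    flag = "  -> discard" if r <= 0 else ""
    print(f"item {j + 1}: corrected item-total r = {r:+.2f}{flag}")
```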

CONSTRUCTING THE ITEMS

Initial questions:

• Homogeneity vs heterogeneity:

– Depends on how the test developer has defined the new instrument

– For example, a culture-reduced test of general intelligence will have varied items, whereas a theory-based test of spatial thinking will have homogeneous items

• Range of difficulty :

– Meaningful differentiation of examinees of both extremes

– A graded series ranging from very easy items passed by nearly everyone to difficult items passed by virtually no one

CONSTRUCTING THE ITEMS

• Ceiling effect

– A significant number of examinees obtain near-perfect scores

– No distinction can be made between high-scoring individuals even though there might be substantial differences in the underlying trait

• Floor effect:

– A significant number of examinees obtain scores at or near the bottom

– Example: the WAIS-R had a serious floor effect, so discrimination between moderate, severe, and profound levels of intellectual disability is difficult

CONSTRUCTING THE ITEMS

Table of specification

• Enumerates the tasks on which examinees are to be assessed

• Content-by-process matrix: lists the exact number of items in the relevant content areas and details the precise composition of items

Hypothetical 100-item Science Achievement Test (content-by-process matrix)

Content area    Factual knowledge    Information competence    Inferential reasoning
Astronomy               8                      3                         3
Botany                  6                      7                         2
Chemistry              10                      5                         4
Geology                10                      5                         2
Physics                 8                      5                         6
Zoology                 8                      5                         3
Total                  50                     30                        20

CONSTRUCTING THE ITEMS

ITEM FORMAT

1. Multiple choice methodology:

• Quick and objective

• Measure conceptual as well as factual knowledge

• Scoring fairness can be demonstrated

• But writing good distractor options is difficult

• The presence of response options may cue a half-knowledgeable respondent to the correct answer

CONSTRUCTING THE ITEMS

2. Matching Questions :

• Good in classroom testing but has serious psychometric shortcomings

• Responses are not independent

• Missing one match compels the examinee to miss another

• Options must be closely related

CONSTRUCTING THE ITEMS

3. Short answer Objective item :

– Individually administered test

– The simplest and most straightforward type of question

– Offers the best reliability and validity

4. True/False Questions:

– Useful in personality tests

– Easy to understand and simple to answer

• Socially desirable responses can be minimized by "forced-choice methodology",

e.g. choosing between 2 equally desirable or undesirable options:

Which would you do? a. Mop the floor b. Volunteer for half a day

TESTING THE ITEMS

Reliability:

• Consistency of the score

– Reexamining with same test on different occasions

– Different sets of equivalent items

– Under other variable examining conditions

• Concerned with the degree of consistency or agreement between two independently derived sets of scores

• Can be expressed in terms of correlation coefficient (0 to +1)

– Degree of correspondence or relationship between two sets of scores

– Sometimes negative, as when time scores are correlated with amount scores (e.g. time taken versus the number of mathematics problems solved)

TESTING THE ITEMS

1. Test-Retest reliability

• The reliability coefficient is simply the correlation between the scores obtained by the same persons on two administrations of the same test

• The interval between the two administrations should always be specified

• Difficulties: practice leads to improvement, recall

• Best for sensory discrimination or motor test

TESTING THE ITEMS

2. Alternate form reliability:

• Using alternate form of test

• Correlation between the two test scores

• Temporal stability and consistency of response to items

3. Split-Half Reliability

• Two scores are obtained for each individual by dividing the test into two halves

• Also called coefficient of internal consistency
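A minimal sketch of split-half reliability with hypothetical 0/1 item data; the Spearman-Brown correction, a standard companion to the split-half method, estimates full-length reliability from the half-test correlation:

```python
import statistics

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical 6-item test (0/1 scores), split into odd- and even-numbered halves.
data = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1],
]

odd_half  = [sum(row[0::2]) for row in data]  # items 1, 3, 5
even_half = [sum(row[1::2]) for row in data]  # items 2, 4, 6

r_half = pearson_r(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction to full length
print(f"half-test r = {r_half:.2f}, corrected full-test reliability = {r_full:.2f}")
```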

TESTING THE ITEMS

4. Kuder-Richardson Reliability and Coefficient Alpha

• Examination of each item in the test

• Single administration and single test

5. Scorer Reliability:

• Used when a good deal of judgment enters into scoring, as in projective testing

Error of measurement: the margin of error to be expected in an individual score as a result of the unreliability of the test

A reliability coefficient of .85 means that 85% of the variance in test scores depends on true variance in the trait and 15% on error variance (a single-administration sketch of coefficient alpha follows)
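A minimal sketch of coefficient alpha from a single administration, using hypothetical dichotomous data; with 0/1 items, alpha reduces to KR-20:

```python
import statistics

# Coefficient alpha from a single administration; with 0/1 (dichotomous)
# items as here, alpha is equivalent to KR-20.
data = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

k = len(data[0])
item_variances = [statistics.pvariance([row[j] for row in data]) for j in range(k)]
total_variance = statistics.pvariance([sum(row) for row in data])
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(f"coefficient alpha = {alpha:.2f}")
```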

TESTING THE ITEMS

Item reliability index

• Point biserial correlation coefficient

• The higher the correlation between an item and the total score, the more useful the item from the standpoint of internal consistency

• For dichotomous items, the closer the item comes to a 50-50 split of right and wrong answers, the greater its standard deviation

• The product of these two indices, the correlation and the standard deviation, is the item reliability index

• An item with a high reliability index therefore has both high internal consistency and a high SD

• Items with a low reliability index can be discarded (see the sketch below)
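A minimal sketch of the item reliability index with hypothetical item and total scores:

```python
import statistics

def point_biserial(item, totals):
    # Pearson r between a dichotomous item (0/1) and the continuous total score.
    mi, mt = statistics.mean(item), statistics.mean(totals)
    num = sum((a - mi) * (b - mt) for a, b in zip(item, totals))
    den = (sum((a - mi) ** 2 for a in item) * sum((b - mt) ** 2 for b in totals)) ** 0.5
    return num / den

item   = [1, 1, 0, 1, 0, 0]          # hypothetical right/wrong scores on one item
totals = [25, 22, 12, 20, 14, 10]    # hypothetical total test scores

r_pb = point_biserial(item, totals)
sd_item = statistics.pstdev(item)     # largest when the pass rate is near .50
print(f"r_pb = {r_pb:.2f}, item SD = {sd_item:.2f}, "
      f"item reliability index = {r_pb * sd_item:.2f}")
```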

TESTING THE ITEMS

VALIDITY

1. Content validity:

• Systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured

• Important in achievement tests; less so in personality or aptitude tests

• Example: coverage of the different domains of intelligence in an IQ questionnaire

• Another form of validity often confused with content validity is face validity

• Face validity is what the test appears superficially to measure, e.g. a scale designed for children might have low face validity for adults

TESTING THE ITEMS

2. Criterion validity:

• Effectiveness of a test in predicting an individual’s performance in specified activities

– Concurrent: diagnosis or existing status

– Predictive: prediction for future e.g. selection or hiring of new personnel

• Compares the test with other measures or outcomes (the criteria) already held to be valid

TESTING THE ITEMS

3. Construct validity:

• The extent to which the test may be said to measure a theoretical construct or trait like intelligence, depression, psychopathology

• Test must correlate with other variables or tests with which it shares an overlap of constructs- convergent validity

• Test must not correlate with the variables from which it should differ-discriminant validity

Construct validity is quantified by the validity coefficient, the correlation between the test score and the criterion measure.

Error of estimate: the margin of error to be expected in a predicted criterion score as a result of the imperfect validity of the test (see the sketch below)
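A minimal sketch of the error of estimate, with an illustrative validity coefficient and criterion SD:

```python
import math

# Error of estimate from a validity coefficient (illustrative numbers).
validity_r = 0.60     # correlation between test score and criterion measure
criterion_sd = 10.0   # standard deviation of the criterion

se_estimate = criterion_sd * math.sqrt(1 - validity_r ** 2)
print(f"standard error of estimate = {se_estimate:.1f}")  # -> 8.0
```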

TESTING THE ITEMS

Item validity index

• To identify predictively useful test items

• Compute point biserial correlation between the item score and score on the criterion variable

• The higher the value, the more useful the item from the standpoint of item validity

• The item validity index is the product of the item SD and the point-biserial correlation

TESTING THE ITEMS

Item difficulty index

• The proportion of examinees in a large tryout sample who get the item correct

• Varies from 0.0 to 1.0

• If the index is 0.0, no examinee answers the item correctly, so the item is psychometrically unproductive; the same holds for an index of 1.0, which everyone passes

• The index should generally hover between .3 and .7

• In a true/false or multiple-choice test a difficulty index of 0.5 can result from guessing alone, so the optimal index is closer to 0.75, roughly halfway between chance and 1.0 (see the sketch below)
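A minimal sketch of the difficulty index with hypothetical right/wrong data; the guessing-adjusted optimum, roughly halfway between chance level and 1.0, is a standard rule of thumb:

```python
# Item difficulty index: proportion of a tryout sample answering correctly.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical right/wrong data

p = sum(responses) / len(responses)
print(f"difficulty index p = {p:.2f}")        # 0.70: within the .3-.7 band

# With guessing, the optimum sits roughly halfway between chance and 1.0:
chance_rate = 0.5                             # e.g. a true/false item
optimum = (1 + chance_rate) / 2               # -> 0.75
print(f"guessing-adjusted optimum = {optimum:.2f}")
```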

TESTING THE ITEMS

Item Characteristic curves

• Graphical display of relationship between the probability of correct response and the examinee’s position on the underlying trait measured by the test

• Used for identifying items that perform differently for subgroups of examinees

• Example: a sex-biased question involving football facts – for men the desired slope, but for women a flat curve; the items that perform differently can be eliminated

TESTING THE ITEMS

[Figure: item characteristic curves for three items (a, b, c), plotting the probability of a correct response against ability level]

TESTING THE ITEMS

• An ideal item is one that most of the high scorers pass and most of the low scorers fail

• An ideal item has an ogive-shaped characteristic curve

• But visual inspection is not completely objective

Item discrimination index:

U = number of examinees in the upper range who answered the item correctly

L = number of examinees in the lower range who answered the item correctly

N = total number of examinees in the upper (or lower) range

d = (U − L) / N, ranging from −1 to +1

d = 0: the item cannot discriminate between low and high scorers; values closer to +1 are good; items with negative values need to be replaced (see the sketch below)
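A minimal sketch of the discrimination index with hypothetical scores; the 27% rule for defining the upper and lower groups is a common convention assumed here, not stated on the slide:

```python
# Item discrimination index d = (U - L) / N, using upper and lower scoring
# groups; the 27% group size is a common convention, assumed here.
scores = [(30, 1), (28, 1), (25, 1), (20, 1), (18, 0),
          (15, 0), (12, 1), (10, 0), (8, 0), (5, 0)]   # (total score, item correct)

scores.sort(reverse=True)                   # highest total scores first
n = max(1, round(0.27 * len(scores)))       # size of each extreme group

U = sum(correct for _, correct in scores[:n])    # correct answers in upper group
L = sum(correct for _, correct in scores[-n:])   # correct answers in lower group
d = (U - L) / n
print(f"U = {U}, L = {L}, N = {n}, d = {d:+.2f}")  # closer to +1 is better
```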

TESTING THE ITEMS

Response biases

• A wide range of cognitive biases that pull participants' responses away from an accurate or truthful response

• Socially desirable responding, mainly prevalent in self-report inventories

• Steps to prevent:

– Relatively subtle or socially neutral items

– Use of forced choice items

– Use of special scales within the inventory to detect socially desirable responding

– Rapport during administration

STANDARDIZATION

• Scores are compared with norms obtained by applying the same test to a sample supposed to represent the population

• A major problem is applying norms derived from the large majority to a minority population

• Norms are context-specific and not stable over time

• Z score :

– Raw score expressed in units that indicate the position of an individual relative to distribution of scores

– Score 0 =score at mean

– Score 1 =1 SD above mean

– Score -1=1 SD below mean

Z = (raw score − group mean) / group SD

Fischer & Milfont, 2010

STANDARDIZATION

1. Within subject:

Transformation of scores of each individual using the mean for that individual across all variables

Relative endorsement of an item = raw score − the individual's average across all variables

2. Within culture standardization:

Mean across all items and individuals in a group

3. Double standardization:

Combination of both

Used for the culture free dimensions

Ipsatization

Fischer & Milfont, 2010
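A minimal sketch of z-standardization and within-subject standardization (ipsatization), with hypothetical raw scores:

```python
import statistics

# Hypothetical raw scores: rows are persons, columns are items/variables.
data = [
    [5, 4, 4, 2],
    [3, 2, 2, 1],
    [4, 4, 3, 3],
]

# Z-standardization of one variable across persons: (score - mean) / SD.
first_var = [row[0] for row in data]
m, sd = statistics.mean(first_var), statistics.stdev(first_var)
print("z-scores:", [f"{(x - m) / sd:+.2f}" for x in first_var])  # +1.00, -1.00, +0.00

# Within-subject standardization (ipsatization): subtract each person's own
# mean across all variables, leaving the relative endorsement of each item.
for row in data:
    person_mean = statistics.mean(row)
    print("relative endorsement:", [round(x - person_mean, 2) for x in row])
```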

REVISING THE TEST

• After the first tryout sample, unproductive items are identified so that they can be revised, eliminated, or replaced

• Next step: collect new data (2nd tryout sample)

• Examinees similar to test sample

• The main purpose of collecting additional data is to repeat the item analysis procedure anew

• If major changes are needed, it may be desirable to collect data from a 3rd or even a 4th tryout sample

REVISING THE TEST

1. Cross validation:

– Fresh and independent confirmation of test validity

– The practice of applying the original regression equation in a new sample to determine whether the test predicts the criterion as well as it did in the original sample

2. Validity shrinkage:

– In cross-validation research the test predicts the relevant criterion less accurately in the new sample than in the original sample

– Inevitable

– A major problem when the derivation and cross-validation samples are small and the number of items is large

3. Feedback from examiners

PUBLISHING THE TEST

Production of testing material

Technical manual and user’s manual

• Describe rationale, cautions against anticipated misuses

• Cite representative studies

• Identify special qualification needed to administer and interpret test

• Provide revisions, amendments, and supplements

• Appropriate interpretive aids

• Essential data on validity and reliability

CULTURAL ISSUES

• Reliability and validity of measures within the dominant culture

• Measurement of various psychological constructs is also likely to be influenced by cultural characteristics

• When utilizing standardized measures, the assumption is that the client is similar to the standardization population - this is violated when assessing a client from another culture

• Translating a test into another language tends to decrease its reliability and validity

• One important way of tackling cultural issues is cross-cultural development of the scale

Suzanne M. Skevington, 2002

• INTRODUCTION

• STEPS FOR DEVELOPMENT OF SCALES

• WHOQOL-100: AN EXAMPLE

AN EXAMPLE:WHOQOL-100

• Develop a reliable, valid, and responsive assessment of quality of life that is applicable across cultures

• The aim was to simultaneously develop the assessment in several different cultures and languages rather than translating

• CONCEPT CLARIFICATION:

– First phase of work : international collaborative review to establish a definition of quality of life and an approach to international quality of life assessment

– Protocol development

Group, T. W. (1998)

AN EXAMPLE:WHOQOL-100

Qualitative pilot : second phase of work

WHOQOL Facets:

• Focus group discussions were conducted in each of the 15 centers

• Facets were listed and their definitions created by consensus among health professionals, the general population, and people with disease or impairment

• A maximum of twelve questions was written in each center for each facet in the local language and translated into English

• Global question pool = 1800 questions

• These were evaluated to see to what extent they met the criteria for WHOQOL questions, reducing the pool to 1000 questions

Group, T. W. (1998)

AN EXAMPLE:WHOQOL-100

Generation of the response scales

• Five point Likert scales

• Two anchor points were selected (e.g. 'very satisfied' to 'very dissatisfied', 'not at all' to 'extremely', 'never' to 'always')

• Between the two anchor points, the best descriptors for the 25%, 50%, and 75% positions were identified in each language

Group, T. W. (1998)

AN EXAMPLE:WHOQOL-100

Piloting and psychometric evaluation

• 15 field centers in different countries: 236 questions, 6 domains, 29 facets

• A separate set of 41 questions asked how important each facet was to quality of life

• The centers could add up to 2 additional national/ regional questions per facet

• Scale reliability using SPSS and MAP (multi trait analysis program)

• Items were dropped if they

– failed to discriminate between sick and well populations

– showed <10% of the responses in aggregate

– met the criterion for the global data but failed to meet it in more than 50% of the centers

Group, T. W. (1998)

AN EXAMPLE:WHOQOL-100

• Reliability analysis second wave

• Validity analysis

– Groups discriminant validity in the form of a comparison between healthy and unhealthy individuals

– Items that did not significantly distinguish healthy from unhealthy individuals were highlighted for possible elimination

• Facet and domain inter-correlation

• A decision was taken to select four items per facet

• These decisions led to the selection of 25 × 4 = 100 items

• The revised field-trial instrument is known as the WHOQOL-100

Group, T. W. (1998)

AN EXAMPLE:WHOQOL-100

• Factor analysis yielded 4 factors with eigenvalues greater than 1

• Confirmatory factor analysis:

– The 4-factor model fitted the conceptual 6-factor model with a fit value a little less than 0.9

– Six-domain model: CFI = 0.88

– The six conceptual domains were expected to load onto one single factor (a hypothetical quality-of-life construct), with a comparative fit index of 0.975

Group, T. W. (1998)


CONCLUSION

• Different scales are used in psychiatry, broadly covering personality and intelligence

• Scales can be divided into many types according to their use, function, or form of items

• The development of a scale is a daunting task that requires expertise in both the relevant field and statistics

• Cross-cultural development of a scale is very useful in overcoming cultural confounders

• There are hindrances to developing a new test in place of an existing one, such as copyright laws and economic issues, since development is a costly venture

Thank you