
The Feasibility of Informed Pretests in Attenuating Response-Shift Bias

George S. Howard, Patrick R. Dailey, and Nancy A. Gulanick
University of Houston

Applied Psychological Measurement, Vol. 3, No. 4, Fall 1979, pp. 481-494. © Copyright 1979 West Publishing Co.

Response-shift bias has been shown to contaminate self-reported pretest/posttest evaluations of various interventions. To eliminate the detrimental effects of response shifts, retrospective measures have been employed as substitutes for the traditional self-reported pretest. Informed pretests, wherein subjects are provided information about the construct being measured before completing the pretest self-report, are considered in the present studies as an alternative to retrospective pretests for reducing response-shift effects. In Study 1 subjects were given a 20-minute presentation on assertiveness, which failed to significantly improve the accuracy of self-reported assertiveness. Other procedural influences hypothesized to improve self-report accuracy (previous experience with the objective measure of assertiveness and previous completion of the self-report measure) also were not related to increased self-report accuracy. In a second study, information about interviewing skills was provided at pretest, using behaviorally anchored rating scales, to participants in a workshop on interviewing skills. Response-shift bias was not attenuated by providing subjects with information about interviewing prior to the intervention. Change measures which employed retrospective pretest measures demonstrated somewhat higher (although nonsignificant) validity coefficients than measures of change utilizing informed pretest data.

Although the measurement of change is important in virtually all areas of psychological research, it is an endeavor fraught with problems (Cronbach & Furby, 1970; Linn & Slinde, 1977). In the evaluation of training and treatment interventions, change is frequently measured by means of subject self-reports in pretest-posttest designs, such that the degree of change from pretest to posttest for treatment subjects, relative to their control group counterparts, is assumed to reflect the value of an intervention. With random assignment of subjects, this design (Design 4; Campbell & Stanley, 1963) was thought to provide internally valid results. However, Howard, Ralph, Gulanick, Maxwell, Nance, and Gerber (1979) have recently reported the problem of response-shift bias, which is a threat to internal validity in evaluation studies employing self-report measures. The problem of response-shift bias is handled by substituting retrospective pretest (Then) ratings for traditional pretest (Pre) ratings in the analysis of change (see Howard et al., 1979, for an explanation of response-shift effects and of how retrospective ratings are obtained).

Four potential causes for the Then-Pre self-report differences which have been attributed to response-shift bias are (1) memory distortion, e.g., forgetting; (2) subject response-style effects, e.g., subject acquiescence, social desirability; (3) insufficient insight or awareness of one's own level of functioning with respect to a particular construct; and (4) insufficient understanding of the construct, e.g., assertiveness, interviewing skills, dogmatism.


Howard et al. (1979) in Study 5 investigated the potential for memory distortion effects and concluded that at posttest (Post), subjects' memory of their pretest ratings was not systematically different from their Pre ratings, but that Then ratings were reliably different from both Pre and Memory ratings. Howard and Dailey (1979) in Study 2 arrived at similar conclusions regarding the influence of memory distortions. Regarding subject response style, Howard, Millham, Slaton, and O'Donnell (under review) considered the possibility that response-shift effects may simply be subject acquiescence artifacts. That is, subjects may not believe they have changed a great deal, but their desire to provide the experimenter with a favorable set of results leads them to lower their Then ratings. However, no support for this explanation was found.

The other two possible causal determinants suggest that Then self-report ratings differ from Pre self-report ratings because the intervention enables subjects to attain a greater understanding of the constructs being measured, of their own level of functioning on those dimensions, or both. Gaining a greater awareness of one's level of functioning might be truly treatment dependent and thus may require a retrospective approach to counteract response-shift bias. However, in the case of insufficient understanding of the construct, investigators might provide subjects with enough information about the construct prior to the pretest to allow them to make a more accurate assessment of their pretest level of functioning. This method of rapprochement would be welcomed by many investigators, since utilizing "informed" pretests to lessen response-shift bias would eliminate the need for retrospective ratings. The two investigations reported below take differing approaches to providing subjects with information about the construct designed to improve the accuracy of their self-report pretest ratings, in an attempt to attenuate response-shift effects.

STUDY 1

This study sought to determine whether a short information-giving session could improve the accuracy of self-report assessments. The construct utilized in this study was assertiveness. The effects of two procedural influences that may affect self-report pretest accuracy (previous experience with the self-report instrument or the behavioral measure) were also investigated.

Method

Subjects

Eighty-eight undergraduates participated in the present study for credit toward the requirements of their introductory psychology course. Complete data sets were obtained for 83 students, who were included in the analyses.

Instruments

The College Self-Expression Scale (CSES). The CSES (Galassi, Delo, Galassi, & Bastien, 1974) is a 50-item self-report measure of assertiveness on which respondents describe themselves using a 5-point scale. Scores can range from 0 to 200, with higher scores reflecting a more assertive response pattern. Extensive data on the reliability and validity of the scale are reported by Galassi et al. (1974) and Galassi, Hollandsworth, Radecki, Gay, Howe, and Evans (1976).
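As an illustrative aside, the scoring arithmetic implied by these figures can be sketched in Python. The 0-4 coding per item is inferred from the reported range (50 items, 0 to 200); the reverse-keyed item numbers below are hypothetical placeholders, not the published scoring key.

    # Minimal sketch of CSES-style scoring, assuming each of the 50 items
    # is coded 0-4 so totals span the reported 0-200 range. The
    # reverse-keyed item numbers are hypothetical, not the published key.
    REVERSE_KEYED = {2, 5, 11}

    def score_cses(responses, reverse_keyed=REVERSE_KEYED):
        """Sum 50 item responses (each 0-4), flipping reverse-keyed items."""
        assert len(responses) == 50 and all(0 <= r <= 4 for r in responses)
        return sum(4 - r if i in reverse_keyed else r
                   for i, r in enumerate(responses, start=1))

    print(score_cses([2] * 50))  # uniform mid-scale responding -> 100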

Objective Measure of Assertiveness (OMA). The objective measure of assertiveness consisted of each student's verbal responses to 8 taped stimulus situations. The stimulus situations were the same as or similar to those used by Eisler, Miller, and Hersen (1973). Each student was instructed to listen to each stimulus statement and to respond verbally using the actual words he/she would use if the situation were really happening. All responses to the stimulus statements were audiotaped, coded, and later rated for assertiveness by two trained raters using the Rathus Assertiveness Scale (Rathus, 1973). The mean assertiveness rating for each student was employed in the present study.


Raters

Two senior psychology majors with previous experience using the Rathus Assertiveness Scale (Rathus, 1973) were used as raters of the OMA tapes. The interrater reliability for this instrument in the present study was .89.

Assertive Information and Placebo Sessions

Each student attended either a session which provided information about assertiveness (Assertive Information) or an attention placebo session wherein a number of social activities on campus were discussed (Placebo). Both presentations lasted about 20 minutes. In the Assertive Information session, the experimenter presented didactic information about assertiveness and nonassertiveness, answered students' questions, and gave examples of assertive and nonassertive behavior. In order to check that students had achieved an accurate understanding of the concept of assertiveness, the experimenter asked each student to give an example from his/her own life illustrating assertive and nonassertive behavior and provided further clarification if necessary. The Placebo session followed a similar format of presenting the student information on campus activities, answering questions, giving several examples of possible activities, and asking the students to share some activities in which they might engage that semester. An advanced graduate student conducted all Assertive Information and Placebo sessions.

Procedure

Students were randomly assigned to one of eight groups. These groups differed from one another in the order in which the measures were administered and in whether students attended the Assertive Information or the Placebo session. Table 1 shows the order in which the various activities were completed by the students in each condition. All students were instructed to carefully and honestly complete the tasks of the study. For those who completed the CSES a second time, further instructions were given after the Assertive Information or Placebo session to respond as they now felt was true, whether that meant their responses were similar to or different from their earlier responses. Total testing time, which included the Assertive Information or Placebo session for each student, was approximately 50 minutes. Students were then debriefed and given course credit.

Table 1
Number of Subjects and Order of Activities in the Experimental Conditions



Data Analysis

Four possible influences that might increase the accuracy of self-report ratings can be identified: (1) the student having received Assertive Information; (2) the student having received an exhortation to try to provide accurate ratings on the CSES, which took place at the end of each Assertive Information and Placebo session; (3) the student having previously completed the objective measure; and (4) the student having previously completed the self-report measure. Each self-report rating may be characterized as either possessing or not possessing each of these four potential influences. For example, the first time a student in Condition 1 completed the CSES, he/she would not have had the benefit of any of these four potentially helpful influences. However, the second time that student completed the CSES, he/she might have been influenced by all four effects. The crucial question, then, was whether any of these factors helped to improve the accuracy of CSES ratings.

The correlation coefficient is a standard index of agreement. Correlations represent the degree to which scores on two measures are proportional when expressed as deviations from their means. Accuracy implies the further condition that the absolute values of the scores on these measures be equivalent. Since simple correlation coefficients would not test for this latter condition, correlation was deemed an inadequate analysis, and the authors developed the following procedure, wherein a subject's OMA rating was used as a basis from which to predict an expected CSES score.

The functional relationship between the CSES and the OMA needed to be determined; this was accomplished by using data from treatment subjects in a previous study (Study 3 of Howard et al., 1979). The relationship between Post-CSES and Post-OMA scores was found to be CSES = 13.8 OMA + 69.44. Similarly, the line of best fit for the Then-CSES with Pre-OMA scores was described by CSES = 13.3 OMA + 52.04. The average of these two regression lines, CSES = 13.5 OMA + 60.74, was assumed to be the relationship between these two measures. These comparisons assumed that the OMA ratings for the data utilized to compute Predicted scores were equivalent to the OMA ratings in the present study.

Since the raters and the times of rating were different, some indication of the equivalence of the scales was necessary. One author randomly selected 20 taped responses from each study, and the two remaining authors rated them without knowing from which study they came or how they had been scored. The mean OMA rating for Study 3 of Howard et al. (1979) was 3.78; the authors' rating of these same tapes was 3.70. In the present study, the mean OMA rating for the selected tapes was 3.98, whereas the authors' mean rating of these responses was 3.87. There were no statistical differences between judges' and authors' OMA ratings in the two studies. By entering each student's OMA score into the averaged equation, a Predicted CSES score was obtained for that student.¹ The difference (D-score) between the student's observed CSES score and this Predicted CSES score served as the measure of CSES accuracy. Obviously, the smaller the absolute value of D, the greater the accuracy of the CSES rating. If a sample of D-scores were accurate, the mean of the sample would be zero (i.e., positive and negative scores tending to cancel out). If, however, some source of systematic bias were present, mean D-scores would be positive if students' CSES ratings were overestimates of their assertiveness and negative if they were underestimates.

¹It is important to use both pretreatment and posttreatment ratings in generating the equations that yield Predicted CSES scores because (1) subjects for Study 3 of Howard et al. (1979) were selected for that study if they were highly feminine on the Bem Sex-Role Inventory, and (2) due to the relationship between sex-role orientation and assertiveness, the distribution of Pre-OMA scores in that study was skewed toward the nonassertive end of the scale. Since no selection technique was employed in the present study, OMA scores covered a wide range.


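As an illustrative aside, the D-score procedure can be sketched in Python using the averaged regression line above; the OMA and CSES values below are hypothetical, and only the regression constants come from the text.

    # Minimal sketch of the D-score accuracy index: Predicted CSES from
    # OMA via the averaged line CSES = 13.5 * OMA + 60.74, then
    # D = observed - predicted, with mean D tested against zero.
    import numpy as np
    from scipy import stats

    def predicted_cses(oma):
        return 13.5 * np.asarray(oma) + 60.74

    def d_scores(cses_observed, oma):
        # Positive D means the self-report overestimates assertiveness.
        return np.asarray(cses_observed) - predicted_cses(oma)

    oma = np.array([3.2, 4.1, 2.8, 3.9, 4.5])            # hypothetical
    cses_observed = np.array([130, 141, 118, 127, 139])  # hypothetical

    d = d_scores(cses_observed, oma)
    t, p = stats.ttest_1samp(d, popmean=0.0)  # one-sample t, as in Results
    print(f"mean D = {d.mean():.2f}, t({d.size - 1}) = {t:.2f}, p = {p:.4f}")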

Results

CSES scores were overestimates of the Predicted CSES ratings (mean D-score = 19.53; t(81) = 9.31, p < .001). Table 2 presents a breakdown of the D-scores for CSES ratings after the Assertive Information and Placebo sessions, grouped by the presence or absence of assertive information, prior OMA experience, and prior CSES experience. Statistical comparisons could not be made for the effect of the accuracy instruction.

CSES accuracy was not significantly improved by assertive information or by prior experience with either the OMA or the CSES. Paradoxically, one factor (prior OMA experience) was actually related to a slight, though nonsignificant, decrease in CSES accuracy. If subject response-style effects were feared to contaminate the Then scores in Study 3 of Howard et al. (1979), one might choose to generate Predicted CSES scores employing the equation CSES = 13.8 OMA + 69.44, which is obtained by using Post-CSES and Post-OMA scores only. If this formula were employed, the CSES scores in the present study would still represent overestimations of the Predicted CSES scores, although they would be viewed as somewhat more accurate.

Table 2
Mean D-Scores for CSES Ratings Grouped by the Three Potential Accuracy-Influencing Factors

Discussion

The CSES scores in this study represented overestimates of the predicted CSES values. This is in agreement with the findings of Studies 3 and 4 of Howard et al. (1979), which found that self-reported Pre assertiveness scores were inflated relative to response-shift-free Then scores. These inaccuracies were present in spite of a 20-minute information session on assertiveness. Although the information session might have taught the students about the concept of assertiveness, it may have failed to sensitize them enough to their own level of assertiveness to enable them to provide an accurate assessment of their functioning. Similarly, asking the students for one example in the information session may not have had enough impact for them to relate personally to the construct. In addition, the information session may not have allowed students to generalize the concept of assertiveness from the specific examples given during that session to the wide range of situations included in the self-report and the objective measures of assertiveness. Gormally, Hill, Otis, and Rainey (1974) noted that generalization is difficult to achieve without sufficient practice across various situations to enable subjects to deduce practical guidelines in assertiveness. Finally, there may have been variance in the presentation on assertiveness provided individually to the students.




Ironically, prior experience with the OMA was related to a slight decrease in CSES accuracy. When completing the OMA, students received no feedback on the appropriateness of their responses. In the absence of such feedback, students may have inappropriately evaluated their nonassertive responses as "correct," thus erroneously overestimating their assertiveness.

In order to further investigate the potential of an "informed pretest" in eliminating response-shift bias, a second study was undertaken wherein the "information" was given in a standardized, behaviorally anchored form, and an actual analog interview pretest rather than a taped stimulus was employed. The relative efficacy of an informed pretest versus the conventional pretest in evaluating the effectiveness of a training intervention was also observed.

STUDY 2

This study investigated the influence of information about interviewing skills in attenuating response shift in the self-reported evaluation of a workshop designed to improve interviewer skills in selection interviews in industrial contexts. The program consisted of a week-long seminar that heavily emphasized interviewing skills practice plus didactic training.

In the industrial appraisal literature, two general methods for collecting performance data might be classified as the summative and behavior-specific approaches (Schwab, Heneman, & DeCotiis, 1975). The summative approach defines performance using poorly delineated levels of functioning (e.g., below average, excellent, poor). While variants of this approach may be designed to collect performance data in a more multidimensional fashion rather than as a single global composite assessment, they basically remain at a rather ambiguous level by virtue of the different interpretations of behavior used by the raters.

The behavior-specific approaches give more attention to actual examples of behavior, are more multidimensional in their approach, and attempt to reduce the ambiguity or variance in interpretations that are possible across raters through empirical categorization and scaling. Thus, users of this type of performance appraisal method are provided information from which they may infer appropriate performance category definitions and more reliably assess functioning levels.

In previous studies, Howard and Dailey (1979) utilized a seven-item self-report measure to assess participants' skill levels on six target interviewer dimensions (questioning techniques, structured approach, supportive attitude, rapport building, active listening, and relevant material) plus a rating of overall effectiveness. Subjects had responded to each item by utilizing a scale from 1 (to a very little extent) to 9 (to a very great extent) to indicate the extent to which they felt they possessed the six types of interviewer skills and were capable of conducting an effective interview. However, the subjects had received no explanation of the dimensions, nor were they provided with examples of what might be appropriate or inappropriate behaviors for any dimension. Using this summative-style self-report instrument, Howard and Dailey (1979) found substantial response-shift contamination that significantly lowered the concurrent validity of the self-reported indices of change. It was speculated that if behaviorally anchored scales were employed, subjects would be provided with appropriate information about the six interviewer performance dimensions at pretest, thus allowing them to operationally define each construct. This procedure would then be expected to improve the accuracy of their pretreatment ratings and thereby attenuate response-shift effects.



Method

Subjects

Twenty participants in a week-long interviewing institute served as subjects in this study. They possessed from 1 month to 18 years of prior interviewing experience (median = 6 months) and were heterogeneous with respect to parent company and geographic location. All were engaged in jobs which involved conducting job interviews with potential job candidates or subordinates.

Program

The Institute was conducted over 5 days (36 hours of training) and involved training in the interviewer skills mentioned above. The program was focused around three videotaped practice interviews with undergraduates who were currently seeking employment, and three small-group playback sessions wherein three interviewers reviewed the videotapes of their interviews with a member of the Institute staff whose duty it was to critique each interviewer's performance.

Instrument Development

Eleven college seniors in an undergraduate interviewing course took part in the development of the behaviorally specific performance appraisal scales (BSS) for use in the present study. The procedure followed was that of Smith and Kendall (1963). The process of scale development included (1) identification of criterion dimensions; (2) collection of critical interviewing behaviors representing these dimensions; (3) retranslation; (4) scaling; and (5) final editing and scale construction. Specifically, the students together with one of the authors first identified and agreed upon the six dimensions of interviewer behavior mentioned above. Critical incidents (Flanagan, 1954) were collected by the students from personal experience and from viewing 40 archival Institute tapes of selection interviewing. The focus was on the behaviors of the interviewers in these six criterion areas. It was the task of the students to describe the behaviors observed on the tapes that were particularly noteworthy (either positive or negative). Included in the description of an incident were a description of the circumstances, the behavior of the interviewer, its consequences for the interview, and an indication of the criterion area in which the behavior fit. This phase resulted in more than 300 critical interviewing behaviors (172 usable) across the six interviewer performance dimensions.

The retranslation step involved feeding back to the students the incidents they had previously collected, with their task being to read each incident and individually assign it to the most appropriate of the six dimensions. This represented a quality-control step in BSS development; a cutoff of 55% (6 of 11 students) or greater agreement was used to ensure that there was reasonable agreement regarding the dimension to which each particular incident belonged. Items were discarded when the judges were unable to agree on an appropriate category assignment. The judges were again presented with the list of interviewer behaviors (N = 110), grouped according to the six performance dimensions. Each rater's task this time was to independently provide a scale value representing the favorability of each incident using a 9-point scale ranging from 1 (extremely unfavorable) to 9 (extremely favorable). Means and standard deviations were calculated for each incident, and items whose standard deviations exceeded 1.75 were discarded. The mean for each of the remaining incidents (N = 52) served to anchor its dimension; when fully constructed, each dimension had behavioral examples at several scale points across the dimension.
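As an illustrative aside, the retranslation and scaling filters can be sketched in Python; the incident names and ratings below are invented, while the 6-of-11 agreement cutoff, the 9-point favorability scale, and the 1.75 standard-deviation ceiling come from the text.

    # Minimal sketch of the retranslation and scaling steps of BSS
    # construction (Smith & Kendall, 1963), with toy data.
    from collections import Counter
    from statistics import mean, stdev

    def retranslate(assignments, min_agree=6):
        """Keep an incident only if >= min_agree of the 11 judges assign
        it to the same dimension; return {incident: winning dimension}."""
        kept = {}
        for incident, dims in assignments.items():
            dim, votes = Counter(dims).most_common(1)[0]
            if votes >= min_agree:
                kept[incident] = dim
        return kept

    def scale_anchors(favorability, max_sd=1.75):
        """Discard incidents whose 1-9 ratings have SD > max_sd; the mean
        rating of each survivor becomes its anchor value."""
        return {inc: mean(r) for inc, r in favorability.items()
                if stdev(r) <= max_sd}

    assignments = {
        "interrupts the applicant mid-answer":
            ["supportive attitude"] * 9 + ["rapport building"] * 2,
        "opens with a vague question":
            ["questioning techniques"] * 5 + ["structured approach"] * 4
            + ["relevant material"] * 2,
    }
    favorability = {"interrupts the applicant mid-answer":
                    [2, 1, 2, 3, 2, 1, 2, 2, 3, 2, 1]}

    print(retranslate(assignments))   # the second incident is discarded
    print(scale_anchors(favorability))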


Procedure

During the introductory session of the Institute, participants were asked to examine the BSS scales and to indicate their level of interviewing skill on the seven items (Pre). Before the actual training began, each participant interviewed a job applicant, and all interviews were recorded on videotape. This served as the behavioral Pre measure (Beh Pre). Interviews were conducted with undergraduates serving as interviewees and lasted for 30 minutes or less. The Institute proceeded as planned, and part of the conclusion of training was the completion of a final videotaped interview, serving as the behavioral Post measure (Beh Post). Following all training and feedback sessions, the participants again completed the BSS, answering each item once as they felt they were at that point in time (Post) and once as they felt they had been at the beginning of the workshop (Then). Subjects were instructed to feel free to agree with their Pre self-report ratings if they felt that they were accurate or to disagree with those ratings if they now saw them as being inaccurate. The participants were provided with a brief explanation of the purpose of the study, and the Institute was concluded.

Six upper-level undergraduates, who themselves were in a similar interviewing class, served as judges to rate the videotapes. An 18-hour training period was conducted by one of the authors wherein the scale dimensions were explained and discussed. Archival Institute tapes were viewed, rated, and discussed until all raters were comfortable that they understood the dimensions to be assessed. The 40 taped interviews were coded, randomized, and shown to the six raters, who independently assessed each interviewer's skill along the seven criteria (six specific plus one overall) using a scale from 1 (to a very little extent) to 9 (to a very great extent). These were identified as judges' skill ratings.

Additionally, based upon recommendations by Fear (1973) and Banaka (1971), the authors developed a set of behavioral composite variables with favorable and unfavorable interviewer behaviors pertaining to each interviewing dimension. Judges were instructed to tabulate incidents of these behaviors while viewing the tapes, and these entries were used to form a linear composite for each of the six interviewing dimensions (behavioral incidents). For example, the supportiveness composite was formed by totaling the number of appropriate interruptions plus the number of agreement statements minus the number of inappropriate interruptions.

Reliabilities for the seven judges' skill ratings ranged from .89 to .96, and reliabilities for composites of the six behavioral incidents ranged from .74 to .95 (see Table 4). The mean ratings of the judges were employed as the unit of analysis for both the skill ratings and the behavioral incidents.
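As an illustrative aside, both the linear composite and the reliability correction noted in Table 4 can be sketched in Python; the tally counts and the single-rater correlation below are hypothetical.

    # Minimal sketch of the supportiveness composite (favorable tallies
    # add, unfavorable subtract) and the Spearman-Brown step-up for the
    # reliability of a mean of k raters.
    def supportiveness_composite(appropriate_interruptions,
                                 agreement_statements,
                                 inappropriate_interruptions):
        return (appropriate_interruptions + agreement_statements
                - inappropriate_interruptions)

    def spearman_brown(r_single, k):
        return k * r_single / (1 + (k - 1) * r_single)

    print(supportiveness_composite(3, 7, 2))    # -> 8
    print(round(spearman_brown(0.45, k=6), 2))  # six judges -> 0.83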

Results

Due to the number of dependent variables (seven BSS scales, seven judges' skill ratings, and six behavioral incidents scales) and their pattern of high intercorrelations, a multivariate analysis of variance (MANOVA) was employed to test for treatment effects. Table 3 presents mean Pre, Post, and Then self-report scores along with the results of univariate F-tests of Pre/Post and Then/Post differences for this study. Subjects reported significant before- to after-workshop changes, whether measured by the Pre/Post ratings (multivariate F(7, 13) = 16.92, p < .001) or the Then/Post ratings (multivariate F(7, 13) = 9.11, p < .001). Significant differences were found for all subsequent univariate Pre/Post comparisons and all Then/Post comparisons. A significant treatment effect was also found for Pre/Post comparisons of judges' skill ratings (multivariate F(7, 13) = 18.85, p < .001) and behavioral incidents (multivariate F(6, 14) = 26.96, p < .001). Table 4 presents mean Pre and Post judges' skill ratings and behavioral incidents in addition to the results of the 13 univariate F-tests.

Table 3
Mean Pre, Post, and Then Ratings and Results of Univariate F-Tests of Pre/Post and Then/Post Differences for Each Self-Report Item
* p < .05; ** p < .01
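As an illustrative aside, the univariate follow-up tests can be sketched in Python via the identity F(1, n - 1) = t² for a paired contrast; the data below are simulated, not the Institute's, and the multivariate tests themselves are not reproduced here.

    # Minimal sketch of per-item repeated-measures F-tests (squared
    # paired t) on simulated Pre/Post data for 20 subjects x 7 items.
    import numpy as np
    from scipy import stats

    def paired_f(pre, post):
        t, p = stats.ttest_rel(post, pre)
        return t**2, p

    rng = np.random.default_rng(0)
    pre = rng.normal(4.0, 1.0, size=(20, 7))
    post = pre + rng.normal(1.5, 1.0, size=(20, 7))  # simulated gain

    for item in range(7):
        f, p = paired_f(pre[:, item], post[:, item])
        print(f"item {item + 1}: F(1, 19) = {f:.2f}, p = {p:.4f}")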

In an attempt to ascertain the relative effectiveness of the Then/Post self-report approach relative to the Pre/Post self-report approach, both Pre/Post and Then/Post self-report change scores were correlated with Pre/Post changes in judges' skill and behavioral incidents ratings. Table 5 presents the results of these correlations, together with tests (the Hotelling-Williams test of ρ₁₂ = ρ₁₃) of the equality of two Pearson correlations computed among three variables in a single sample. On the judges' skill ratings, six of the seven comparisons favored the Then/Post approach, with these differences reaching significance (one-tailed) in one instance. The mean correlation of changes in judges' skill ratings with Pre/Post self-report change was .17, whereas the mean correlation of changes in judges' skill ratings with Then/Post self-report change was .38. With regard to the correlations with the behavioral incidents, the Pre/Post self-report approach was superior in three cases (none of which would have reached significance had the test been two-tailed); this trend was reversed in the other three cases (one significant). The mean correlation of changes in behavioral incidents with Pre/Post self-report change was .04, and the mean correlation of changes in behavioral incidents with Then/Post self-report change was .08.
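As an illustrative aside, the Hotelling-Williams comparison of two dependent correlations can be sketched in Python; the formula follows Williams's t as commonly stated, the .38 and .17 are the mean correlations reported above, and the predictor intercorrelation r23 is hypothetical.

    # Minimal sketch of the Hotelling-Williams test of rho12 = rho13 for
    # two correlations sharing variable 1, computed in a single sample.
    import math
    from scipy import stats

    def williams_t(r12, r13, r23, n):
        det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
        rbar = (r12 + r13) / 2
        t = (r12 - r13) * math.sqrt(
            (n - 1) * (1 + r23)
            / (2 * det * (n - 1) / (n - 3) + rbar**2 * (1 - r23)**3))
        return t, n - 3  # t statistic and its degrees of freedom

    t, df = williams_t(r12=0.38, r13=0.17, r23=0.60, n=20)  # r23 assumed
    p_one_tailed = stats.t.sf(abs(t), df)
    print(f"t({df}) = {t:.2f}, one-tailed p = {p_one_tailed:.3f}")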

Finally, the magnitude of response shift (Pre-Then self-report differences) was compared for a behaviorally specific scale (the present study) versus summative self-report measures (Studies 1 and 2 of Howard & Dailey, 1979).² Pre-Then differences would be reduced if (1) the information supplied to subjects about the characteristics of selection interviewing skills was helpful in reducing response shift and (2) the BSS adequately supplied that information. Table 6 presents mean Pre-Then differences and the results of univariate F-tests comparing the three aforementioned studies. As expected, there was no difference in the magnitude of response shift when the data from the two studies which employed the summative self-report index were compared (multivariate F(7, 54) = .698, p = .67). Similarly, the differences between Study 1 of Howard and Dailey and the present study (multivariate F(7, 54) = 1.51, p = .18) were not significant, nor did the differences between Study 2 of Howard and Dailey and the present study reach significance (multivariate F(7, 54) = 1.26, p = .28).

²The Institute programs were essentially identical, and data collection procedures differed only with respect to the use of the BSS in the present study. Since the assignment of participants to programs was in no way random, concern should be with the initial equivalence of the three samples for these comparisons. Since the Pre self-report was a behaviorally specific scale for the present study but summative for the other two, comparisons along these dimensions would be suspect. Similarly, no judges' ratings were available from Study 1 of Howard and Dailey (1979), whereas different sets of raters were employed for Study 2 of Howard and Dailey (1979) and the present study. Consequently, comparisons of Pre scores of judges' ratings would be of dubious value. Hence, the initial equivalence of the three samples will remain untested. The authors recommend that these comparisons among studies be viewed as anecdotal only, and all conclusions will be stated with extreme caution.


Table 4
Mean Pre and Post Scores and Results of Univariate F-Tests for Judges' Skill Ratings and Behavioral Incidents of Videotaped Interviews
* p < .05; ** p < .01
ᵃ Reliabilities were corrected for attenuation using the Spearman-Brown procedure.

Table 5
Correlations of Change in Judges' Skill Ratings and Behavioral Incidents with Pre/Post and Then/Post Self-Report Change, and Results of Tests of Differences Between Correlations
* p < .05

Table 6
Mean Pre-Then Self-Report Differences and Results of Univariate F-Tests Comparing the Three Studies

DISCUSSION

All measures of change (Pre/Post self-report, Then/Post self-report, Pre/Post judges' skill ratings, Pre/Post behavioral incidents) found significant before-to-after changes in the workshop participants. These findings replicate both studies reported by Howard and Dailey (1979).

Regarding the anticipated higher validity of the Then/Post indices of change relative to the Pre/Post self-report measures for predicting criterion change, partial support was found. As found previously by Howard and Dailey (1979), the mean correlation of change in judges' skill ratings with Then/Post self-reported change was superior to the same comparison using the Pre/Post self-report changes (.38 versus .17). Change in behavioral incidents was not related to either Pre/Post or Then/Post self-reported change. This finding is somewhat surprising, given Howard and Dailey's reported relationships of Pre/Post self-reported change with change in behavioral incidents (-.05) and of Then/Post change with change in behavioral incidents (.33). Several reasons for the failure to replicate might be that (1) different judges were used in the present study; (2) several new behavioral incidents were added to Howard and Dailey's list; or (3) the change from global to behaviorally specific scales might have attenuated the relationship between self-report measures and behavioral incidents. However, the authors find none of these potential causes compelling.

The question of whether providing information to subjects via behaviorally specific scales would remove response-shift effects must be answered in the negative. A significant response shift was noted in the present study. While the magnitude of response shift in this study was slightly smaller than was observed in the two studies by Howard and Dailey, the differences did not approach statistical significance.







More generally, there are many reasons why it is difficult to measure change using gain scores (Cronbach & Furby, 1970; Linn & Slinde, 1977). The primary problem is that there is greater error associated with difference scores than with single measurements. Therefore, researchers generally prefer posttest comparisons between experimental and control groups to comparisons of change scores. Unfortunately, response-shift bias renders posttest-only comparisons invalid, since treatment and control subjects' ratings are made with respect to different scales (see Howard et al., 1979, pp. 20-21). In this case retrospective measures are suggested to control for the effects of response-shift bias. The retrospective approach is preferred not because it allays the problems identified by Cronbach and Furby, but because their solution to the problem (posttest-only comparison) is no longer appropriate. Consequently, it is necessary once again to measure change as the lesser of two evils.

However, if treatment and control subjects provide ratings with respect to different scales, does this not pose a threat to construct validity (Cook & Campbell, 1976)? Regrettably, no clear-cut resolution of this question can be given at this time. However, progress has been made by Golembiewski, Billingsley, and Yeager (1976), who identified three conceptually different types of change: (1) alpha change (true change; changes in level or state over time taken on a constantly calibrated instrument); (2) beta change (observed variation where apparent change is due to recalibration of the instrument between assessments); and (3) gamma change (reconceptualization by the participant of the phenomenon that is measured). Beta change would seem to be related to response-shift effects and is handled by the use of retrospective measures. With the control of beta effects, the assessment of alpha change becomes straightforward. However, the quantification of gamma effects, which would constitute the threat to construct validity, is difficult. Golembiewski et al. (1976) recommended the use of factor analysis, with a comparison of factor structures at pretest and posttest representing an estimate of gamma change. This approach is unsatisfactory for several reasons. Terborg, Howard, and Maxwell (under review) suggested the use of profile analysis as a means of assessing gamma change independently of alpha and beta change (independence of assessment was one of the major problems of the factor analytic approach). However, this newer approach to gamma assessment requires substantial additional computations. Terborg et al. (under review) summarized by pointing out that "the measurement of change is a complex and problematic endeavor.... Once again, we find that human beings are complex and cognitive beings. Our suggestions are intended to enable us to appreciate further human change in its complexities" (p. 24).

CONCLUSIONS

Data from the two studies reported herein suggest that providing subjects with information at pretest about the target construct (assertiveness or interviewing skills) will not substantially alter pretest ratings nor attenuate response-shift effects. For the present, it must be concluded that response-shift effects appear to be truly treatment dependent; hence, a retrospective approach may be required to remove their deleterious effects. The major untested explanation for response-shift effects is that subjects change their awareness of their level of functioning with regard to the construct.

References

Banaka, W. H. Training in depth interviewing. New York: Harper & Row, 1971.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally, 1963.

Cook, T. D., & Campbell, D. T. The design and conduct of quasi-experiments and true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.

Cronbach, L. J., & Furby, L. How we should measure "change"—or should we? Psychological Bulletin, 1970, 74, 68-80.

Eisler, R. M., Miller, P. M., & Hersen, M. Components of assertive behavior. Journal of Clinical Psychology, 1973, 29, 295-299.

Fear, R. A. The evaluation interview. New York: McGraw-Hill, 1973.

Flanagan, J. C. The critical incident technique. Psychological Bulletin, 1954, 51, 327-358.

Galassi, J., Delo, J., Galassi, M., & Bastien, S. The college self-expression scale: A measure of assertiveness. Behavior Therapy, 1974, 5, 165-171.

Galassi, J., Hollandsworth, J. G., Radecki, J. C., Gay, M., Howe, M. R., & Evans, C. Behavioral performance in the validation of an assertiveness scale. Behavior Therapy, 1976, 7, 447-452.

Golembiewski, R. T., Billingsley, K., & Yeager, S. Measuring change and persistence in human affairs: Types of change generated by OD designs. Journal of Applied Behavioral Science, 1976, 12, 133-157.

Gormally, J., Hill, C., Otis, M., & Rainey, L. A microtraining approach to assertion training. Journal of Counseling Psychology, 1974, 22, 299-303.

Howard, G. S., & Dailey, P. R. Response-shift bias: A source of contamination of self-report measures. Journal of Applied Psychology, 1979, 64, 144-150.

Howard, G. S., Millham, J., Slaton, S., & O'Donnell, L. Influence of subject response-style effects on retrospective measures. Journal of Research in Personality (under review).

Howard, G. S., Ralph, K. M., Gulanick, N. A., Maxwell, S. E., Nance, D., & Gerber, S. L. Internal invalidity in pretest-posttest self-report evaluations and a reevaluation of retrospective pretests. Applied Psychological Measurement, 1979, 3, 1-23.

Linn, R. L., & Slinde, J. A. The determination of the significance of change between pre- and posttesting periods. Review of Educational Research, 1977, 47, 121-150.

Rathus, S. Instigation of assertive behavior through videotape-mediated assertive models and directed practice. Behaviour Research and Therapy, 1973, 11, 57-65.

Schwab, D. P., Heneman, H. G., III, & DeCotiis, T. A. Behaviorally anchored rating scales: A review of the literature. Personnel Psychology, 1975, 28, 549-562.

Smith, P. C., & Kendall, L. M. Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 1963, 47, 149-155.

Terborg, J. T., Howard, G. S., & Maxwell, S. E. Evaluating planned organizational change: A proposed method for the assessment of alpha, beta, and gamma change at the individual and group level. Academy of Management Review (under review).

Acknowledgments

The authors thank Robert Pritchard and Scott Maxwell for their comments on earlier drafts of this manuscript.

Author’s Address

Send requests for reprints or further information to George S. Howard, Department of Psychology, University of Houston, Houston, TX 77004.
