Content validity in evaluation and policy-relevant research




Evaluation and Program Planning, Vol. 8, pp. 87-91, 1985

Printed in the USA. All rights reserved.

0149-7189/85 $3.00 + .00 Copyright © 1985 Pergamon Press Ltd

CONTENT VALIDITY IN EVALUATION AND POLICY-RELEVANT RESEARCH

MELVIN M. MARK, ROBERT C. SINCLAIR, and GARY L. BENSON

The Pennsylvania State University

ABSTRACT

It is argued that, in assessing attitudes about policy issues, the policy area can be conceptualized as a “domain of content,” and standards of content validity applied. The role of content validity in policy-relevant research is illustrated in a study that contrasts the findings of two surveys concerning public opinion toward gun control. Attitudes were assessed with items from the two surveys, which focused on different aspects of the policy domain of gun control. It was found that inadequate content validity threatened inferences about the overall level of support for gun control, but not inferences about opinion differences between the sexes or between respondents of varying political affiliation. Discussion focuses on the general conditions under which inadequate content validity threatens inferences from research. Finally, it is argued that attention to content validity should enhance construct validity.

Program evaluation and other policy-relevant research depends heavily on self-report data. In evaluation, questionnaires are frequently employed, for example, to assess community perceptions of need for a service or to measure participants’ attitudes toward a program. Surveys are often used as evidence about public opinion in debates about policy formulation.

A sizeable literature exists on the possible shortcomings of the use of self-report data in policy-making processes, as well as for other purposes. In particular, the fallibility of single items is widely recognized (e.g., Cook & Campbell, 1979; Duncan & Schuman, 1974), leading to the widespread use of multiple-item scales. The underlying notion is that while each item is an imperfect representation of the construct of interest, by combining a number of items that are imperfect in different ways, one “triangulates” on the construct of interest.¹

The purpose of this paper is to call attention to a problem that multiple-item scales do not necessarily preclude and, because of the incomplete way such scales are frequently reported, may even disguise: the absence of content validity. Content validity refers to the adequacy with which some specific domain of content is sampled (American Psychological Association, American Educational Research Association, & National Council on Measurement in Education, 1974; Anastasi, 1982; Cronbach, 1971). Most extant examples involve testing, such as how well a mathematics test covers the various math skills taught in a course.²

¹Of course, the use of multiple-item scales can offer other advantages, such as increased reliability.

In the present paper, we argue that (a) a policy area can be conceptualized as a “domain of content,” (b) content validity is weak when that domain is inadequately represented in an attitude questionnaire or survey, (c) such poor content validity can undermine the validity of one’s inferences, but (d) the extent to which inferences are jeopardized will depend on the nature of the inference, the characteristics of the domain of content, and the nature of the content validity problem. We consider two types of inferences: first,

²We should note that some writers prefer that such terms as content representativeness, content relevance, and domain clarity be used rather than the term content validity (Fitzpatrick, 1983; Messick, 1975). We will adopt the more common usage of the term content validity here. The preference for other terms stems largely from the belief that attention to content validity should not preclude (or be used as an excuse to avoid) other, empirical validation studies. While we agree with this position, we also believe in the value and significance of achieving content validity.

The authors wish to thank Mary Beth Crowe and Hoben Thomas for their comments on a draft of this manuscript. Reprint requests should be sent to Melvin Mark, Department of Psychology, The Pennsylvania State University, 542 Moore Bldg., University Park, PA 16802.




inferences about the characteristics of a single population (e.g., the level of support for Candidate A among the population of “voters”); and second, inferences about the differences between two or more groups (e.g., whether males and females differ in their support for Candidate A, or whether the attitudes of program participants differ from those of non-participants). We describe how the validity of these two sorts of inferences is threatened by inadequate content validity, as a function of the domain of content and the nature of the content validity problem. We illustrate these points with an empirical example involving public opinion on gun control.

BACKGROUND OF THE STUDY

The issue of gun control and the private ownership of guns is one of the more hotly debated social issues of recent years. Partisans on either side argue for and against the effectiveness of gun control legislation, often arming themselves with social science reports as evidence. An example of such use of social science data occurred when, at virtually the same time in 1978, two national probability sample surveys on gun control were conducted, purportedly to inform policy debate. One survey was conducted for the National Rifle Association (NRA), and the other for the Center for the Study and Prevention of Handgun Violence (CSPHV). The two surveys resulted in substantially different conclusions about public opinion toward gun control. In large part, these discrepant results arose because the two surveys differentially mapped out the domain of gun control policy (see also Rossi & Wright, 1985, for discussion of the role of ideology in the different conclusions drawn from the two surveys).

Rossi and Wright (1985; Wright, Rossi, & Daly, 1983) have contrasted the NRA and CSPHV surveys, and state that “although the two surveys are ostensibly on the same issue, there are very few questions that deal with the same specific topic” (p. 313) [emphasis added]. For example, the CSPHV-sponsored survey focuses on “handguns” and “handgun violence” while the NRA-sponsored survey primarily concerns “firearms” and the likelihood that gun legislation will result in “crime control.” We describe these differences in how the domain of gun control policy is mapped out as a problem of content validity, which arises from the failure of both sets of survey researchers to (a) specify satisfactorily the entire domain of gun control policy and (b) sample adequately from the entire domain.³

A study was carried out contrasting the NRA and CSPHV surveys. Its purpose is to illustrate the effects of content validity problems on policy-relevant inferences.

METHOD

Respondents

The respondents were 23 men and 23 women surveyed on the main campus of the Pennsylvania State University. Pedestrians were recruited for a short survey, and received no remuneration for participation.

The Questionnaire

We randomly selected 15 questions from the National Rifle Association survey and 15 from the Center for the Study and Prevention of Handgun Violence survey. This sampling was done to create a representative set of questions from both surveys while keeping our questionnaire reasonably short. The 30 items selected were randomly assigned to position in a paper-and-pencil questionnaire. Each item was followed by a 5-point Likert-type scale, bounded by verbal anchors at either end (e.g., “strongly agree” and “strongly disagree”).
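As a sketch, this item-selection step can be expressed in a few lines of Python. The item texts and pool sizes below are placeholders, since the full NRA and CSPHV questionnaires are not reproduced in this paper:

```python
import random

# Placeholder item pools; the real surveys' wordings and pool sizes are
# assumptions, not taken from the original questionnaires.
nra_pool = [f"NRA item {i}" for i in range(1, 41)]
csphv_pool = [f"CSPHV item {i}" for i in range(1, 41)]

rng = random.Random(1985)  # fixed seed so the draw is reproducible

# Randomly select 15 items from each survey, without replacement.
selected = rng.sample(nra_pool, 15) + rng.sample(csphv_pool, 15)

# Randomly assign the 30 selected items to questionnaire positions.
rng.shuffle(selected)
```

Sampling without replacement (`random.sample`) guarantees no item appears twice, and the final shuffle intermixes the two surveys' items so that source is not confounded with serial position.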

The following is an example of an item selected from the CSPHV survey: “We should require the registration of all handguns at the time of purchase or transfer.” An illustrative item selected from the NRA survey is: “Requiring all handgun owners to be licensed would cut down the number of violent crimes.” The Likert-type scales for both of these items were anchored “strongly disagree” and “strongly agree.”⁴

After completing the 30 Likert-type items on gun control, respondents were asked to indicate their sex and political affiliation (Democrat, Republican, or other).

Procedure

Respondents were solicited from three campus locations with moderate pedestrian traffic. Respondents were chosen according to a prespecified sampling plan: Men and women were selected in random order, and according to a random schedule the researcher approached one of the next five pedestrians of the appropriate sex. If this person chose not to participate, the next pedestrian of the same gender was approached.
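A sampling plan of this kind can be sketched as follows. The function name and the uniform choice among the next five pedestrians are illustrative assumptions about how the prespecified schedule might be generated, not the authors' actual procedure:

```python
import random

rng = random.Random(46)  # arbitrary fixed seed for a reproducible plan

def make_sampling_plan(n_respondents=46):
    """Illustrative version of the prespecified plan: male and female
    targets in random order; for each target, approach one of the next
    five pedestrians of that sex (modeled as a random pick from 1-5)."""
    targets = ["M"] * (n_respondents // 2) + ["F"] * (n_respondents // 2)
    rng.shuffle(targets)
    # Each entry (sex, k) means: approach the k-th of the next five
    # same-sex pedestrians at the sampling location.
    return [(sex, rng.randint(1, 5)) for sex in targets]

plan = make_sampling_plan()
```

Fixing the schedule before data collection, as the paper describes, removes the interviewer's discretion over whom to approach, which is the point of a prespecified plan.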

Respondents were asked to complete a short survey for a class project. They were given a copy of the questionnaire, and asked to indicate which response best represented their attitude toward each item. Upon completing the survey, respondents were debriefed as to the nature of the study.

³The problem can be described using other terminology. For instance, Rossi and Wright (1985) label this as a form of specification error in descriptive research.

⁴Copies of the entire 30-item questionnaire are available upon request from the authors.

RESULTS

Three summative scales were created by combining (a) the 15 items from the NRA survey (the “NRA scale”); (b) the 15 items from the CSPHV survey (the “CSPHV scale”); and (c) all 30 items from both surveys (the “combined scale”). Each of these three summative scales was divided by the number of items in it, so that all scales were on a comparable 1-5 metric, with lower numbers representing more support for gun control.
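The scale construction can be sketched numerically. The response matrix below is simulated, since the paper's raw data are not reproduced; only the scoring logic follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 1-5 responses for 46 respondents x 30 items; columns 0-14
# stand in for the NRA items and columns 15-29 for the CSPHV items.
responses = rng.integers(1, 6, size=(46, 30))

# Each summative scale is divided by its number of items, so all three
# scores share the same 1-5 metric (lower = more support for gun control).
nra_scale = responses[:, :15].mean(axis=1)
csphv_scale = responses[:, 15:].mean(axis=1)
combined_scale = responses.mean(axis=1)
```

Because the two subscales have equal length, the combined scale is simply the average of the NRA and CSPHV scores for each respondent.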

One question of interest involves the effects of inadequate content validity on estimates of population parameters for a single group. To examine this, we contrasted scores on the NRA and CSPHV scales. A paired t test indicates that the two scales differ significantly, t(22) = 5.07, p < .01. As expected, the Center for the Study and Prevention of Handgun Violence scale (M = 2.85) indicates more support for gun control than does the National Rifle Association scale (M = 3.15). Thus, focusing on different aspects of the domain of gun control policy would lead to different conclusions about public opinion toward gun control. In fact, our findings probably underestimate the effect of inadequate content validity in that the two surveys do contain some overlap, despite their general differences in focus (Rossi & Wright, 1985). Rossi and Wright (1985) contend that together the two surveys do a fairly complete job of representing the domain of gun control policy. It is therefore interesting to note that the mean for the combined scale is 2.998 on a 5-point scale.
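A paired t test of this kind can be computed directly from the per-respondent differences. The scores below are simulated around the reported scale means (2.85 and 3.15); they are not the study's data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated paired scale scores for the same 46 respondents.
csphv = rng.normal(2.85, 0.6, size=46)
nra = csphv + rng.normal(0.30, 0.40, size=46)

# Paired t test: t = mean(d) / (sd(d) / sqrt(n)), on n - 1 degrees of
# freedom, where d is the within-respondent difference between scales.
d = nra - csphv
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

With SciPy available, `scipy.stats.ttest_rel(nra, csphv)` yields the same statistic together with its p value. The paired design is what makes the test appropriate here: each respondent contributes one score on each scale.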

A second question of interest concerns the effects of inadequate content validity on the detection of between-group differences. To examine this, we first contrasted the responses of men and women on both the NRA and CSPHV scales. No difference was observed between men and women on either scale (with t[44] ≤ 1.00 for both comparisons). We also examined the relationship between political affiliation and attitude as measured by the two scales, contrasting Democrats (N = 17), Republicans (N = 16), and others (N = 13). An analysis of variance revealed no difference among these three groups on either scale (with F[2, 43] < 1.50 for both scales). The possibility exists that because of small sample size we lack the necessary power to detect existing between-group differences. Whether or not this is the case, the two scales seem to tap gender and party affiliation differences equally well; for example, the Republican-Democrat difference is 0.16 on the CSPHV scale and 0.15 on the NRA scale.
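The one-way analysis of variance behind the F[2, 43] comparison can be sketched as follows; the group scores are simulated from a common distribution, i.e., with no built-in group difference:

```python
import numpy as np

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group mean square divided by
    within-group mean square, on (k - 1, n - k) degrees of freedom."""
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(2)
# Simulated scale scores: 17 Democrats, 16 Republicans, 13 others,
# matching the paper's group sizes (hence df = 2 and 43).
groups = [rng.normal(3.0, 0.6, size=n) for n in (17, 16, 13)]
f_stat = one_way_f(groups)  # compared against the F(2, 43) distribution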

That the NRA and CSPHV scales give comparable answers about between-group differences is Derhaps not surprising given the relationship between the two. The two scales correlated 0.78, p < .Ol.

DISCUSSION

Our results indicate that the NRA and CSPHV surveys lead to certain different inferences, apparently because they focus on different parts of the policy domain of gun control. However, the results also indicate that poor content validity does not necessarily invalidate all inferences. The effect of content validity problems, we suggest, depends on the nature of the inference, the characteristics of the domain of content, and the nature of the content validity problem.

First, consider inferences about the characteristics of a single, given population, such as inferences about average public opinion toward a policy issue. Such in- ferences will often be biased by content validity prob- lems. For instance, it is not surprising that in the pres- ent study and in previous national probability sample surveys the NRA and CSPHV surveys disagree about the level of public support for gun control; the two surveys focus on different aspects (or subdomains) of the overall domain of gun control. However, such bias is not inevitable. If the population parameter of in-

terest does not vary across subdomains - for example, if Americans have the same opinion about controlling handguns as they have about controlling all guns - then weak content validity will not lead to bias. Of course, such stability across subdomains may not be common in major social policy areas.

Next, consider inferences about between-group dif- ferences, such as male-female differences in opinion or attitude differences between participants and nonpar- ticipants in some program. In the present study, the NRA and CSPHV survey items led to the same conclu- sion about the relative opinions of men and women and of respondents with different political affiliations (i.e., no difference). Thus, in the present study inade- quate content validity did not appear to bias between- group comparisons. However, in some cases such bias should occur. Specifically, inferences about between- group differences are threatened by poor content validity to the extent that (a) the difference between groups varies across subdomains, and (b) relevant sub-

Page 4: Content validity in evaluation and policy-relevant research

90 MELVIN MARK, ROBERT SINCLAIR, and GARY BENSON

domains are not equally represented in a measure. For but not another (e.g., male-female differences). example, if males and females agreed about controls Therefore, the magnitude of bias depends on the on handguns but disagreed about controls on other nature of the specific inference and its relationship firearms, then a measure that did not represent both to the subdomains, and on the extent to which a subdomains would lead to an inaccurate inference measure represents the various subdomains across about gender differences in opinions about the policy which the inference may vary (e.g., the degree to which area “gun control.” opinions about both handgun control and control of

In short, the extent to which inadequate content other firearms are assessed, if public opinion varies validity biases inferences depends largely on the extent across these two subdomains). Thus it is impossible to to which the policy area is a homogenous domain, as specify in advance the consequences of inadequate opposed to a multifaceted collection of distinct subdo- content validity without considerable substantive mains. Of course, the existence of subdomains may af- understanding. feet one sort of inference (e.g., average public opinion)

CONTENT VALIDITY IN POLICY AND EVALUATION RESEARCH: PAST PERSPECTIVES AND FUTURE ACTIONS

The sorts of inferential problems we are discussing already receive attention in evaluation and policy research, typically in the context of construct validity (Cook & Campbell, 1979). However, the techniques of construct validation (i.e., providing evidence of con- vergent and, less often, discriminant validity) may lead to the recognition of content validity problems only slowly, if at all. Progress is especially likely to be slow if the content validity problems of a new measure do not obscure the measure’s relationship with the variables used for initial construct validation- and this can occur even though the measure has content validity problems that invalidate other substantive or policy-relevant inferences. Further, the measures used in much policy research are not subjected to the exten- sive empirical study required for construct validity to accrue. Thus we believe that existing concern for con- struct validity may not suffice to prevent the inferen- tial problems associated with inadequate content validity.

What is to be done then? A simple answer is to em- phasize concern for content validity in evaluation and policy research. Measures of attitudes toward a policy issue can generally be conceptualized in terms of con- tent validity, as we have illustrated for the gun control

issue, as can many of the topics assessed in evaluation research. In such cases, research reports should in- clude explicit specification of how the measurement domain was “mapped out,” including subdomains. (See, e.g., APA et al., 1974; Anastasi, 1982; Cron- bath, 1971; and Fitzpatrick, 1983, concerning the pro- cedures of content validation.) Scholarly evaluation of research can then include explicit discussion of the adequacy with which the domain is specified and represented in the attitude or opinion measure. Such practices seem likely to increase research attention to differences across subdomains. That is, concern for con- tent validity should make it more likely that research- ers ask such questions as: What is the relationship between support for handgun control and support for controls on other firearms? Are male-female dif- ferences similar across these two aspects of gun con- trol? Indeed, attention to content validity may even lead researchers to ask more systematically what the appropriate boundaries are of domains (e.g., should “handgun control” and “control of other firearms” be conceptualized as subdomains of the larger domain of “gun control,” or as distinct domains). In these ways, increased attention to content validity may enhance construct validity.

REFERENCES

AMERICAN PSYCHOLOGICAL ASSOCIATION, AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, & NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION. (1974). Sron- durds for educurional and psychological tests. Washington, DC: American Psychological Association.

(Ed.), Educurronal meusuremenf (2nd ed.). New York: Holt, Rinehart & Winston.

DUNCAN, 0.. & SCHUMAN, H. (1974). Questions about attitude survey questions. In H. L. Kostner (Ed.), Sociologicalmethodology, 1973-74. San Francisco: Jossey-Bass.

ANASTASI, A. (1982). Psychologlcul resting (5th ed.). New York: Macmillan. FITZPATRICK, A. R. (1983). The meaning of content validity. Ap

plied Psychologrcul Measurement, 7, 3- 13. COOK, T. D., & CAMPBELL, D. T. (1979). Quusi- expenmentatron: Design and analysis issues for field settings. Chicago: Rand McNally.

CRONBACH, L. J. (1971). Test validation. In R. L. Thorndike

MESSICK, S. (1975). The standard problem: Meaning and values in measurement and education. American Psychologist, 30, 955-966.

ROSSI, P. H., & WRIGHT, J. D. (1985). Social science research

Page 5: Content validity in evaluation and policy-relevant research

Content Validity 91

and the politics of gun control. In R. L. Shotland & M. M. Mark (Eds.), Social sconce and social policy. Beverly Hills, CA: Sage Publications.

WRIGHT, J. D., ROSSI, P. H., & DALY, K. (1983). Under the gun: Weapons, crime and violence in America. Hawthorne, NY: Aldine.