
New Experiments on the Design of Complex Survey Questions

Paul Beatty, National Center for Health Statistics

Collaborators: Jack Fowler and Carol Cosenza, Center for Survey Research, University of Massachusetts-Boston

Optimal structure and presentation of explanatory material in survey questions

Many survey questions are complex, particularly on behavioral surveys.

This complexity is driven by:
  The desire for very specific data points
  The need to collect data as efficiently as possible (i.e., single questions if possible)

A few common practices:
  Presentation of material that follows the question mark
  The use of examples to illustrate complex concepts
  Detailed wording to capture relatively rare events

What alternatives do we have? Are they better?

Methods

Split ballot experimentation in RDD survey (n=425)
  Original questions drawn from federal health surveys; we constructed alternative questions
  Do responses differ across versions? If so, can we judge which distribution is more plausible?

Behavior coding of a random subset of tape-recorded interviews (n=313)
  How often were initial responses inadequate?
  How often did respondents interrupt the question?
  How often did the interviewer do something more than just read the question to get a response?
  How often did respondents ask for repeats, clarifications, and so on?
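The behavior-coding tallies described above can be sketched as follows. This is a minimal illustration, not the study's coding scheme: the code names, the `behavior_rates` helper, and the sample records are all hypothetical, chosen only to mirror the four questions listed on this slide.

```python
from collections import Counter

# Hypothetical code names mirroring the four behavior-coding questions above.
CODES = (
    "inadequate_initial_response",
    "interruption",
    "interviewer_intervention",
    "request_for_help",
)


def behavior_rates(coded_interviews):
    """Return, per question version, the percentage of coded
    administrations in which each behavior occurred at least once."""
    totals = Counter()
    hits = {}
    for version, behaviors in coded_interviews:
        totals[version] += 1
        counter = hits.setdefault(version, Counter())
        for code in set(behaviors):  # count each behavior once per administration
            counter[code] += 1
    return {
        version: {code: 100.0 * hits[version][code] / totals[version] for code in CODES}
        for version in totals
    }


# Made-up illustration data, not the study's records.
sample = [
    ("V1", ["interruption"]),
    ("V1", []),
    ("V1", ["inadequate_initial_response", "interviewer_intervention"]),
    ("V1", ["interruption"]),
    ("V2", []),
    ("V2", ["request_for_help"]),
    ("V2", []),
    ("V2", []),
]

rates = behavior_rates(sample)
print(rates["V1"]["interruption"])      # 50.0
print(rates["V2"]["request_for_help"])  # 25.0
```

Each rate is a share of coded administrations of one question version, which is the form in which the results slides report behavior codes.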

Issue #1: Info after the question mark

It is common for questions to apparently end but then add some more material:
  In the past 12 months, how many times have you talked to any health professional about your own health? Include in-person visits, telephone calls, or times you were a patient in a hospital.

Concern: Do respondents pay adequate attention to this material? Failure to consider it could lead to under-reports.

Alternative:
  People talk to health professionals in person, over the phone, or as a patient in a hospital. Including any of those, in the past 12 months how many times have you talked to a health professional about your own health?

Results: Experiment 1

                                       V1          V2
Qualifier placement:                   (after Q)   (beginning of Q)   Signif.
Contacts w/ health prof. in 12 months  6.6         5.9                n.s.
                                       (n=214)     (n=206)
Initial response inadequate            32.5%       25.5%              n.s.
Respondent requested help              20.0%       13.1%              p<.1
                                       (n=160)     (n=153)

Issue #2: Related experiment (definition after the question mark)

Definitions are sometimes presented after the question mark as well. For example:

V1: Have any of your immediate blood relatives ever been told by a doctor that they have diabetes? By "immediate blood relatives", we mean your parents, your children, and your brothers and sisters, whether or not they are still living.

V2: The next question is about immediate blood relatives - by that, we mean your parents, your children, and your brothers and sisters, whether or not they are still living. Have any of your immediate blood relatives ever been told by a doctor that they have diabetes?

If the definition is easier to ignore in V1, respondents might interpret “blood relatives” more broadly than intended, leading to (erroneously) higher reports in V1.

Results: Experiment 2

                               V1          V2
Definition placement:          (after Q)   (beginning of Q)   Signif.
Relative w/ diabetes           42.6%       34.4%              p<.1
                               (n=209)     (n=215)
Initial response inadequate    7.2%        2.5%               p<.1
Interrupted                    16.5%       0.6%               p<.01
Interviewer intervention       9.2%        3.1%               p<.05
                               (n=152)     (n=159)
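The slides do not say which significance test produced these p-values. As a hedged sketch of how a split-ballot difference in proportions could be checked, the code below applies a pooled two-sided z-test to two rows of the Experiment 2 table. The choice of test, the helper names, and the reconstruction of counts from rounded percentages are all assumptions made for illustration.

```python
import math


def two_proportion_z_test(hits1, n1, hits2, n2):
    """Pooled two-sided z-test for a difference between two independent proportions."""
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


def count_from_percent(percent, n):
    """Approximate the underlying count from a rounded percentage (an assumption)."""
    return round(percent / 100 * n)


# Relative with diabetes: 42.6% of 209 (V1) vs. 34.4% of 215 (V2).
z_rel, p_rel = two_proportion_z_test(
    count_from_percent(42.6, 209), 209, count_from_percent(34.4, 215), 215
)

# Respondent interrupted the question: 16.5% of 152 (V1) vs. 0.6% of 159 (V2).
z_int, p_int = two_proportion_z_test(
    count_from_percent(16.5, 152), 152, count_from_percent(0.6, 159), 159
)

print(f"relative w/ diabetes: z={z_rel:.2f}, p={p_rel:.3f}")
print(f"interrupted:          z={z_int:.2f}, p={p_int:.2g}")
```

Under these assumptions the two p-values fall in the same bands the table reports (below .1 and below .01 respectively). That is a consistency check only and does not confirm the authors' method.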

Issue #3: Administration of response categories

Conventional wisdom dictates that you administer the question before offering response categories:

V1: The last time you went to see a doctor, which of the following best describes the main reason for your visit?
  Medical treatment for a new condition
  Follow-up care for an existing condition
  Or, a routine checkup

But what if this design encourages respondents to gravitate toward the first seemingly acceptable response rather than considering the whole list?

V2: People schedule doctor visits for a variety of reasons, including getting medical treatment for a new condition, follow-up care for an existing condition, or a routine checkup. Which of those best describes the main reason for your visit the last time you went to see a doctor?

Results: Experiment 3

                               V1          V2
Response categories:           (after Q)   (before Q)   Signif.
New condition                  21.5%       23.6%        n.s.
Follow-up                      41.0%       34.6%
Routine exam                   37.4%       41.9%
                               (n=195)     (n=191)
Initial response inadequate    10.6%       23.2%        p<.01
                               (n=141)     (n=142)
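For a three-category response like this one, the question of whether the two versions' distributions differ can be framed as a chi-square test of homogeneity. The sketch below is one plausible way to run that check, not necessarily the test behind the "n.s." in the table. The counts are reconstructed from the rounded percentages and sample sizes (an assumption), and the function name is hypothetical.

```python
import math


def chi_square_2x3(row1, row2):
    """Pearson chi-square test of homogeneity for two rows of three counts (df = 2)."""
    if len(row1) != 3 or len(row2) != 3:
        raise ValueError("this sketch only handles exactly three categories per version")
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2
    stat = 0.0
    for a, b in zip(row1, row2):
        column_total = a + b
        for observed, row_total in ((a, n1), (b, n2)):
            expected = column_total * row_total / total
            stat += (observed - expected) ** 2 / expected
    # For df = 2 the chi-square survival function has the closed form exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Counts reconstructed from the table: new condition, follow-up, routine exam.
v1_counts = [42, 80, 73]  # 21.5%, 41.0%, 37.4% of n=195
v2_counts = [45, 66, 80]  # 23.6%, 34.6%, 41.9% of n=191

stat, p_value = chi_square_2x3(v1_counts, v2_counts)
print(f"chi-square={stat:.2f}, p={p_value:.2f}")
```

With these reconstructed counts the p-value is well above .1, which agrees with the non-significant result reported for the response distribution.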

Issue #4: Examples vs. definitions to illustrate complex concepts

Complex concepts such as “strenuous activity” are often illustrated through examples:
  The next question is about strenuous tasks done around your home. By "strenuous tasks," we mean things like shoveling soil in a garden, chopping wood, major carpentry projects, cleaning the garage, scrubbing floors, or moving furniture. In the past 30 days, on how many days did you do strenuous tasks in or around your home?

Although examples are designed to express a range of possibilities, we hypothesize that they have the opposite effect, focusing attention on a few specifics that might not be well chosen.

We expect that a good definition will create higher reports and be easier to administer.

However, previous attempts were not successful, presumably because our definition was too complex.

Examples vs. definitions

V1: The next question is about strenuous tasks done around your home. By "strenuous tasks," we mean things like shoveling soil in a garden, chopping wood, major carpentry projects, cleaning the garage, scrubbing floors, or moving furniture. In the past 30 days, on how many days did you do strenuous tasks in or around your home?

V2: The next question is about strenuous tasks done around your home. By "strenuous tasks", we mean any chores or projects that made you feel very tired by the time you finished them. In the past 30 days, on how many days did you do strenuous tasks in or around your home?

Results: Experiment 4

                               V1          V2
                               (example)   (def)     Signif.
Strenuous activity/mo.         4.9         3.9       n.s.
Reported “zero times”          29.3%       37.7%     p<.1
                               (n=208)     (n=215)
Initial response inadequate    27.0%       25.1%     n.s.
                               (n=153)     (n=159)

Issue #5: Question wording to capture rare events

One reason questions are very complex is that their authors want to prompt respondents to think of a broadly inclusive range of situations:
  In the past 12 months, how many times have you seen or talked on the telephone about your physical or mental health with a family doctor or general practitioner?

The practice has a downside: respondents may lose sight of the forest for the trees.

Cognitive interview evaluation of the question above suggested that respondents thought it was exclusively about telephone contact with doctors.

If true, the question would generate significant undercounts.

A simplified comparison

“The next question is specifically about primary care doctors….”

V1: In the past 12 months, how many times have you seen or talked on the telephone with a primary care doctor about your health?

V2: In the past 12 months, how many times have you seen or talked with a primary care doctor about your health?

The only difference between these two questions is the inclusion of “on the telephone.”

Results: Experiment 5

                               V1            V2
                               (telephone)   (no phone)   Signif.
Mean contacts                  3.4           3.6          n.s.
“Zero” responses               24.7%         9.2%         p<.01
                               (n=194)       (n=195)
Initial response inadequate    14.9%         21.1%        n.s.
Respondent requested help      5.7%          11.3%        p<.1
                               (n=120)       (n=121)

Issue #6: Question decomposition

Food consumption example:
  “During the last 30 days, how many times did you eat cheese, including cheese as snacks, and cheese in sandwiches, burgers, lasagna, pizza, or casseroles? Do NOT count cream cheese.”

Clearly a challenging response task in general; we had little confidence in the accuracy of reports.

Cognitive testing: when probed about details (“Did you include cheese in other dishes/sandwiches/etc.?” and, if no, “Would that have changed your overall answer?”), some participants increased their reports.

Question decomposition (2)

Alternative: multiple response tasks, divided into reasonable sub-components:

The next questions are about cheese you have eaten in the last 30 days. Please do NOT include any cream cheese you may have eaten.

During the last 30 days, how many times have you eaten cheese on a sandwich, including burgers?

During the last 30 days, how many times have you eaten cheese in lasagna, pizza, casseroles, or mixed in with other dishes?

During the last 30 days, how many times have you eaten cheese as a snack or appetizer?
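To compare the decomposed series with the single-item version, the three answers have to be combined into one count per respondent. The slides do not state how this was done; the sketch below assumes a simple sum. The class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheeseResponses:
    """One respondent's answers to the three decomposed questions (hypothetical layout)."""
    sandwich: int      # cheese on a sandwich, including burgers
    mixed_dishes: int  # cheese in lasagna, pizza, casseroles, or other dishes
    snack: int         # cheese as a snack or appetizer

    @property
    def total(self) -> int:
        # Assumption: the overall count is the plain sum of the three parts.
        return self.sandwich + self.mixed_dishes + self.snack


respondent = CheeseResponses(sandwich=6, mixed_dishes=8, snack=5)  # made-up values
print(respondent.total)  # 19
```

A plain sum treats the three categories as non-overlapping, which the question wording appears to intend but cannot guarantee.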

Results: Experiment 6 (responses)

               V1         V2
               (single)   (multi)   Signif.
Mean times     13.9       19.0      p<.01
               (n=218)    (n=228)

Results: Experiment 6 (behavior coding)

The individual “decomposed” questions consistently outperform the single item on virtually all measures.

                               Orig    Alt1    Alt2    Alt3
Inadequate initial response    15.9    9.9     8.3     3.1
Probes used                    13.7    7.8     6.3     2.1
Requested help/repeat          19.1    15.1    3.1     2.1

(all expressed as %; most significant at p<.05)

Some other considerations

Mean time to administer the original was 28 seconds; mean for the alternative was 51 seconds.

If we compare the amount of probing, inadequate responses, etc. needed to reach our desired data points (i.e., through three questions), the behavior-coding rates become very similar.
  For example: 13.5% of original questions were probed; 15.1% of the alternative series was ever probed.

Some research suggests that responses to decomposed questions are less accurate (but…)

Next steps: split ballot experiment on various food and exercise questions (global vs. decomposed) with diary validation
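The distinction between per-question rates and a series that was "ever probed" can be made concrete with a small sketch. The data below are made up and the function name is hypothetical. The point is only that a behavior which is rare on each of three questions can still touch a noticeably larger share of respondents once the series is treated as a single unit.

```python
def item_and_series_rates(series):
    """Given one tuple of probe flags per respondent (one flag per question),
    return (percent of individual questions probed,
            percent of respondents probed on at least one question)."""
    n_items = sum(len(flags) for flags in series)
    item_rate = 100.0 * sum(sum(flags) for flags in series) / n_items
    series_rate = 100.0 * sum(any(flags) for flags in series) / len(series)
    return item_rate, series_rate


# Made-up data: 10 respondents, 3 decomposed questions each; three different
# respondents are each probed on exactly one question.
coded = [
    (True, False, False),
    (False, True, False),
    (False, False, True),
] + [(False, False, False)] * 7

item_rate, series_rate = item_and_series_rates(coded)
print(item_rate, series_rate)  # 10.0 30.0
```

This is why the single-item figure is more fairly compared with the series-level rate than with any one decomposed question's rate.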

Conclusions

Qualifiers and definitions that dangle after the question mark should be avoided, provided there is a reasonable way to do so.

Conventional wisdom about response categories after the question seems to stand.

In spite of our reservations about examples, we have failed to find evidence that they limit frame of reference. They don’t perform wonderfully, but alternatives don’t do better.

Details in questions have the potential to distract respondents from overall meaning. Additional words may help a few respondents, but simpler wording may have a more profound impact.

Conclusions (2)

Experiments presented here involve single, interviewer-administered questions.

Complexity can often be reduced by asking multiple, smaller questions.

However, the pressure to ask fewer questions is real. Hopefully these results provide some guidance on how to structure questions given such constraints.
