
Short is better: Evaluating the quality of answers in online surveys through screener questions

Riccardo Ladini, University of Trento ([email protected])

Moreno Mancosu, Collegio Carlo Alberto ([email protected])

Cristiano Vezzoni, University of Trento ([email protected])

Paper prepared for the ECPR General Conference, Prague, September 7-10, 2016

Section: Experimental Designs in Political Science

Panel: Survey Experiments I: Application of the Design

Draft August 26, 2016. Please do not quote without permission of the authors.

Abstract

The article shows that a large proportion of respondents (≥ 50%) in an online survey does not pay adequate attention to the wording of questions, especially if the text is long and complex. The result is obtained through an experiment on screener questions, which are specifically designed to evaluate respondents' attention. To pass a screener question successfully, a respondent must follow instructions embedded in a text that contains a varying amount of misleading information. Screeners turn out to be effective tools for evaluating the quality of respondents' answers, especially if the cognitive load required to pass the test is low. The analyses are based on ITANES (Italian National Election Study) data collected in 2015, with n = 3,000.

Keywords: Screener, cognitive strain, survey experiment, online surveys


Introduction

Answering a survey requires an effort that respondents are not always prepared to sustain (Krosnick, 1991; Lenzner, Kaczmirek and Lenzner, 2010). Several studies have found that respondents do not always pay attention when answering survey questions. Often they do not choose the option that best represents their position, but rather adopt strategies that reduce the cognitive load required to answer accurately and sincerely. For instance, a respondent can select the first option that satisfies him/her, or even a random answer category (Krosnick, 1991). These strategies produce measurement errors that can be attributed to the respondent (Groves, 1989; Corbetta, 2015: 62). This type of error is likely to become more relevant as the researcher's control over the interview setting decreases, as in the case of online surveys performed in CAWI mode (Computer Assisted Web Interviewing).

In order to estimate the extent of these errors in online surveys, Instructional Manipulation Checks (IMC, Oppenheimer, Meyvis and Davidenko, 2009), also known as screeners1 (Berinsky, Margolis and Sances, 2014), have been proposed. This survey tool, widespread in both psychology and political science2, can be seen as a test aimed at distinguishing between "quality" respondents and respondents who do not pay enough attention to survey questions (Meade and Craig, 2012). In addition, the tool can activate respondents' attention during an online survey (Oppenheimer et al., 2009).

1 Henceforth, we will use the label screener to refer to this kind of question.

2 Berinsky and colleagues (2014) identify about 40 studies employing screeners between January 2006 and July 2013.

Despite the growing diffusion of online surveys in social research (Callegaro, Lozar Manfreda and Vehovar, 2015) and the increasing relevance of issues related to CAWI data quality, methodological works focused on the nature and employment of screeners are usually based on ad hoc experimental studies carried out on small convenience samples (Oppenheimer et al., 2009; Hauser and Schwarz, 2015).

The study presented here aims to enhance knowledge on the empirical working of screeners in realistic survey conditions, by employing data from an online survey carried out on a large sample of Italian respondents (n = 3,000). The paper attempts to answer two main research questions. First, we examine the relation between the difficulty of a screener and its ability to identify "quality" respondents. Second, we test whether a screener activates respondents' attention. Both questions are tackled by means of an experimental design in which the position and the wording of the screener are manipulated.

Results could be of particular interest for researchers who rely on online surveys to collect data and employ non-conventional questions in CAWI surveys, such as vignettes (Atzmüller and Steiner, 2010; Mutz, 2011: 54-67). These questions often involve long and complex texts that require careful and accurate reading. Can we expect respondents to read such questions carefully in a CAWI survey? This is, in essence, the question that this paper addresses.

Screener and respondents’ quality in online surveys

A screener is a multiple-response question in an online/self-administered survey. Unlike a common behavioral or attitudinal question, a screener does not ask the respondent to express an opinion or report a behavior; rather, (s)he is requested to complete a task by following specific instructions hidden in the wording of the question. Performing the task is made difficult by the introduction of misleading information in the text. Generally speaking, a screener can be split into four sections, as shown in figure 1. The first section is an introduction that provides information connected to the topic of the survey, but not relevant to the accomplishment of the task. The second part provides the instructions that the respondent must follow to carry out the task and pass the screener. The task requires selecting a combination of inconsistent answer categories that would hardly be chosen by someone who has not read the instructions. The third part is the trap-question, which is aimed at diverting the respondent from passing the screener.3 The question, indeed, is semantically connected to the available answer options, but the accomplishment of the screener is independent of the content of the question. Finally, the fourth part consists of the answer categories.

Figure 1: Base screener wording employed in the experiment (Source: CAWI ITANES - 2015)

Part 1. Introduction: Previous research shows that the large majority of people who gather information on-line prefer a site or portal that they perceive as more trustworthy than others.

Part 2. Task: In this case, however, we are interested to know whether people take the time they need to follow carefully instructions in interviews. To show that you've read this much, please ignore the question and select only the options "Local newspaper websites" and "None of these websites" as your two answers, regardless of the websites you actually visit.

Part 3. Trap question: When you hear some breaking news, which is the news website you would visit more frequently? (Maximum three answers)

Part 4. Answer categories:
□ La Repubblica       □ Il Giornale                □ La Stampa
□ Corriere            □ Dagospia                   □ Press association websites
□ Huffington Post     □ Il Fatto Quotidiano        □ Other
□ Libero quotidiano   □ Local newspaper websites   □ None of these websites

A respondent passes the screener if (s)he follows the instructions described in the second part; otherwise (s)he fails the screener, even if his/her answer is coherent with the trap-question. In our case, in order to pass the screener the respondent must tick both "Local newspaper websites" and "None of these websites", a clearly contradictory combination from a semantic point of view. To do so, the respondent must read the instructions of part 2 carefully. People who limit their reading to the introduction, to the trap question or even to the answer categories fail the test, being deceived by the apparent semantic consistency of the answer. The cognitive strain required to pass the test varies according to the amount of misleading text that must be read to correctly identify the instructions, as well as to the presence/absence of the trap-question.

3 For this reason, in the literature, the whole question is usually called a trap question (Hauser and Schwarz, 2015).

The main aim of the screener is to distinguish workers, who read the questions carefully and are attentive in answering the survey, from shirkers (Berinsky et al., 2014), namely subjects who do not pay enough attention and answer the questions of a survey superficially4. The assumption here is that people who pay the necessary attention to screeners are those who, in general, pay more attention to every question in a survey and answer consistently. According to this assumption, people who answer the screeners correctly can be defined as "quality respondents". For this reason, several authors suggest that screeners should be included in online surveys in order to allow researchers to exclude inattentive respondents ex post (Goodman, Cryder and Cheema, 2013; Oppenheimer et al., 2009) or, at least, to stratify results by levels of respondents' attentiveness (Berinsky et al., 2014).5

Despite the relevance of the topic, knowledge on the empirical working of screeners is still embryonic, and the potential of the instrument to improve the quality of survey responses is far from clear. Many of the screeners proposed in the literature present a particularly complex wording and lengths that exceed the average length of a survey question several times over. This choice largely reflects the aim for which screeners have been proposed, that is, distinguishing "quality" respondents from inattentive and shallow ones. In some situations, however, choosing screeners characterized by an excessive cognitive load can turn out to be a problem.

4 In the literature, this tendency is defined as satisficing (Krosnick, 1991; Oppenheimer et al., 2009).

5 Stratification is a less onerous procedure than filtering out subjects who do not pass the screener. Excluding these subjects could indeed lead to external validity issues, unbalancing the composition of the sample with respect to several individual characteristics, such as socio-demographic ones, which are associated with passing the test.

If, on the one hand, a difficult screener can identify good (that is, more attentive) respondents with greater certainty, on the other hand it risks classifying as shirkers respondents who are not actually so inattentive to the survey questions (and thus produce good quality answers), but simply fail the screener because the task is too difficult6.

This aspect has received little consideration in the literature. To understand the problem, it is worth remembering that a screener, like every question in a survey, requires the respondent to apply a certain amount of cognitive work in order to be correctly understood. The workload varies according to the question's wording (Kahneman, 1973; Kool, McGuire, Rosen and Botvinick, 2010). Building on psycholinguistic findings (Tourangeau, Rips and Rasinski, 2000), Lenzner et al. (2010) show that the length and the syntactic complexity of a question - as well as the overload of respondents' working memory - can hinder the understanding of the question and have a substantial impact on data quality (see also Christian, Dillman and Smyth, 2007). In the case of a screener, it is thus possible to hypothesize that an increase in the complexity and length of the question wording will lead to a higher cognitive load to complete the task.

Starting from these considerations, the first aim of the paper is to explore, by means of a survey experiment that manipulates the wording of the screener, the relation between the complexity of the question wording and the likelihood of completing the task correctly. The final aim is to calibrate the question wording, in order to identify an instrument whose cognitive load optimally discriminates between workers and shirkers, according to the rationale defined above.

6 A complementary risk is also present. A screener that is too simple will not be able to draw an adequate distinction between workers and shirkers, leading to coding as attentive a number of people who are not (in other words, respondents who produce answers of poor quality). In the light of the screener examples presented in the literature, which are usually characterized by a high level of complexity, this issue seems less relevant.

Besides post-hoc discrimination between respondents according to their attentiveness, it has been suggested that screeners can be employed to increase the quality of answers. In a way, screeners are supposed to "wake up" respondents by activating their attention. The idea is that when a respondent detects a tricky question, (s)he will be more attentive in order to avoid errors in subsequent questions. Oppenheimer et al. (2009) propose inserting screeners that do not allow the survey to continue until the respondent has successfully completed the test. In this way, it is guaranteed that the respondent has correctly understood the screener wording (Guess, 2015) and understands the need to answer subsequent questions with an increased level of attentiveness (a strategy that would be particularly useful for people who start out as shirkers). The idea that screeners can be employed as activation tools also guided a recent ad hoc experiment, carried out on Amazon Mechanical Turk, in which Hauser and Schwarz (2015) show that introducing a screener before a task enhances the likelihood of passing it. The effect of exposure to a screener on the quality of answers to subsequent questions has, however, proved to be short-lived: the instrument turned out to have little effect in maintaining respondents' attentiveness when answering questions placed further away in the survey (Berinsky, Margolis and Sances, in press).

By means of an experiment that randomizes the screener's position in a survey, our study aims to verify whether screeners can act as an activation tool also in the realistic context of an online survey (Hauser and Schwarz, 2015), hypothesizing that respondents exposed to a screener will produce higher-quality answers than those who have not been exposed to it and, in particular, that this holds for those who have passed the screener successfully.


The experiment

A screener was included in a self-administered CAWI survey of 3,000 individuals. Following Berinsky and colleagues (2014, p. 740), the misleading information deals with the websites one consults after hearing breaking news. The complete version of the screener is presented in figure 1. In line with what is suggested by Oppenheimer and colleagues (2009), this topic is not at odds with the rest of the questionnaire, which is aimed at measuring Italians' political opinions and includes specific questions on media consumption. By means of a randomized procedure, the experiment manipulates the cognitive load, that is, the complexity of the task as a function of the complexity of its wording (hard, medium, easy), and its position (before or after a battery of items). The experimental design is shown in table 1.

Table 1. Factorial design and experimental group sizes

                               Position
Cognitive load    Pre (Treatment)    Post (Control)
Hard                    515                510
Medium                  502                492
Easy                    479                502
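As an illustration of the design only, the following is a minimal sketch of how the 3 × 2 random assignment could be implemented; it is not the authors' actual fielding code, and the condition labels and the assign_conditions helper are hypothetical.

```python
import random

COGNITIVE_LOADS = ["hard", "medium", "easy"]   # wording complexity factor
POSITIONS = ["pre", "post"]                    # screener before/after the battery

def assign_conditions(respondent_ids, seed=42):
    """Independently assign each respondent to one of the 3 x 2 experimental cells."""
    rng = random.Random(seed)
    return {
        rid: (rng.choice(COGNITIVE_LOADS), rng.choice(POSITIONS))
        for rid in respondent_ids
    }

# Example: assign 3,000 respondents; cell sizes come out roughly equal (~500 each).
assignment = assign_conditions(range(3000))
```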

As for the cognitive load, the three experimental conditions are presented in figure 2. The instructions to pass the task (part 2) and the answer categories (part 4 of the screener) are the same for all three groups. In the medium and easy versions the introductory misleading information (part 1) has been removed; in the easy version the trap question (part 3) has also been removed.

As for the position, the screener was randomly placed either before or after a battery of questions on attitudes toward democracy7, which is the battery employed to evaluate the quality of respondents.

Figure 2: Different experimental conditions (Source: CAWI ITANES - 2015)

Part 1. Introduction
- HARD: Previous research shows that the large majority of people who gather information on-line prefer a site or portal that they perceive as more trustworthy than others.
- MEDIUM: (none)
- EASY: (none)

Part 2. Task (common to every condition)
(In this case, however) we are (now) interested to know whether people take the time they need to follow carefully instructions in interviews.(a) To show that you've read this much, please ignore the question and select only the options "Local newspaper websites" and "None of these websites" as your two answers, no matter the websites you actually visit.

Part 3. Trap question
- HARD and MEDIUM: When you hear some breaking news, which is the news website you would visit more frequently? (Maximum three answers)
- EASY: (none)

Part 4. Answer categories (common to every condition)
□ Repubblica          □ Il Giornale                □ La Stampa
□ Corriere            □ Dagospia                   □ Press assoc. websites
□ Huffington post     □ Il Fatto Quotidiano        □ Other
□ Libero quotidiano   □ Local newspapers websites  □ None of these websites

(a) In the easy and medium versions, the first sentence of the task was slightly modified to make it coherent: "In this case however, we are interested" becomes "We are now interested".

7 Items are partly drawn from the stealth democracy battery proposed by Hibbing and Theiss-Morse (2002). The wording of the items is presented in figure 3.

Hypotheses

The first hypothesis concerns the relation between cognitive load and the likelihood of correctly accomplishing the task in the screener. As underlined above, the complexity of a screener should influence the amount of cognitive load required of the respondent and, thus, the likelihood of accomplishing the task. Starting from this consideration, it is possible to formulate the following hypothesis:

Hp1. The higher the cognitive load (in terms of complexity and length of the wording of the screener), the lower the passage rate8.

8 Following Berinsky et al. (2014), we refer to the 'passage rate' as the percentage of respondents who answer the screener question correctly.

The main aim of a screener, however, is to identify respondents who pay attention to the questions of the survey and answer more accurately. The second hypothesis thus focuses on the relation between the outcome of the screener and the quality of answers, and can be formulated as follows:

Hp2. Respondents who pass the screener produce answers to other questions of higher quality than respondents who do not pass it.

It has also been underlined that a screener can be considered an instrument to activate respondents' attention, under the hypothesis that the instrument enhances the focus of the respondent, who then answers the following questions more accurately. Therefore, we can formulate the third hypothesis as follows:

Hp3. Exposure to a screener enhances the quality of answers to the questions that follow.

Finally, once it is shown that a screener is actually able to distinguish respondents according to the quality of their answers, it is possible to ask what the right compromise is between the cognitive load imposed on the respondent and the capacity to discriminate between workers and shirkers. The literature has so far proposed very complex screeners, assuming that this solution is the one that most clearly guarantees the identification of quality respondents. We can thus formulate the last hypothesis as follows:

Hp4. The higher the cognitive load of a screener, the higher the quality of the answers of respondents who pass it.

Data and Methods

Data come from the ITANES (Italian National Election Study) online panel 2013-15. The data analyzed here concern the sixth wave of the survey, which took place about a month after the Italian regional elections of 2015, on a quota sample of 3,000 respondents. These respondents were selected randomly from a starting sample of panel participants (n = 8,723), originally drawn by means of quota sampling (with quotas on gender, age, and educational level) from a list of members of an opt-in online community of a private research institute (SWG).

The quality of respondents has been evaluated by considering their answers to a battery of items placed immediately before or after the screener. The battery concerns attitudes toward democracy and consists of six items, as shown in figure 3. The respondent is asked to indicate the degree of agreement with each item on a scale from 0 (totally disagree) to 10 (totally agree).

Reading the items, it is clear that their semantic polarity varies: three items (1, 3 and 6) express a negative attitude toward democracy, while the other three (items 2, 4 and 5) express a positive attitude. This choice is aimed at minimizing possible response-set effects.

Figure 3 – Items on attitudes toward democracy

1. Compromises in politics are really just selling out on one’s principles

2. Parties are necessary to defend special interests of groups and social classes

3. Parties criticize one another, but actually they are all the same

4. Parties guarantee that people can participate in politics in Italy

5. There cannot be democracy without parties

6. Politicians would help the country more if they stopped talking and just took action on important problems

The first measure of answer quality is defined at the individual level and considers the so-called straight-line response set, that is, the tendency to answer every item of the battery with the same answer category. In a battery whose items have inverted semantic polarity, a set of identical answers clearly shows that the respondent has not adequately considered the meaning of the questions. The measure is dichotomous: the variable is equal to 1 if the respondent answers all questions9 in the same way, and 0 otherwise.
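For illustration, here is a minimal sketch of how this dichotomous indicator could be computed, assuming the battery answers are stored as rows of a NumPy array; the function name and data layout are hypothetical, not the authors' code.

```python
import numpy as np

def straight_line_indicator(responses: np.ndarray) -> np.ndarray:
    """Return 1 for respondents who gave the same answer to every item, 0 otherwise.

    `responses` is an (n_respondents, n_items) array on the 0-10 agreement scale,
    with "Don't know" coded as a single constant value so that all-"Don't know"
    rows also count as straight-lining (as in footnote 9).
    """
    return (responses == responses[:, [0]]).all(axis=1).astype(int)

# Example: the second respondent straight-lines the battery.
answers = np.array([[0, 7, 3, 9, 5, 2],
                    [4, 4, 4, 4, 4, 4]])
print(straight_line_indicator(answers))  # -> [0 1]
```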

The second measure of quality is defined at the aggregate level through the analysis of the internal consistency (Cronbach's alpha) of the 6-item scale, with the items recoded so that they all have the same semantic polarity.10 Higher values of the coefficient indicate greater coherence among the answers to the battery: the higher the coefficient, the higher the quality of the group of respondents on which it is calculated. Results report 95% confidence intervals for alpha, obtained by means of a bootstrap procedure on the original data (Padilla, Divers and Newton, 2012).11

9 Value 1 is also attributed to respondents who always answered "Don't know".

10 The measures have been calculated using listwise deletion, which led to the deletion of 11% of the original sample.

11 In short, the bootstrap is a technique that allows one to infer the standard error and distribution of an estimate when the underlying distribution is unknown. The technique is based on a set of random resamples of the data, which yield a distribution of estimates from which the quantities of interest can be inferred. In our case, the bootstrapped confidence intervals are calculated by means of 400 resamples.
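The following is a minimal sketch of this bootstrap procedure, assuming the six recoded items form the columns of a NumPy array; it illustrates the percentile method with 400 resamples and is not the authors' actual code.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix of recoded items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def bootstrap_alpha_ci(items: np.ndarray, n_resamples: int = 400,
                       level: float = 0.95, seed: int = 0) -> tuple:
    """Percentile bootstrap confidence interval for alpha (resampling respondents)."""
    rng = np.random.default_rng(seed)
    n = items.shape[0]
    alphas = np.array([
        cronbach_alpha(items[rng.integers(0, n, size=n)])
        for _ in range(n_resamples)
    ])
    lower = np.percentile(alphas, 100 * (1 - level) / 2)
    upper = np.percentile(alphas, 100 * (1 + level) / 2)
    return lower, upper
```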

Results

The first analysis focuses on the passage rates of the screeners. Table 2 shows the percentage of respondents who pass the screener in each experimental group. The data clearly show that the task becomes harder to accomplish as the screener becomes more difficult (and, thus, the cognitive load of the question increases). In the easy version the screener is passed by 50% of subjects, while only 22% pass the screener in its most difficult form. The first hypothesis is thus confirmed. This latter result is particularly relevant for our aims, since the most complicated screener is the one closest to those proposed in the literature. If the aim of the instrument is to distinguish between workers and shirkers, it is clear that such a level of complexity makes the screener extremely selective, with only a small share of the sample (1 out of 5) able to pass the test. In addition, two further elements are of particular interest:

- Overall, few respondents pass the screener, about 1 out of 3 (35%);

- Passage rates are not influenced by the position of the screener. This assures us that the stealth democracy battery does not influence the task described in the screener.


Table 2. Passage rate (%) by cognitive load and position

                          Position
Cognitive load       Pre (N=1496)   Post (N=1504)   Passage rate by cognitive load      N
Hard                      21              22                      22                  1025
Medium                    36              33                      35                   994
Easy                      48              51                      50                   981
Passage rate
by position               35              35                      35                  3000

As for the ability of the instrument to distinguish between workers and shirkers, table 3 clearly shows that respondents who passed the screener give higher quality answers. In this group, the straight-line response set is practically absent, while it involves 13% of respondents who did not pass the screener (the difference is significant, p < .001). The same picture emerges from the values of Cronbach's alpha in the two groups. For people who passed the screener, the lower bound of the confidence interval is above .70, a value considered the minimum for research employing non-validated scales (Peterson, 1994). For the other respondents, identified by the screener as not attentive in responding to questions, the confidence interval lies entirely below this threshold. We can thus confirm our second hypothesis.

Table 3: Quality of the answers to the battery on attitudes toward democracy, by screener outcome

Outcome     % Response set(a)      N       Alpha 95% C.I.     N(b)
Positive          1.3            1054        .72 - .77         978
Negative         13.2            1946        .57 - .64        1683

Notes: (a) Chi2(1) = 116.8; p < .01. (b) Listwise deletion.
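As an illustration of the kind of test reported in note (a), here is a minimal sketch of a chi-square test of independence between straight-lining and screener outcome, using scipy; the function name and input arrays are hypothetical and the sketch is not meant to reproduce the paper's exact figures.

```python
import numpy as np
from scipy.stats import chi2_contingency

def response_set_vs_outcome(straight_line: np.ndarray, passed: np.ndarray):
    """Chi-square test of independence between straight-lining (0/1) and
    screener outcome (0 = failed, 1 = passed), as in the note to Table 3."""
    table = np.zeros((2, 2), dtype=int)
    for rs, ok in zip(straight_line, passed):
        table[rs, ok] += 1          # build the 2x2 contingency table
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p, dof
```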


The results of the experimental manipulation of the screener's position lead us to reject our third hypothesis, since no activation effect is detected. Table 4 clearly shows that the quality of the answers on democracy attitudes, both in terms of response set and of alpha, does not vary between the two groups defined by the position of the screener. This outcome is confirmed when analyzing only those respondents who passed the screener and therefore noticed the presence of the trap-question.

Table 4: Quality of the answers to the battery on attitudes toward democracy, by screener position (whole sample and positive-outcome respondents)

                       % Response set                                Alpha 95% C.I.
Position    Whole sample(a)   Only positive outcome(b)    Whole sample    Only positive outcome
Pre              9.2                   1.7                  .63 - .69          .69 - .77
Post             8.8                   0.9                  .63 - .69          .72 - .79
N               3000                  1054                  2661(c)            978(c)

Notes: (a) Chi2(1) = 0.2; p = .67. (b) Chi2(1) = 1.2; p = .27. (c) Listwise deletion.

Finally, we evaluate the relation between cognitive load and the quality of the respondents who pass the test. Our expectation - inferred from the research practice that favors screeners several times longer than usual survey questions - is that the harder the trial, the higher the quality of the respondents who pass it. The data obtained by manipulating the cognitive load do not point in this direction and lead us to reject this hypothesis as well. In terms of response set, the quality of respondents who pass the screener is the same, independently of the cognitive load (table 5, first column). Therefore, by employing overly complex screeners we obtain a sub-optimal result: "good" respondents fail the test and the number of people identified as quality respondents shrinks considerably. This confirms the result shown in Table 2, with passage rates dropping from 50% to 22% when moving from the easy to the difficult screener.12

Table 5: Straight-line response set (%) to the battery on attitudes toward democracy, by screener cognitive load and outcome

                        Positive outcome                Negative outcome
Cognitive load     % Response set(a)     N          % Response set(b)     N
Hard                     1.4            221               11.3           804
Medium                   0.6            344               11.7           650
Easy                     1.8            489               18.1           492

Notes: (a) Chi2(2) = 2.4; p = .30. (b) Chi2(2) = 14.1; p < .01.

With respect to the fourth hypothesis, similar results emerge from the evaluation of Cronbach's alpha. The analysis of the confidence intervals, presented graphically in figure 4, confirms the absence of a relation between cognitive load and quality of the answers once respondents are distinguished by the outcome of the task. This last result is particularly relevant when it comes to defining and calibrating the instrument, since increasing the cognitive load does not seem to bring advantages in terms of the capacity to discriminate the quality of respondents. On the contrary, it seems to excessively reduce the number of respondents identified as quality respondents, since fewer of them manage to pass the test.

12 As for the response set, relatively easy screeners have the appealing property of better selecting shirkers, as can be seen in Table 5: 18% of those who fail the easy screener answered the battery questions with a straight-line response set, while this happens for only about 12% of those who did not pass the medium and difficult tasks. This relation, however, is not confirmed by Cronbach's alpha.

Figure 4. 95% confidence intervals (bootstrap) of Cronbach's alpha for the battery on attitudes toward democracy, by screener cognitive load and outcome.

Conclusion and discussion

Nowadays, online surveys are one of the main instruments for collecting data in social and political research (Callegaro et al., 2015: 4), as they allow a large number of interviews to be conducted quickly and cheaply (Loosveldt and Sonck, 2008: 96). This advantage has a cost: the absence of an interviewer can reduce the quality of the data collected. For this reason, the introduction of survey instruments aimed at checking respondents' attention while answering a survey has been suggested. These instruments, known as screeners, are tests in which the respondent must complete a task by following instructions hidden inside the wording of a question, which contains a variable amount of misleading information. The screener is also known as a "trap question", because the misleading information is aimed at diverting the respondent from his/her task.

The employment of screeners is steadily growing (Berinsky et al., 2014), but knowledge on their empirical working is still limited and mainly confined to ad-hoc experimental studies. The experiment proposed in this article aims at investigating, in the context of a real survey, how a screener actually works, and at calibrating the cognitive load of the task according to its capacity to identify quality respondents. The experimental design therefore randomizes two factors: the cognitive load and the position of the screener.

The first result of the study is that only a limited number of respondents seem to read the wording of the question carefully. Generally speaking, less than half of the sample successfully passes the screener. The rate is 50% for the easiest wording and drops to 22% when the wording is more complex and the trap more insidious. This result invites reflection on the quality of the answers that people produce in online surveys, particularly when they face questions more complex than the rest of the survey. The result clearly sounds an alarm bell for the quality of non-conventional questions - such as, for instance, vignette studies, which on average present long and complex wordings that vary systematically. Answers to these questions can be affected by large errors due to respondents' low attention and could be prone to large biases.

Our findings are even more significant if we consider that passing the task is associated with the quality of the answers that respondents give to other survey questions. This means that the screener is actually able to distinguish between workers and shirkers and that it can thus be employed if we want to retain only the better respondents in the analysis. Nevertheless, this choice has a substantial cost, since excluding people who do not pass the screener means erasing large portions of the sample, with a considerable drop in the statistical power of the study (in addition to discarding a large amount of information). When employing a screener, a good calibration of the cognitive load is thus important in order to avoid excessively selective choices with respect to respondents' quality.

Concerning this aspect, the experiment presented in this article showed that the quality of the answers of people who correctly passed the easy screener is substantially identical to that of those who passed the medium and the most complex ones. In order to differentiate among respondents, it seems sufficient to introduce tasks characterized by a limited cognitive load, with the advantage of identifying a larger number of quality respondents. This result is inconsistent with general research practice, in which the suggested screeners are usually very complex and require more cognitive strain than the other questions in the survey. Our article thus recommends rethinking the calibration and empirical working of screeners, which, in their common format, are not able to do their job efficiently. The employment of brief and relatively simple screeners, less invasive and more easily applicable in different contexts, is therefore recommended. Starting from this introductory work, further research on this topic should test the calibration of the instrument further, in order to arrive at an optimal wording able to distinguish efficiently among respondents.

Finally, the results of the manipulation of the screener's position show that, in the context of a real survey, the instrument does not act as an activation tool for respondents. Indeed, exposure to the screener does not affect the quality of the answers to the questions immediately following it. This result is not in line with previous studies, which argued that screeners can act as activation tools. This could be due to the different conditions in which the experiments were carried out. In particular, Hauser and Schwarz (2015) refer to an ad-hoc experimental study with a relatively small sample (n < 400), paid respondents and a brief questionnaire, while our study is carried out on a large sample in real survey conditions. However, the question remains open and further work is necessary to confirm the findings of our experiment.

References

ATZMÜLLER, C., STEINER, P. M. (2010). Experimental vignette studies in survey research. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6(3), 128-138.

BERINSKY, A. J., MARGOLIS, M. F., SANCES, M. W. (2014). Separating the shirkers from the workers? Making sure respondents pay attention on self-administered surveys. American Journal of Political Science, 58(3), 739-753.

BERINSKY, A. J., MARGOLIS, M. F., SANCES, M. W. (in press). Can we turn shirkers into workers? Journal of Experimental Social Psychology.

CALLEGARO, M., LOZAR MANFREDA, K., VEHOVAR, V. (2015). Web survey methodology. London: Sage.

CHRISTIAN, L. M., DILLMAN, D. A., SMYTH, J. D. (2007). Helping respondents get it right the first time: the influence of words, symbols, and graphics in web surveys. Public Opinion Quarterly, 71(1), 113-125.

CORBETTA, P. (2015). La ricerca sociale: metodologia e tecniche. Vol. 2: Le tecniche quantitative. Seconda edizione. Bologna: il Mulino.

GOODMAN, J. K., CRYDER, C. E., CHEEMA, A. (2013). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26(3), 213-224.

GROVES, R. M. (1989). Survey errors and survey costs. New York: Wiley-Interscience.

GUESS, A. M. (2015). Measure for measure: An experimental test of online political media exposure. Political Analysis, 23(1), 59-75.

HAUSER, D. J., SCHWARZ, N. (2015). It's a trap!: Instructional manipulation checks prompt systematic thinking on "tricky" tasks. SAGE Open, 5, 1-6.

HIBBING, J. R., THEISS-MORSE, E. (2002). Stealth democracy: Americans' beliefs about how government should work. Cambridge: Cambridge University Press.

KAHNEMAN, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.

KOOL, W., MCGUIRE, J. T., ROSEN, Z. B., BOTVINICK, M. M. (2010). Decision making and the avoidance of cognitive demand. Journal of Experimental Psychology: General, 139(4), 665-682.

KROSNICK, J. A. (1991). Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied Cognitive Psychology, 5(3), 213-236.

LENZNER, T., KACZMIREK, L., LENZNER, A. (2010). Cognitive burden of survey questions and response times: A psycholinguistic experiment. Applied Cognitive Psychology, 24(7), 1003-1020.

LOOSVELDT, G., SONCK, N. (2008). An evaluation of the weighting procedures for an online access panel survey. Survey Research Methods, 2(2), 93-105.

MEADE, A. W., CRAIG, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437-455.

MUTZ, D. C. (2011). Population-based survey experiments. Princeton: Princeton University Press.

OPPENHEIMER, D. M., MEYVIS, T., DAVIDENKO, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45(4), 867-872.

PADILLA, M. A., DIVERS, J., NEWTON, M. (2012). Coefficient alpha bootstrap confidence interval under nonnormality. Applied Psychological Measurement, 36(5), 331-348.

PETERSON, R. A. (1994). A meta-analysis of Cronbach's coefficient alpha. Journal of Consumer Research, 21(2), 381-391.

TOURANGEAU, R., RIPS, L. J., RASINSKI, K. (2000). The psychology of survey response. Cambridge: Cambridge University Press.