
Page 1: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Ordinary Search Engine Users Assessing Difficulty, Effort, and Outcome for Simple and Complex Search Tasks

Georg Singer, Ulrich Norbisrath (Institute of Computer Science, University of Tartu, Estonia)
Dirk Lewandowski (Department of Information, Hamburg University of Applied Sciences, Germany)

iiiX 2012, Nijmegen, 23 August 2012

Page 2: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Agenda

1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion

Page 3: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Agenda

1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion

Page 4: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Introduction

•  People use search engines for all kinds of tasks, from simply looking up trivia to planning their holiday trips. More complex tasks can take much longer than expected.

•  Disparity between expected and real effort for such tasks

•  Definition of a complex search task (Singer, Norbisrath & Danilov, 2012):
   –  Requires at least one of the elements aggregation (finding several documents for a known aspect), discovery (detecting new aspects), and synthesis (synthesizing the found information into a single document).
   –  Complex tasks typically require going through those steps multiple times.

Page 5: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Motivation

•  We observed that users are on the one hand highly satisfied with search engines, but on the other hand, everybody knows many cases where “the search engine failed”.

•  Users’ dissatisfaction in terms of unsuccessful queries might partly be caused by users not being able to judge the task effort properly and therefore their experience not being in line with their expectations.

•  There is little research, based on a reasonably large and diverse sample, on users carrying out complex search tasks and their ability to judge task effort.

Page 6: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Motivation

•  Conduct a study with a reasonable sample of ordinary Web search engine users

•  Collect self-reported measures of task difficulty, task effort and task outcome, before and after working on the task.

Page 7: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Agenda

1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion

Page 8: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Definitions

§  A search task is complex if it requires at least one of the elements aggregation, discovery and synthesis.

§  A search task is difficult if a lot of cognitive input is needed to carry out the task.

§  A search task requires increased effort if the user needs either

   §  more cognitive effort to understand the task and formulate queries (time effort), or

   §  more mechanical effort (number of queries, number of pages visited, browser tabs opened and closed); see the sketch below.
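A minimal sketch of how such mechanical-effort counts could be derived from a logged interaction stream; the event names and tuple format are illustrative assumptions, not the actual Search-Logger schema.

```python
from collections import Counter

def mechanical_effort(events):
    """Count mechanical-effort indicators for one task.

    `events` is assumed to be a list of (event_type, payload) tuples,
    e.g. ("query", "verdi operas") or ("page_visit", "http://...").
    This format is illustrative, not the Search-Logger schema.
    """
    counts = Counter(event_type for event_type, _ in events)
    return {
        "queries": counts["query"],
        "pages_visited": counts["page_visit"],
        "tabs_opened": counts["tab_open"],
        "tabs_closed": counts["tab_close"],
    }

# Toy interaction log for a single task
log = [
    ("query", "magic flute composer"),
    ("page_visit", "https://example.org/mozart"),
    ("tab_open", None),
    ("query", "mozart year of birth"),
    ("tab_close", None),
]
print(mechanical_effort(log))
# {'queries': 2, 'pages_visited': 1, 'tabs_opened': 1, 'tabs_closed': 1}
```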

Page 9: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Agenda

1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion

Page 10: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Research questions

RQ1: Can users assess difficulty, effort and task outcome for simple search tasks?

RQ2: Can users assess difficulty, effort and task outcome for complex search tasks?

RQ3: Are there significant performance differences between assessing simple and complex search tasks?

RQ4: Does the users’ ability to judge if the information they have found is correct or not depend on task complexity?

RQ5: Is there a correlation between the overall search performance (ranking in the experiment) and the ability to assess difficulty, time effort, query effort, and task outcome for complex tasks?

RQ6: Does the judging performance depend on task complexity or simply the individual user?

Page 11: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Agenda

1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion

Page 12: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Methods

•  Laboratory study with 60 participants, conducted in August 2011 in Hamburg, Germany

•  Each user was given 12 tasks (6 simple, 6 complex)

•  Browser interactions were collected using the Search Logger plugin for Firefox (http://www.search-logger.com).

•  Pre- and post-questionnaires for each task

•  Although tasks were presented in a certain order (switching between simple and complex), participants were able to switch between tasks as they liked.

•  Users were allowed to use any search engine (or other Web resource) they liked.

Page 13: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Research design

Pre-task questionnaire:
1. This task is easy
2. It will take me less than five minutes to complete the task
3. I will need fewer than five queries to complete the task
4. I will find the correct information

Work on task independently (browser interaction data is collected)

Post-task questionnaire:
1. This task was easy
2. It took me less than five minutes to complete the task
3. I needed fewer than five queries to complete the task
4. I have found the correct information

Objective results assessment:
1. Result is correct
2. Result is partly correct
3. Result is wrong

Page 14: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

User sample

•  60 users

•  Recruitment followed a demographic structure model

•  A sample of that size cannot be representative, but it is a vast improvement over the samples usually used (e.g., self-selected participants, students)

•  Data from 4 users was corrupted and therefore not analysed

... them as "complex" or "difficult", sometimes interchangeably. To help guide the reader, we give some definitions, which we will use throughout this paper.

A task is an abstract description of activities to achieve a certain goal [9, 11].

Search is the process of finding information. A search task is a piece of work concerning the retrieval of information related to an information need. The search is usually carried out with IR systems [11].

A search task is complex if it requires at least one of the elements aggregation, discovery and synthesis [15]. It typically requires reviewing many documents and synthesizing them into a desired format.

A search task is difficult if a lot of cognitive input is needed to carry out the task.

A search task requires increased effort if the user needs either more cognitive effort to understand the task and formulate queries (time effort), or more mechanical effort (number of queries, number of pages visited, browser tabs opened and closed).

4. RESEARCH QUESTIONS

In the study, task complexity is the independent variable. Search performance and searchers' assessments of task difficulty, query effort, time effort and outcome are dependent variables. To guide our research, we formulated the following research questions:

RQ1: Can users assess difficulty, effort and task outcome for simple search tasks?

RQ2: Can users assess difficulty, effort and task outcome for complex search tasks?

RQ3: Are there significant performance differences between assessing simple and complex search tasks?

RQ4: Does the users' ability to judge if the information they have found is correct or not depend on task complexity?

RQ5: Is there a correlation between the overall search performance (ranking in the experiment) and the ability to assess difficulty, time effort, query effort, and task outcome for complex tasks? Here we investigate the correlation between search performance and judging performance.

RQ6: Does the judging performance depend on task complexity or simply the individual user? This question investigates the association between judging performance, task complexity and the individual user.

5. RESEARCH METHOD

The results presented in this paper are based on a body of data gathered in the course of a larger experiment in August 2011. One additional article using distinct parts of the data and describing different aspects has been published [18], and one article has been submitted for review [16]. The following description of the research design is based on Singer et al. [16]. The experiment was conducted in August 2011 in Hamburg, Germany. Participants were invited to the university, where they were given a set of search tasks (see below) to fulfill. The study was carried out in one of the university's computer labs, where each participant had her own computer and was instructed to work on the search tasks independently. Participants were not observed directly, but their browser interactions were recorded using the Search-Logger plug-in [17]. While the tasks were presented in a certain order to the participants, they were allowed to choose the order of the tasks according to their wishes, and it was also possible to interrupt a task, work on another one, and return to it later.

We recruited a sample of 60 volunteers, using a demographic structure model. The aim was to go beyond the usual user samples consisting mainly of students, often experienced searchers from information science or computer science, and also to increase the sample size. As a user sample of the intended size could not be representative, we wanted at least to make sure that adults from various age ranges, and men and women alike, were considered. For details on the sample, see Table 1. The effective number of study participants providing data to our study was reduced to 56, as the data of 4 (2 female, 2 male) out of the 60 users was corrupt and could therefore not be used.

Age span   Female   Male   Total
18-24      5        4      9
25-34      9        7      16
35-44      7        8      15
45-54      8        8      16
55-59      3        1      4
Total      32       28     60

Table 1: Demography of user sample

The search experiment consisted of 12 search tasks. As our experiment was conducted in Germany, the language of the tasks was German. A prerequisite for all tasks was that a correct answer had to be available somewhere on public websites in German as of August 2011. The study participants had 3 hours to complete the experiment.

Simple tasks are characterized by asking the users to find simple facts. The needed information is contained in one document (web site) and can be retrieved with one single query. Complex tasks, on the other hand, are formulated in a way that gives the users enough context to comprehend the task situation, but the tasks are still characterized by uncertainty and ambiguity [10]. There is no single correct answer retrievable, and the required information is spread over various documents (web sites). Fulfilling the task typically requires issuing multiple queries, aggregating information from various sources and synthesizing the information into a single solution document [15]. The tasks were as follows ((S) marks simple, and (C) complex tasks):

1. (S) When was the composer of the piece "The Magic Flute" born?

2. (S) How hot can it be on average in July in Aachen/Germany?

3. (S) How many opera pieces did Verdi compose?

4. (S) When and by whom was penicillin discovered?

Page 15: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Tasks

•  Simple vs. complex tasks (examples)

–  (S) When was the composer of the piece “The Magic Flute” born?

–  (S) When and by whom was penicillin discovered?

–  (C) Are there any differences regarding the distribution of religious affiliations between Austria, Germany, and Switzerland? Which ones?

–  (C) There are five countries whose names are also carried by chemical elements. France has two (31. Ga – Gallium and 87. Fr – Francium), Germany has one (32. Ge – Germanium), Russia has one (44. Ru – Ruthenium) and Poland has one (84. Po – Polonium). Please name the fifth country.

Page 16: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Agenda

1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion

Page 17: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Results on RQ1 (“Can users assess difficulty, effort and task outcome for simple search tasks?”)

•  An assessment was graded “correct” when the user’s self-judged values were the same in the pre-task and the post-task questionnaire (sketched below).

•  For all measures, users were generally able to judge the simple tasks correctly (around 90%).

                               incorrect       correct
difficulty                     29 (9.8%)       266 (90.2%)
time effort                    27 (9.1%)       268 (90.8%)
query effort                   38 (12.9%)      257 (87.1%)
ability to find right result   16 (5.4%)       279 (94.6%)

Table 2: Users judging simple search tasks (# of tasks and %)

Table 2 outlines our findings on users and their ability to estimate difficulty, effort, and outcome for simple search tasks. “# of tasks” represents the number of simple tasks that were processed by the study participants. The total number of tasks (correct plus incorrect ones) should have been 56 x 6 = 336 (56 valid users x 6 tasks); it is slightly lower due to invalid or missing answers or unfulfilled tasks. “%” shows the percentage of judged tasks relative to the total valid answers. We graded an answer as correct when the user’s self-judged values were the same in the pre-task and the post-task questionnaire. For example, if they judged a task to be difficult in the pre-task questionnaire and, after carrying out the task, stated again that it was a difficult task, the judgment was graded as correct.

For all parameters (difficulty, time effort, query effort, and result-finding ability), approximately 90% of the users managed to match estimated and experienced values for simple tasks. However, users had slightly more trouble estimating the time effort needed than the query effort (in terms of estimating whether they would exceed a threshold number of queries).

RQ2: Can users assess difficulty, effort and task outcome for complex search tasks?

Table 3 outlines our findings on users and their ability to estimate difficulty, effort, and being able to find the correct result for complex search tasks. For all parameters, about 70% of all tasks were judged correctly, with slightly worse estimations for query effort and result-finding skills.

RQ3: Are there significant differences between assessing simple and complex search tasks?

To analyze whether there are significant differences between how users assess certain parameters (difficulty, time effort, query effort, search success), we compared the differences between users’ pre-task estimates and post-task experience-based values for 295 simple tasks and 286 complex tasks, as outlined in Table 4 (100% means a 100% probability of judging the difficulty correctly; pre-task estimate and post-task evaluation are fully in line). We used paired sample t-tests to compare the results from simple and complex tasks. We paired average difficulty for simple tasks with average difficulty for complex tasks, and followed the same procedure for time effort, query effort and task outcome.

                               incorrect       correct
difficulty                     95 (33.2%)      191 (66.8%)
time effort                    99 (34.6%)      187 (65.3%)
query effort                   91 (31.8%)      195 (68.2%)
ability to find right result   78 (27.2%)      208 (72.8%)

Table 3: Users judging complex search tasks (# of tasks and %)

                          difficulty (%)   time effort (%)   query effort (%)   task outcome (%)
Simple tasks (n=295)      90±2             91±2              87±2               95±1
Complex tasks (n=286)     67±3             65±3              68±3               73±3
p-value                   <0.001           <0.001            <0.001             <0.001

Table 4: Correctly judged tasks per dependent variable (mean values over tasks)

In the case of simple tasks, users are significantly better at estimating all four parameters: difficulty, time effort, query effort and search success, i.e. the difference between their pre-task estimate and their post-task experience-based value for a given parameter is significantly smaller, and they show a higher probability of correctly judging the parameter. Table 5 shows the judging performance of all users for each specific task; there too, the division in performance between simple (S) and complex (C) tasks is clearly visible.
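The ± values in Table 4 (and the later tables) are consistent with standard errors of a proportion; this is an assumption for illustration, since the slides do not state the formula. For the simple-task difficulty cell of Table 4, for example:

```latex
SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
   = \sqrt{\frac{0.90 \times 0.10}{295}} \approx 0.017 \approx 2\%
```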

RQ4: Does the users’ ability to judge if the information they have found is correct or not depend on task complexity?

We compared users’ subjective values for the question of whether they thought they could/would find the correct information (yes or no) with the objectively graded outcome (by taking the difference of the two submitted values). The objective result is a manual review of all the answers given by the participants of our study. In this evaluation we skipped the tasks where a user had not delivered any result, as we did not know the reason for not delivering (it could be not being able to, having found wrong results and not wanting to submit them, or simply forgetting to submit). We also analyzed whether the users were able to judge the correctness of their found results after having finished the search task.

Page 18: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Results on RQ 2 (“Can users assess difficulty, effort and task outcome for complex search tasks?”)

•  Ratio of correctly judged tasks drops to approx. 2/3 when considering complex tasks.


Page 19: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Results on RQ3 (“Are there significant performance differences between assessing simple and complex search tasks?”)

•  Comparison of the differences between pre-task and post-task judgments over all tasks (i.e., user-independent)

•  Paired sample t-tests to compare the results (a minimal sketch of such a test follows below)

•  For simple tasks, users are significantly better at judging all four parameters (difficulty, time effort, query effort, task outcome).
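A minimal sketch of such a paired comparison using scipy's standard paired sample t-test; pairing one average judging-accuracy value per user and condition is an assumption, and the numbers are simulated for illustration, not the study data.

```python
import numpy as np
from scipy.stats import ttest_rel

# Simulated per-user shares of correctly judged tasks (illustrative only).
rng = np.random.default_rng(0)
simple_acc = rng.normal(0.90, 0.05, size=56).clip(0, 1)
complex_acc = rng.normal(0.67, 0.10, size=56).clip(0, 1)

# Paired sample t-test: each user contributes one value per condition.
t_stat, p_value = ttest_rel(simple_acc, complex_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```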


Page 20: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Results on RQ 4 (“Does the users’ ability to judge if the information they have found is correct or not depend on task complexity?”)

•  Comparison between subjective judgments (from questionnaires) and objective result (as judged by a research assistant).

•  Users’ judging ability depends on task complexity:

–  Using the pre-task data, 87% of simple tasks are predicted correctly, whereas for complex tasks, only 52% are predicted correctly.

–  When considering post-task data, 88% of simple tasks are predicted correctly, whereas for complex tasks, this number is again much lower (60%).

Task             difficulty (%)   time effort (%)   query effort (%)   task outcome (%)
1 (S) (n=51)     92±4             90±4              88±5               98±2
2 (S) (n=48)     81±6             85±5              73±6               88±5
3 (S) (n=47)     89±5             91±4              87±5               96±3
4 (S) (n=51)     98±2             100±0             98±3               96±3
5 (S) (n=49)     84±5             82±6              82±6               94±3
6 (S) (n=49)     96±3             96±3              94±3               96±3
7 (C) (n=47)     62±7             51±7              60±7               85±5
8 (C) (n=48)     67±7             69±7              71±7               77±6
9 (C) (n=48)     60±7             81±6              73±6               81±6
10 (C) (n=46)    72±7             67±7              72±7               52±7
11 (C) (n=49)    65±7             67±7              65±7               61±7
12 (C) (n=47)    74±6             55±7              68±7               79±6

Table 5: Fraction of users correctly judging task parameters per task

These evaluations give us an estimate of how well users can judge that a result they found on the Internet is actually correct, and how well they can judge in advance whether they will be able to find the correct result.

Table 6 shows that the ability to predict whether it is possible to find the correct information is significantly higher for simple search tasks than for complex search tasks: 87% versus 52% in the case of complex tasks. We used paired sample t-tests to compare the results from simple and complex tasks, pairing average correctness for simple tasks with average correctness for complex tasks.

Task type          Correctly estimated tasks (%)
simple (n=259)     87±2
complex (n=233)    52±3
p-value            <0.001

Table 6: Judgments of expected search outcome (in pre-task questionnaire) compared to correctness of manually evaluated search results (mean values over tasks)

Table 7 shows that the ability to judge whether found information is correct or not is also significantly higher for simple tasks than for complex tasks: 88% versus 60%.

Task type          Correctly estimated tasks (%)
simple (n=259)     88±2
complex (n=230)    60±3
p-value            <0.001

Table 7: Assessments of self-judged search results (in post-task questionnaire) compared to correctness of manually evaluated search results (mean values over tasks)

The difference in the number of observations between simple and complex tasks is due to the fact that not all users always correctly entered their rating into our system; those tasks had to be omitted.

RQ5: Is there a correlation between the overall search performance (ranking in the experiment) and the ability to assess difficulty, effort and outcome for complex tasks?

As mentioned in the methods section, to understand the relation between search performance (ranking of the user in the experiment) and the ability to estimate task difficulty, effort, and outcome, we ordered the users according to their ranking in the experiment. We ranked the users first by the number of correct answers given and then, for users with the same number of correct answers, by answers with right elements (simple and complex tasks). We then compared the complex tasks of the first quartile of ranked users (n=67 tasks) with the complex tasks of the fourth quartile of ranked users (n=59 tasks), as outlined in Table 8. The number of tasks differs because not all users always correctly entered their estimates into the system; those tasks had to be omitted.

                       Avg. difficulty (%)   Avg. time effort (%)   Avg. query effort (%)   Avg. task outcome (%)
1st quartile (n=67)    67±6                  64±6                   67±6                    85±4
4th quartile (n=59)    73±6                  75±6                   73±6                    64±6
p-value                n.s.                  n.s.                   n.s.                    <0.05

Table 8: Correct estimations of best and worst quartile for expected and experienced task parameters

The results show that good searchers are not significantly better at judging difficulty and effort for complex search tasks. However, they are significantly better at judging the task outcome.

RQ6: Does the judging performance depend on task complexity or only on the individual user?

In this subsection we examine whether the judging performance depends on task complexity or simply on the individual ability to make those judgments. It could, for example, be that some users are very good at judging both simple and complex tasks while others perform badly for both task groups. An analysis that averages over all simple tasks and compares them with the average over all complex tasks (independent of the user, e.g. as done for research question 3) would not show this. Table 9 shows the results for users judging the task difficulty for simple and complex tasks at the same time. 3 out of 53 users (6%) were able to judge the difficulty totally right for both simple and complex tasks. 30 out of 53 users (56%) managed to judge the difficulty for simple tasks right while not being totally correct for the complex tasks. Three users (6%) were able to correctly judge all complex ...
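A toy sketch of this Table 9 style per-user cross-tabulation; the data, column names, and the aggregation into "all judgments of a type correct" flags are illustrative assumptions, not the study data.

```python
import pandas as pd

# Toy per-task judgments; in the study each user worked on 6 simple
# and 6 complex tasks (only a few rows are shown here).
judgments = pd.DataFrame({
    "user":      [1, 1, 1, 1, 2, 2, 2, 2],
    "task_type": ["simple", "simple", "complex", "complex"] * 2,
    "correct":   [True, True, True, False, True, False, True, True],
})

# Per user and task type: were *all* judgments of that type correct?
per_user = judgments.groupby(["user", "task_type"])["correct"].all().unstack()

# Cross-tabulate the two flags, in the style of Table 9.
print(pd.crosstab(per_user["simple"], per_user["complex"]))
```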


Page 21: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Results on RQ5 (“Is there a correlation between the overall search performance (ranking in the experiment) and the ability to assess difficulty, time effort, query effort, and task outcome for complex tasks?”)

•  Comparison of the top performing users and the worst performing users (1st quartile vs. 4th quartile); a minimal sketch of the ranking and split follows below.

•  Good searchers are not significantly better at judging difficulty and effort for complex tasks, but they are significantly better at judging the task outcome.
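A minimal sketch of this ranking and quartile split; the per-user values are simulated stand-ins for the experiment data, and the slide's significance test for Table 8 is not reproduced here.

```python
import pandas as pd

# Simulated per-user summary: number of correct answers in the experiment
# and share of correctly judged complex-task parameters (illustrative only).
users = pd.DataFrame({
    "user":            range(1, 13),
    "correct_answers": [11, 10, 10, 9, 8, 8, 7, 6, 5, 5, 4, 3],
    "complex_judging": [0.7, 0.6, 0.8, 0.7, 0.6, 0.7, 0.6, 0.7, 0.8, 0.7, 0.8, 0.7],
})

# Rank users by search performance and split the ranking into quartiles.
users = users.sort_values("correct_answers", ascending=False).reset_index(drop=True)
users["quartile"] = pd.qcut(users.index, 4, labels=[1, 2, 3, 4])

# Compare judging performance of the best (1st) and worst (4th) quartile.
best = users.loc[users["quartile"] == 1, "complex_judging"]
worst = users.loc[users["quartile"] == 4, "complex_judging"]
print(best.mean(), worst.mean())
```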


Page 22: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Results on RQ 6 (“Does the judging performance depend on task complexity or simply the individual user?”)

•  There are some users who are able to correctly judge all task parameters for both simple and complex tasks, but these are only a few.

•  The majority of users get only some parameters right (for simple as well as complex tasks).

•  These findings hold true for all task parameters considered.

•  Judging performance is much more dependent on task complexity than on the individual user.

Page 23: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Agenda

1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion

Page 24: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Discussion

•  No problem with assessing the difficulty of simple tasks.

–  The reason might be that all users have sufficient experience with such tasks.

•  Two thirds are able to sufficiently judge the subjective difficulty of complex tasks.

–  However, large gap between self-judged success and objective results!

•  Only 47% of submitted results for complex tasks were completely correct.

–  The problem with complex tasks might not be that users find no results, but that the results they find are only seemingly correct. This may, to a certain extent, explain users’ satisfaction with search engine results.

→ More research is needed on cases where users only think they have found correct results.

Page 25: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Conclusion and limitations

•  Users tend to overestimate their own search capabilities in the case of complex tasks. Search engines should offer more support for these tasks.

•  While our experience with recruiting via a demographic structure model is good, it led to some problems in this study:

–  Users’ strategies vary greatly, which resulted in high standard errors for some indicators.

–  A solution would be either to increase sample size, or to focus on more specific user groups.

–  However, using specific user groups (e.g., students) might produce significant results, but they will not hold true for other populations. After all, search engines are used by “everyone”, and research in that area should aim at producing results relevant to the whole user population.

Page 26: Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Thank you. Dirk Lewandowski, [email protected]

Part of this research was supported by the European Union Regional Development Fund through the Estonian Centre of Excellence in Computer Science and by the target funding scheme SF0180008s12. In addition the research was supported by the Estonian Information Technology Foundation (EITSA) and the Tiger University program, as well as by Archimedes Estonia.