
Page 1: User-centered System Evaluation

User-centered System Evaluation

Page 2: User-centered System Evaluation

Reference

• Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012

Page 3: User-centered System Evaluation

INTRODUCTION

Page 4: User-centered System Evaluation

Interactive Information Retrieval (IIR)

• Traditional IR evaluations abstract users out of the evaluation process

• IIR focuses on users’ behaviors and experiences
– physical, cognitive, and affective
– interactions between users and systems
– interactions between users and information

Page 5: User-centered System Evaluation

Different evaluation questions

• Classic IR evaluation (non-user-centric): does this system retrieve relevant documents?

• IIR evaluation (user-centric): can people use this system to retrieve relevant documents?

• IIR is therefore often viewed as a sub-area of HCI

Page 6: User-centered System Evaluation

Relevance Feedback

• The same information need can lead to different queries, different search results, and different relevance feedback.

• Dealing with users is difficult, as the causes and consequences of interactions cannot be observed easily (they are in the user’s head)

• The available observations: entering a query, saving a document, providing relevance feedback

• Based on these observations, we must infer what is going on in the user’s head

Page 7: User-centered System Evaluation

Difficulties

• Each individual user has a different cognitive composition and behavioral disposition

• Some interactions are not easily observable or measurable
– motivation
– how much the user knows about the topic
– expectations

Page 8: User-centered System Evaluation

IIR

• Uses real users to evaluate IR

• Different approaches
– Using users to evaluate the retrieval results of a system (users are treated as black boxes)
– Search log analysis (queries, search results, and click-through behavior)
– The TREC Interactive Track evaluation model (evaluating a system or interface)
– General information search behavior in electronic environments (observing and documenting users’ natural search behaviors and interactions)

Page 9: User-centered System Evaluation

APPROACHES

Page 10: User-centered System Evaluation

Research goals

• Setting up a clear research goal:
– Exploration: when the subject is less well known; the focus is on learning about the subject rather than making predictions; the work is open-ended, and formal research questions or hypotheses are uncommon

– Description: documenting and describing a subject (e.g., query log or query behavior analysis) to provide benchmark descriptions and classifications; results can be used to inform other studies

– Explanation: examining the relationship between two or more variables with the goal of prediction and explanation; establishing causality

Page 11: User-centered System Evaluation

Approaches

• Evaluations vs. experiments
– Evaluation: to assess the goodness of a system, interface, or interaction technique
– Experiment: to understand behavior (similar to experiments in psychology or education); compares at least two things

• Lab and naturalistic studies
– Lab (more control) vs. naturalistic (less control)

• Longitudinal studies
– Take place over an extended period of time, with measurements taken at fixed intervals

Page 12: User-centered System Evaluation

Approaches

• Case studies
– The intensive study of a small number of cases
– A case may be a user, a system, or an organization
– Usually take place in naturalistic settings and involve some longitudinal elements
– Not for generalizing, but for gaining an in-depth view of a particular case

• Wizard of Oz studies and simulations
– Test “non-real” or simulated systems
– Used for proof-of-concept
– Provide an indication of what might happen in ideal circumstances
– Wizard of Oz studies are simulations
– Simulated users can represent different actions or steps a real user might take while interacting with an IR system

Page 13: User-centered System Evaluation

RESEARCH BASICS

Page 14: User-centered System Evaluation

Problems and Questions

• Identify and describe problems
– Provides a roadmap for the research

• Examples of research questions
– Exploratory:
• How do people re-find information on the Web?

– Descriptive:
• What Web browser functionalities are currently being used during web-based information-seeking tasks?

– Explanatory:
• What are the differences between written and spoken queries in terms of their retrieval characteristics and performance outcomes?
• What is the relationship between query box size and query length? What is the relationship between query length and performance?

Page 15: User-centered System Evaluation

Theory

• A theory is a system of logical principles that attempts to explain relations among natural, observable phenomena.

• A theory is abstract and general, and can generate more specific hypotheses

Page 16: User-centered System Evaluation

Hypotheses

• Hypotheses state expected relationships between two variables

• Alternative hypotheses vs. null hypotheses
– A specific relationship vs. no relationship

• Hypotheses can be directional or non-directional

Page 17: User-centered System Evaluation

Variables and measurement

• Variables represent concepts

• To analyze concepts:
– Conceptualization: define the concept by providing a working definition and dividing it into dimensions
– Operationalization: decide how to measure the concept

• Direct and indirect observables
– Directly observed: # of queries entered, the amount of time spent searching
– Indirectly observed: user satisfaction

Page 18: User-centered System Evaluation

Variables

• Independent: the causes
– e.g., examining differences in how males and females use an experimental and a baseline IIR system; sex is the independent variable

• Dependent: the effects
– e.g., satisfaction with, or performance of, the systems

• Confounding variables
– Affect the independent or dependent variable, but have not been controlled by the researcher
– e.g., perhaps males are more familiar with these systems than females

Page 19: User-centered System Evaluation

Measurement

• Range of variation
– The preciseness of the measure
– e.g., the categories of usage frequency of a system

• Exhaustiveness
– A complete list of choices

• Exclusiveness
– Categories must be distinguishable: e.g., how would you differentiate partially relevant vs. somewhat relevant (in your relevance rubric)?

• Equivalence
– Find items that are of the same type and at the same level of specificity
– e.g., matching different scales: “I know details” = very familiar, “I know nothing” = very unfamiliar

• Appropriateness
– e.g., “How likely are you to recommend this system to others?” rated on a five-point scale from strongly agree to strongly disagree, which does not match the question

Page 20: User-centered System Evaluation

Level of Measurement

• Two basic levels of measurement: discrete vs. continuous
– Discrete measures: categorical responses
• Nominal: no order
– e.g., interface type, sex, task type
• Ordinal: ordered
– Rank order (from most relevant to least relevant) or Likert-type order (five-point scale with 1 = not relevant, 5 = relevant)
– A relative measure
» one subject’s 2 may not represent the same thing internally as another subject’s 2
» we cannot say that a document rated 4 is twice as relevant as a document rated 2, since the scale contains no true zero

Page 21: User-centered System Evaluation

Level of Measurement

• Two basic levels of measurement: discrete vs. continuous
– Continuous measures: interval vs. ratio
• Interval: the differences between consecutive points are equal, but there is no true zero
– e.g., the Fahrenheit temperature scale, IQ test scores
– Zero does not mean no heat or no intelligence
– The differences between 50 vs. 80 and 90 vs. 120 are the same
• Ratio: the highest level of measurement, e.g., the number of occurrences
– There is a true zero
– e.g., time, number of pages viewed (zero is meaningful)

Page 22: User-centered System Evaluation

EXPERIMENTAL DESIGN

Page 23: User-centered System Evaluation

• The basic experimental design in IIR evaluation examines the effect of two or more systems or interfaces (the independent variable) on some set of outcome measures (the dependent variables)

Page 24: User-centered System Evaluation

IIR design

• The general goal of IIR is to determine whether a particular system helps subjects find relevant documents

• Developing a valid baseline in IIR evaluation involves identifying the status quo against which the experimental system is compared

• Random assignment can be used to increase the likelihood that subject characteristics are evenly distributed across groups (see the sketch below)
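A minimal Python sketch of such an assignment (illustrative only; the function name and group labels are this transcript’s additions, not from the slides):

```python
import random

def assign_subjects(subject_ids, conditions, seed=42):
    """Randomly assign subjects to conditions, keeping group sizes
    as equal as possible (deal subjects out after shuffling)."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {c: shuffled[i::len(conditions)] for i, c in enumerate(conditions)}

groups = assign_subjects(range(1, 25), ["baseline", "experimental"])
```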

Page 25: User-centered System Evaluation

Factorial Designs

• Good for studying the impact of more than one stimulus or variable

Page 26: User-centered System Evaluation

Rotation and counterbalancing

• The primary purpose of rotation and counterbalancing is to control for order effects and to increase the chance that results can be attributed to the experimental treatments and conditions.

• Rotating variables:
– Latin square design
– Graeco-Latin square design

Page 27: User-centered System Evaluation

Rotation and counterbalancing

A basic design with no rotation. Numbers in cells represent different topics

Cons:
1. Order effects
2. Some topics are easier than others; some systems may do better with some topics than others
3. Fatigue can impact the results

Page 28: User-centered System Evaluation

Latin Square rotation

Basic Latin Square rotation of topics

Basic Latin Square rotation of topics and randomization of columns

Problems:
– Interactions among topics
– The order effects of interfaces still exist
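For concreteness, a basic Latin square rotation can be generated with a cyclic construction, as in this Python sketch (illustrative, not part of the original slides):

```python
def latin_square(n):
    """Cyclic Latin square: row i gives the topic order for subject group i.
    Each topic appears exactly once in every row and every column."""
    return [[(i + j) % n + 1 for j in range(n)] for i in range(n)]

for row in latin_square(4):
    print(row)
# [1, 2, 3, 4]
# [2, 3, 4, 1]
# [3, 4, 1, 2]
# [4, 1, 2, 3]
```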

Page 29: User-centered System Evaluation

Graeco-Latin Square Design

• Addresses the order effects of interfaces that remain in the basic Latin square rotation

• Graeco-Latin Square is a combination of two or more Latin squares.
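A minimal sketch of the idea: superimpose two orthogonal cyclic Latin squares, so each cell rotates both a topic and an interface, and every (topic, interface) pair occurs exactly once. This simple construction assumes an odd number of conditions; the code is illustrative, not from the slides:

```python
def graeco_latin_square(n):
    """Pair an (i + j) rotation for topics with an (i + 2j) rotation for
    interfaces; the two cyclic squares are orthogonal when n is odd."""
    assert n % 2 == 1, "this simple construction requires odd n"
    return [[((i + j) % n + 1, (i + 2 * j) % n + 1) for j in range(n)]
            for i in range(n)]

for row in graeco_latin_square(3):
    print(row)  # each (topic, interface) pair appears exactly once overall
```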

Page 30: User-centered System Evaluation

Graeco-Latin Square Design

Page 31: User-centered System Evaluation

Study mode

• Batch mode
– Multiple subjects complete the study at the same location and time

• Single mode
– Subjects complete the study alone, with only the researcher present

• The choice of mode is determined by the purpose of the study
– Single mode: if each subject has to be interviewed, or some interactive communication is needed between subject and researcher
– Batch mode: self-contained and efficient (but subjects can influence each other)

Page 32: User-centered System Evaluation

Protocols

• A protocol is a step-by-step account of what will happen in a study.

• A protocol helps maintain the integrity of the study and ensures that subjects experience the study in similar ways.

Page 33: User-centered System Evaluation

Tutorials

• Provide some instruction on how to use a new IIR system
– Printed materials
– Verbal instructions
– Video tutorials

• Try to avoid bias in the tutorial
– such as focusing especially on one particular feature

Page 34: User-centered System Evaluation

Pilot testing

• To estimate time
• To identify problems with instruments, instructions, and protocols
• To get detailed feedback from test subjects

Page 35: User-centered System Evaluation

SAMPLING

Page 36: User-centered System Evaluation

Sampling

• It is not possible to include all elements of a population in a study

• The population in IIR evaluation is assumed to be all people who engage in online information search

• Sample size: the more the better

• Two approaches to sampling: probability sampling and non-probability sampling

Page 37: User-centered System Evaluation

Probability Sampling

• Selecting a sample from a population that maintains the same variation and diversity that exist within the population

• Representative sample:
– If a population is 60% male and 40% female, a representative sample would contain roughly the same ratio of males to females
– Increases the generalizability of the results
– Assumes that all elements in the population have an equal chance of being selected

Page 38: User-centered System Evaluation

Probability sampling

• Simple random sampling
– Randomly select elements

• Systematic sampling
– Select every kth element, where k = population size / sample size

• Stratified sampling
– Subdivide the population into more refined groups according to specific strata
– Select a sample that is proportionate to the population in each stratum
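The two procedural variants above are easy to express in code. A rough Python sketch, assuming `population` is a list and `key` maps an element to its stratum (both names are hypothetical):

```python
import random

def systematic_sample(population, sample_size, seed=0):
    """Select every kth element, k = population size / sample size,
    starting from a random offset (assumes sample_size <= population size)."""
    k = len(population) // sample_size
    start = random.Random(seed).randrange(k)
    return population[start::k][:sample_size]

def stratified_sample(population, key, sample_size, seed=0):
    """Sample from each stratum in proportion to its share of the population.
    Proportions are rounded, so the total may differ slightly from sample_size."""
    rng = random.Random(seed)
    strata = {}
    for element in population:
        strata.setdefault(key(element), []).append(element)
    sample = []
    for members in strata.values():
        n = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, n))
    return sample
```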

Page 39: User-centered System Evaluation

Non-probability sampling

• Used when all of the elements in a population are unknown or unavailable

• It limits the ability to generalize

• Researchers should be cautious when generalizing their data and be aware of the sampling limitations of their research

Page 40: User-centered System Evaluation

Non-probability sampling

• Three major types of non-probability sampling:
– Convenience: relying on available elements the researcher can access, e.g., undergraduate students or people located close to the researcher

– Purposive or judgmental: the researcher selects subjects or other elements that have particular characteristics, expertise, or perspectives

– Quota: similar to stratified sampling, but subjects for each stratum are recruited on a first-come, first-served basis

Page 41: User-centered System Evaluation

Subject Recruitment

• Many ways to recruit subjects
– Sending solicitations to mailing lists
– Personal invitations
– Using referral services
– Crowdsourcing (e.g., Mechanical Turk)
– Web advertising
– Mass mailings
– Virtual postings in online locations

• Pros and cons: using lab mates or members of one’s own research group as study subjects

Page 42: User-centered System Evaluation

COLLECTIONS

Page 43: User-centered System Evaluation

Collections for testing

• Identification of a set of documents for subjects to search, a set of tasks or topics to direct this searching, and ground truth about the relevance of the searched objects to the topics

• A test collection: corpus, topics, and relevance judgments

Page 44: User-centered System Evaluation

TREC collections

• TREC Interactive and HARD tracks
– Newswire, blog, and legal corpora
– Artificial topics
– The relevance assessment generalization problem

Page 45: User-centered System Evaluation

Web corpora

• The major drawback is that it is impossible to replicate the study, since the Web is constantly changing.

• The same queries issued at different times can return completely different results

Page 46: User-centered System Evaluation

Natural corpora

• Corpora assembled over time by study participants
– Pros: meaningful to subjects, controllable
– Cons: lack of replicability and equivalence across participants

Page 47: User-centered System Evaluation

Tasks and topics

• Most information needs can be characterized in terms of tasks and topics
– Information need = task = topic

• Information needs
– People may not know their information needs
– People have difficulty articulating their information needs
– or expressing them in a vocabulary suited to a system

Page 48: User-centered System Evaluation

Generating information needs

• It is not clear at what level of specificity a task or topic should be defined
– A task can be broken down into a series of sub-tasks, e.g., writing a research proposal

• Query logs can be mined to develop information needs

Page 49: User-centered System Evaluation

DATA COLLECTION TECHNIQUES

Page 50: User-centered System Evaluation

Data collection techniques

• Corpora, tasks, topics, and relevance assessments are the major instruments for evaluating IIR systems

• Other instruments, such as questionnaires and screen-capture software, also allow researchers to collect data

Page 51: User-centered System Evaluation

Think-Aloud

• Subjects articulate their thinking and decision-making as they interact with the IIR system

• Tools: microphone, recording software

• It is unnatural, as most people do not articulate their thoughts while they complete tasks

Page 52: User-centered System Evaluation

Stimulated Recall

• The researcher records the computer screen as the subject completes a search task. The recording is then played back, and the subject is asked to articulate the thinking and decision-making behind their actions.

• Tool: screen recording software

Page 53: User-centered System Evaluation

Spontaneous and prompted self-report

• Elicit feedback from subjects periodically while they search.

• Goal: to get more refined feedback about the search, rather than only summative feedback at the end of the search

Page 54: User-centered System Evaluation

Observation

• The researcher is seated near subjects and observes them as they conduct IIR activities

• Tools: video camera, screen-capture software

• Time-consuming and labor-intensive

• Prone to selective attention and researcher bias

Page 55: User-centered System Evaluation

Logging

• Analyzing transaction logs

• Client-side logging provides a more robust and comprehensive log of the user’s interactions

• But it is very hard to build a client-side logger

Page 56: User-centered System Evaluation

Questionnaire

• Consists of
– closed questions, where a specific response set is provided (e.g., a five-point scale) → quantitative analysis
– open questions → qualitative analysis

• Closed questions: Likert-type scales (e.g., five to seven points: strongly agree, agree, neutral, disagree, strongly disagree)

• Open questions: content analysis

• Different modes: electronic, pen-and-paper, interview

Page 57: User-centered System Evaluation

Interview

• Few IIR evaluations consist solely of interviews, but interviews are a common component of many study protocols.

• Subjects respond to open-ended questions better in interviews than in the other two modes (electronic or pen-and-paper)

• Interviews can be structured, semi-structured, or open

Page 58: User-centered System Evaluation

MEASURES

Page 59: User-centered System Evaluation

Four basic measures

• Four basic classes of measures
– Contextual (age, sex, search experience, personality type)

– Interaction (# of queries issued, # of documents viewed, query length); can be extracted from log data

– Performance (# of relevant documents saved, mean average precision, discounted cumulated gain); can be computed from log data

– Usability: subject attitudes and feelings about the system and their interactions

Page 60: User-centered System Evaluation

Contextual

• Individual differences: their impact on the study results

• Information needs: domain expertise is measured using credentials

• Persistence of the information need
• Immediacy of the information need
• Information-seeking stage

Page 61: User-centered System Evaluation

Interaction

• Measures:
– # of queries, # of search results viewed, # of documents viewed, # of documents saved, query length (see the sketch below)

• The implicit definition of interaction is tied to feedback
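Such measures fall out of a single pass over a transaction log. A minimal Python sketch, where the event tuples and action names ("query", "view_result", "save_doc") are hypothetical stand-ins for whatever a real logger records:

```python
from collections import Counter

def interaction_measures(events):
    """Derive basic interaction measures from (action, value) log events."""
    actions = Counter(action for action, _ in events)
    query_lengths = [len(value.split()) for action, value in events
                     if action == "query"]
    return {
        "num_queries": actions["query"],
        "num_results_viewed": actions["view_result"],
        "num_docs_saved": actions["save_doc"],
        "mean_query_length": (sum(query_lengths) / len(query_lengths)
                              if query_lengths else 0.0),
    }

log = [("query", "interactive IR evaluation"), ("view_result", "doc12"),
       ("save_doc", "doc12"), ("query", "relevance feedback")]
print(interaction_measures(log))
```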

Page 62: User-centered System Evaluation

Performance

• Directly applying TREC measures to IIR evaluation assumes that relevance is binary, static, uni-dimensional, and generalizable

• Are TREC-based performance measures meaningful to end users?
– A measure that evaluates systems based on the retrieval of 1,000 documents is unlikely to be meaningful to users, since most users will not look through 1,000 documents

Page 63: User-centered System Evaluation

Traditional IR performance measures
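The slide’s table of measures did not survive extraction. For reference (a standard reminder, not the slide’s own content), with R the set of relevant documents and A the set retrieved:

```latex
\mathrm{Precision} = \frac{|R \cap A|}{|A|}, \qquad
\mathrm{Recall} = \frac{|R \cap A|}{|R|}
```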

Page 64: User-centered System Evaluation

Interactive recall and precision
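This slide’s formulas are likewise missing from the transcript. As commonly defined (e.g., in Kelly’s monograph), the interactive variants replace the system’s ranked output with the set S of documents the subject actually saved; a reconstruction under that assumption:

```latex
\mathrm{Interactive\ precision} = \frac{|R \cap S|}{|S|}, \qquad
\mathrm{Interactive\ recall} = \frac{|R \cap S|}{|R|}
```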

Page 65: User-centered System Evaluation

Measures that accommodate multi-level relevance and rank
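The slide content is not preserved here; the best-known measure in this family is discounted cumulated gain (DCG, mentioned on Page 59), which credits the graded relevance rel_i of the document at rank i and discounts it logarithmically by rank:

```latex
\mathrm{DCG}@k = rel_1 + \sum_{i=2}^{k} \frac{rel_i}{\log_2 i}
```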

Page 66: User-centered System Evaluation

Time-based measures

• A variety of time-based measures– The length of time subjects spend in different

states or modes– The amount of time it takes a subject to save the

first relevant articles– The number of relevant documents saved during a

fixed period of time– The number of actions or steps taken to complete

a task
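Time to first relevant save is typical of how such measures are computed from timestamped logs. A small illustrative Python sketch (the event names are hypothetical):

```python
def time_to_first_save(events):
    """Seconds from the start of the session to the first 'save' event;
    events are (timestamp_seconds, action) pairs. None if nothing saved."""
    start = events[0][0]
    for timestamp, action in events:
        if action == "save":
            return timestamp - start
    return None

session = [(0.0, "query"), (12.5, "view"), (31.0, "save"), (55.2, "view")]
print(time_to_first_save(session))  # 31.0
```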

Page 67: User-centered System Evaluation

Cost and utility measures

• Some search services are not free

• Cost and utility measures have always been an important part of the evaluation of library and information services

Page 68: User-centered System Evaluation

Evaluative feedback from subjects

• Usability
– Effectiveness, efficiency, and satisfaction are key dimensions of usability
– Effectiveness: precision, recall
– Efficiency: the time it takes a subject to complete a task
– Satisfaction: with each experimental feature of the system; subject perceptions of outcomes and interactions

Page 69: User-centered System Evaluation

Available instruments for measuring usability

• Questionnaire for User Interface Satisfaction (QUIS): http://lap.umd.edu/quis/
– 10-point scales for software, screen, terminology, system, etc.

• The USE questionnaire
– Usefulness, ease of use, ease of learning, satisfaction (7-point scales)

• Software Usability Measurement Inventory (SUMI): http://sumi.ucc.ie/whatis.html
– Agree, don’t know, and disagree responses for 50 items

Page 70: User-centered System Evaluation

DATA ANALYSIS

Page 71: User-centered System Evaluation

Qualitative data analyses

• The goal of most qualitative data analyses conducted in IIR is to reduce the qualitative responses into a set of categories or themes that can be used to characterize and summarize responses.

• Content analysis: it starts with a well-defined and structured classification scheme, including categories and classification rules.

• Open coding: the categories are usually developed inductively during the analysis process as the researcher analyzes the data.

Page 72: User-centered System Evaluation

Quantitative data analysis

Page 73: User-centered System Evaluation

VALIDITY AND RELIABILITY

Page 74: User-centered System Evaluation

Validity

• Internal validity: the quality of what happens during the study
– e.g., whether the selected instrument yields poor or inaccurate data

• External validity: the extent to which the results of a study can be generalized to the real world

• Lab studies are generally less valid, but more reliable, than naturalistic studies

• Use instruments with established reliability