
Page 1: User-centered System Evaluation

User-centered System Evaluation

Page 2: User-centered System Evaluation

Reference

• Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012

Page 3: User-centered System Evaluation

INTRODUCTION

Page 4: User-centered System Evaluation

Interactive Information Retrieval (IIR)

• Traditional IR evaluations abstract users out of the evaluation process

• IIR focuses on users’ behaviors and experiences
– physical, cognitive, and affective
– interactions between users and systems
– interactions between users and information

Page 5: User-centered System Evaluation

Different evaluation questions

• Classic IR evaluation (non-user-centric): does this system retrieve relevant documents?

• IIR evaluation (user-centric): can people use this system to retrieve relevant documents?

• IIR is therefore often viewed as a sub-area of HCI

Page 6: User-centered System Evaluation

Relevance Feedback

• The same information need can lead to different queries, different search results, and different relevance feedback.

• Dealing with users is difficult, as the causes and consequences of interactions cannot be observed easily (they are in the user’s head)

• The available observations: entering a query, saving a document, providing relevance feedback

• Based on these observations, we must infer what is going on in the user’s head

Page 7: User-centered System Evaluation

Difficulties

• Each individual user has a different cognitive composition and behavioral disposition

• Some interactions are not easily observable or measurable
– motivation
– how much the user knows about the topic
– expectations

Page 8: User-centered System Evaluation

IIR

• Uses real users to evaluate IR

• Different approaches
– Using users to evaluate the retrieval results of a system (users are treated as black boxes)
– Search log analysis (queries, search results, and click-through behavior)
– The TREC Interactive Track evaluation model (evaluating a system or interface)
– General information search behavior in electronic environments (observing and documenting users’ natural search behaviors and interactions)

Page 9: User-centered System Evaluation

APPROACHES

Page 10: User-centered System Evaluation

Research goals

• Setting up a clear research goal:
– Exploration: when the subject is less well known; the focus is on learning about the subject rather than making predictions; the work is open-ended, and formal research questions or hypotheses are uncommon

– Description: documenting and describing a subject (e.g., query log or query behavior analysis) to provide benchmark descriptions and classifications; results can be used to inform other studies

– Explanation: examining the relationship between two or more variables with the goal of prediction and explanation; establishing causality

Page 11: User-centered System Evaluation

Approaches

• Evaluations vs. experiments
– Evaluation: to assess the goodness of a system, interface, or interaction technique
– Experiment: to understand behavior (similar to experiments in psychology or education); compares at least two things

• Lab and naturalistic studies
– Lab (more control) vs. naturalistic (less control)

• Longitudinal studies
– Take place over an extended period of time, with measurements taken at fixed intervals

Page 12: User-centered System Evaluation

Approaches

• Case studies
– The intensive study of a small number of cases
– A case may be a user, a system, or an organization
– Usually take place in naturalistic settings and involve some longitudinal elements
– Not for generalizing, but for gaining an in-depth view of a particular case

• Wizard of Oz studies and simulations
– Test “non-real” or simulated systems
– Used for proof-of-concept
– Provide an indication of what might happen in ideal circumstances
– Wizard of Oz studies are simulations
– Simulated users can represent different actions or steps a real user might take while interacting with an IR system

Page 13: User-centered System Evaluation

RESEARCH BASICS

Page 14: User-centered System Evaluation

Problems and Questions

• Identify and describe problems
– Provides a roadmap for the research

• Examples of research questions
– Exploratory:
• How do people re-find information on the Web?

– Descriptive:
• What Web browser functionalities are currently being used during web-based information-seeking tasks?

– Explanatory:
• What are the differences between written and spoken queries in terms of their retrieval characteristics and performance outcomes?
• What is the relationship between query box size and query length? What is the relationship between query length and performance?

Page 15: User-centered System Evaluation

Theory

• A theory is a system of logical principles that attempts to explain relations among natural, observable phenomena.

• A theory is abstract and general, and can generate more specific hypotheses

Page 16: User-centered System Evaluation

Hypotheses

• Hypotheses state expected relationships between two variables

• Alternative hypotheses vs. null hypotheses
– A specific relationship vs. no relationship

• Hypotheses can be directional or non-directional

Page 17: User-centered System Evaluation

Variables and measurement

• Variables represent concepts

• To analyze concepts:
– Conceptualization: define the concept by providing a working definition and dividing it into dimensions
– Operationalization: decide how to measure the concept

• Direct and indirect observables
– Directly observed: # of queries entered, the amount of time spent searching
– Indirectly observed: user satisfaction

Page 18: User-centered System Evaluation

Variables

• Independent: the causes
– e.g., examining differences in how males and females use an experimental and a baseline IIR system; sex is the independent variable

• Dependent: the effects
– e.g., satisfaction with, or performance of, the systems

• Confounding variables
– Affect the independent or dependent variable, but have not been controlled by the researcher
– e.g., perhaps males are more familiar with these systems than females

Page 19: User-centered System Evaluation

Measurement

• Range of variation
– The preciseness of the measure
– e.g., the categories of usage frequency of a system

• Exhaustiveness
– A complete list of choices

• Exclusiveness
– Categories must be distinguishable: e.g., how would you differentiate partially relevant vs. somewhat relevant (in your relevance rubric)?

• Equivalence
– Find items that are of the same type and at the same level of specificity
– e.g., matching different scales: “I know details” = very familiar, “I know nothing” = very unfamiliar

• Appropriateness
– e.g., “How likely are you to recommend this system to others?” rated on a five-point scale from strongly agree to strongly disagree, which does not match the question

Page 20: User-centered System Evaluation

Level of Measurement

• Two basic levels of measurement: discrete vs. continuous
– Discrete measures: categorical responses
• Nominal: no order
– e.g., interface type, sex, task type
• Ordinal: ordered
– Rank order (from most relevant to least relevant) or Likert-type order (five-point scale with 1 = not relevant, 5 = relevant)
– A relative measure
» one subject’s 2 may not represent the same thing internally as another subject’s 2
» we cannot say that a document rated 4 is twice as relevant as a document rated 2, since the scale contains no true zero

Page 21: User-centered System Evaluation

Level of Measurement

• Two basic levels of measurement: discrete vs. continuous
– Continuous measures: interval vs. ratio
• Interval: the differences between consecutive points are equal, but there is no true zero
– e.g., the Fahrenheit temperature scale, IQ test scores
– Zero does not mean no heat or no intelligence
– The differences between 50 vs. 80 and 90 vs. 120 are the same
• Ratio: the highest level of measurement, e.g., the number of occurrences
– There is a true zero
– e.g., time, number of pages viewed (zero is meaningful)

Page 22: User-centered System Evaluation

EXPERIMENTAL DESIGN

Page 23: User-centered System Evaluation

• The basic experimental design in IIR evaluation examines the effect of two or more systems or interfaces (the independent variable) on some set of outcome measures (the dependent variables)

Page 24: User-centered System Evaluation

IIR design

• The general goal of IIR is to determine whether a particular system helps subjects find relevant documents

• Developing a valid baseline in IIR evaluation involves identifying the status quo against which the experimental system is compared

• Random assignment can be used to increase the likelihood that subject characteristics are evenly distributed across groups (see the sketch below)
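A minimal Python sketch of such an assignment (illustrative only; the function name and group labels are this transcript’s additions, not from the slides):

```python
import random

def assign_subjects(subject_ids, conditions, seed=42):
    """Randomly assign subjects to conditions, keeping group sizes
    as equal as possible (deal subjects out after shuffling)."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {c: shuffled[i::len(conditions)] for i, c in enumerate(conditions)}

groups = assign_subjects(range(1, 25), ["baseline", "experimental"])
```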

Page 25: User-centered System Evaluation

Factorial Designs

• Good for studying the impact of more than one stimulus or variable

Page 26: User-centered System Evaluation

Rotation and counterbalancing

• The primary purpose of rotation and counterbalancing is to control for order effects and to increase the chance that results can be attributed to the experimental treatments and conditions.

• Rotating variables:
– Latin square design
– Graeco-Latin square design

Page 27: User-centered System Evaluation

Rotation and counterbalancing

A basic design with no rotation. Numbers in cells represent different topics

Cons:
1. Order effects
2. Some topics are easier than others; some systems may do better with some topics than others
3. Fatigue can impact the results

Page 28: User-centered System Evaluation

Latin Square rotation

Basic Latin Square rotation of topics

Basic Latin Square rotation of topics and randomization of columns

Problems:
– Interactions among topics
– The order effects of interfaces still exist
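For concreteness, a basic Latin square rotation can be generated with a cyclic construction, as in this Python sketch (illustrative, not part of the original slides):

```python
def latin_square(n):
    """Cyclic Latin square: row i gives the topic order for subject group i.
    Each topic appears exactly once in every row and every column."""
    return [[(i + j) % n + 1 for j in range(n)] for i in range(n)]

for row in latin_square(4):
    print(row)
# [1, 2, 3, 4]
# [2, 3, 4, 1]
# [3, 4, 1, 2]
# [4, 1, 2, 3]
```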

Page 29: User-centered System Evaluation

Graeco-Latin Square Design

• Addresses the order effects of interfaces that remain in the basic Latin square rotation

• Graeco-Latin Square is a combination of two or more Latin squares.
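A minimal sketch of the idea: superimpose two orthogonal cyclic Latin squares, so each cell rotates both a topic and an interface, and every (topic, interface) pair occurs exactly once. This simple construction assumes an odd number of conditions; the code is illustrative, not from the slides:

```python
def graeco_latin_square(n):
    """Pair an (i + j) rotation for topics with an (i + 2j) rotation for
    interfaces; the two cyclic squares are orthogonal when n is odd."""
    assert n % 2 == 1, "this simple construction requires odd n"
    return [[((i + j) % n + 1, (i + 2 * j) % n + 1) for j in range(n)]
            for i in range(n)]

for row in graeco_latin_square(3):
    print(row)  # each (topic, interface) pair appears exactly once overall
```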

Page 30: User-centered System Evaluation

Graeco-Latin Square Design

Page 31: User-centered System Evaluation

Study mode

• Batch mode
– Multiple subjects complete the study at the same location and time

• Single mode
– Subjects complete the study alone, with only the researcher present

• The choice of mode is determined by the purpose of the study
– Single mode: if each subject has to be interviewed, or some interactive communication is needed between subject and researcher
– Batch mode: self-contained and efficient (but subjects can influence each other)

Page 32: User-centered System Evaluation

Protocols

• A protocol is a step-by-step account of what will happen in a study.

• A protocol helps maintain the integrity of the study and ensures that subjects experience the study in similar ways.

Page 33: User-centered System Evaluation

Tutorials

• Provide some instruction on how to use a new IIR system
– Printed materials
– Verbal instructions
– Video tutorials

• Try to avoid bias in the tutorial
– such as focusing especially on one particular feature

Page 34: User-centered System Evaluation

Pilot testing

• To estimate time
• To identify problems with instruments, instructions, and protocols
• To get detailed feedback from test subjects

Page 35: User-centered System Evaluation

SAMPLING

Page 36: User-centered System Evaluation

Sampling

• It is not possible to include all elements of a population in a study

• The population in IIR evaluation is assumed to be all people who engage in online information search

• Sample size: the more the better

• Two approaches to sampling: probability sampling and non-probability sampling

Page 37: User-centered System Evaluation

Probability Sampling

• Selecting a sample from a population that maintains the same variation and diversity that exist within the population

• Representative sample:
– If a population is 60% male and 40% female, a representative sample would contain roughly the same ratio of males to females
– Increases the generalizability of the results
– Assumes that all elements in the population have an equal chance of being selected

Page 38: User-centered System Evaluation

Probability sampling

• Simple random sampling
– Randomly select elements

• Systematic sampling
– Select every kth element, where k = population size / sample size

• Stratified sampling
– Subdivide the population into more refined groups according to specific strata
– Select a sample that is proportionate to the population in each stratum
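The two procedural variants above are easy to express in code. A rough Python sketch, assuming `population` is a list and `key` maps an element to its stratum (both names are hypothetical):

```python
import random

def systematic_sample(population, sample_size, seed=0):
    """Select every kth element, k = population size / sample size,
    starting from a random offset (assumes sample_size <= population size)."""
    k = len(population) // sample_size
    start = random.Random(seed).randrange(k)
    return population[start::k][:sample_size]

def stratified_sample(population, key, sample_size, seed=0):
    """Sample from each stratum in proportion to its share of the population.
    Proportions are rounded, so the total may differ slightly from sample_size."""
    rng = random.Random(seed)
    strata = {}
    for element in population:
        strata.setdefault(key(element), []).append(element)
    sample = []
    for members in strata.values():
        n = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, n))
    return sample
```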

Page 39: User-centered System Evaluation

Non-probability sampling

• Used when all of the elements in a population are unknown or unavailable

• It limits the ability to generalize

• Researchers should be cautious when generalizing their data and be aware of the sampling limitations of their research

Page 40: User-centered System Evaluation

Non-probability sampling

• Three major types of non-probability sampling:
– Convenience: relying on available elements the researcher can access, e.g., undergraduate students or people located close to the researcher

– Purposive or judgmental: the researcher selects subjects or other elements that have particular characteristics, expertise, or perspectives

– Quota: similar to stratified sampling, but subjects for each stratum are recruited on a first-come, first-served basis

Page 41: User-centered System Evaluation

Subject Recruitment

• Many ways to recruit subjects
– Sending solicitations to mailing lists
– Personal invitations
– Using referral services
– Crowdsourcing (e.g., Mechanical Turk)
– Web advertising
– Mass mailings
– Virtual postings in online locations

• Pros and cons: using lab mates or members of one’s own research group as study subjects

Page 42: User-centered System Evaluation

COLLECTIONS

Page 43: User-centered System Evaluation

Collections for testing

• Identification of a set of documents for subjects to search, a set of tasks or topics to direct this searching, and ground truth about the relevance of the searched objects to the topics

• A test collection: corpus, topics, and relevance judgments

Page 44: User-centered System Evaluation

TREC collections

• TREC Interactive and HARD tracks
– Newswire, blog, and legal corpora
– Artificial topics
– The relevance assessment generalization problem

Page 45: User-centered System Evaluation

Web corpora

• The major drawback is that it is impossible to replicate the study, since the Web is constantly changing.

• The same queries issued at different times can return completely different results

Page 46: User-centered System Evaluation

Natural corpora

• Corpora assembled over time by study participants
– Pros: meaningful to subjects, controllable
– Cons: lack of replicability and equivalence across participants

Page 47: User-centered System Evaluation

Tasks and topics

• Most information needs can be characterized in terms of tasks and topics
– Information need = task = topic

• Information needs
– People may not know their information needs
– People have difficulty articulating their information needs
– or expressing them in a vocabulary suited to a system

Page 48: User-centered System Evaluation

Generating information needs

• It is not clear at what level of specificity a task or topic should be defined
– A task can be broken down into a series of sub-tasks, e.g., writing a research proposal

• Query logs can be mined to develop information needs

Page 49: User-centered System Evaluation

DATA COLLECTION TECHNIQUES

Page 50: User-centered System Evaluation

Data collection techniques

• Corpora, tasks, topics, and relevance assessments are the major instruments for evaluating IIR systems

• Other instruments, such as questionnaires and screen-capture software, also allow researchers to collect data

Page 51: User-centered System Evaluation

Think-Aloud

• Subjects articulate their thinking and decision-making as they interact with the IIR system

• Tools: microphone, recording software

• It is unnatural, as most people do not articulate their thoughts while they complete tasks

Page 52: User-centered System Evaluation

Stimulated Recall

• The researcher records the computer screen as the subject completes a search task. The recording is then played back, and the subject is asked to articulate the thinking and decision-making behind their actions.

• Tool: screen recording software

Page 53: User-centered System Evaluation

Spontaneous and prompted self-report

• Elicit feedback from subjects periodically while they search.

• Goal: to get more refined feedback about the search, rather than only summative feedback at the end of the search

Page 54: User-centered System Evaluation

Observation

• The researcher is seated near subjects and observes them as they conduct IIR activities

• Tools: video camera, screen-capture software

• Time-consuming and labor-intensive

• Prone to selective attention and researcher bias

Page 55: User-centered System Evaluation

Logging

• Analyzing transaction logs

• Client-side logging provides a more robust and comprehensive log of the user’s interactions

• But it is very hard to build a client-side logger

Page 56: User-centered System Evaluation

Questionnaire

• Consists of
– closed questions, where a specific response set is provided (e.g., a five-point scale) → quantitative analysis
– open questions → qualitative analysis

• Closed questions: Likert-type scales (e.g., five to seven points: strongly agree, agree, neutral, disagree, strongly disagree)

• Open questions: content analysis

• Different modes: electronic, pen-and-paper, interview

Page 57: User-centered System Evaluation

Interview

• Few IIR evaluations consist solely of interviews, but interviews are a common component of many study protocols.

• Subjects respond to open-ended questions better in interviews than in the other two modes (electronic or pen-and-paper)

• Interviews can be structured, semi-structured, or open

Page 58: User-centered System Evaluation

MEASURES

Page 59: User-centered System Evaluation

Four basic measures

• Four basic classes of measures
– Contextual (age, sex, search experience, personality type)

– Interaction (# of queries issued, # of documents viewed, query length); can be extracted from log data

– Performance (# of relevant documents saved, mean average precision, discounted cumulated gain); can be computed from log data

– Usability: subject attitudes and feelings about the system and their interactions

Page 60: User-centered System Evaluation

Contextual

• Individual differences: their impact on the study results

• Information needs: domain expertise is measured using credentials

• Persistence of the information need
• Immediacy of the information need
• Information-seeking stage

Page 61: User-centered System Evaluation

Interaction

• Measures:
– # of queries, # of search results viewed, # of documents viewed, # of documents saved, query length (see the sketch below)

• The implicit definition of interaction is tied to feedback
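Such measures fall out of a single pass over a transaction log. A minimal Python sketch, where the event tuples and action names ("query", "view_result", "save_doc") are hypothetical stand-ins for whatever a real logger records:

```python
from collections import Counter

def interaction_measures(events):
    """Derive basic interaction measures from (action, value) log events."""
    actions = Counter(action for action, _ in events)
    query_lengths = [len(value.split()) for action, value in events
                     if action == "query"]
    return {
        "num_queries": actions["query"],
        "num_results_viewed": actions["view_result"],
        "num_docs_saved": actions["save_doc"],
        "mean_query_length": (sum(query_lengths) / len(query_lengths)
                              if query_lengths else 0.0),
    }

log = [("query", "interactive IR evaluation"), ("view_result", "doc12"),
       ("save_doc", "doc12"), ("query", "relevance feedback")]
print(interaction_measures(log))
```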

Page 62: User-centered System Evaluation

Performance

• Directly applying TREC measures to IIR evaluation assumes that relevance is binary, static, uni-dimensional, and generalizable

• Are TREC-based performance measures meaningful to end users?
– A measure that evaluates systems based on the retrieval of 1,000 documents is unlikely to be meaningful to users, since most users will not look through 1,000 documents

Page 63: User-centered System Evaluation

Traditional IR performance measures
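The slide’s table of measures did not survive extraction. For reference (a standard reminder, not the slide’s own content), with R the set of relevant documents and A the set retrieved:

```latex
\mathrm{Precision} = \frac{|R \cap A|}{|A|}, \qquad
\mathrm{Recall} = \frac{|R \cap A|}{|R|}
```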

Page 64: User-centered System Evaluation

Interactive recall and precision
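This slide’s formulas are likewise missing from the transcript. As commonly defined (e.g., in Kelly’s monograph), the interactive variants replace the system’s ranked output with the set S of documents the subject actually saved; a reconstruction under that assumption:

```latex
\mathrm{Interactive\ precision} = \frac{|R \cap S|}{|S|}, \qquad
\mathrm{Interactive\ recall} = \frac{|R \cap S|}{|R|}
```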

Page 65: User-centered System Evaluation

Measures that accommodate multi-level relevance and rank
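The slide content is not preserved here; the best-known measure in this family is discounted cumulated gain (DCG, mentioned on Page 59), which credits the graded relevance rel_i of the document at rank i and discounts it logarithmically by rank:

```latex
\mathrm{DCG}@k = rel_1 + \sum_{i=2}^{k} \frac{rel_i}{\log_2 i}
```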

Page 66: User-centered System Evaluation

Time-based measures

• A variety of time-based measures– The length of time subjects spend in different

states or modes– The amount of time it takes a subject to save the

first relevant articles– The number of relevant documents saved during a

fixed period of time– The number of actions or steps taken to complete

a task
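Time to first relevant save is typical of how such measures are computed from timestamped logs. A small illustrative Python sketch (the event names are hypothetical):

```python
def time_to_first_save(events):
    """Seconds from the start of the session to the first 'save' event;
    events are (timestamp_seconds, action) pairs. None if nothing saved."""
    start = events[0][0]
    for timestamp, action in events:
        if action == "save":
            return timestamp - start
    return None

session = [(0.0, "query"), (12.5, "view"), (31.0, "save"), (55.2, "view")]
print(time_to_first_save(session))  # 31.0
```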

Page 67: User-centered System Evaluation

Cost and utility measures

• Some search services are not free

• Cost and utility measures have always been an important part of the evaluation of library and information services

Page 68: User-centered System Evaluation

Evaluative feedback from subjects

• Usability
– Effectiveness, efficiency, and satisfaction are key dimensions of usability
– Effectiveness: precision, recall
– Efficiency: the time it takes a subject to complete a task
– Satisfaction: with each experimental feature of the system; subject perceptions of outcomes and interactions

Page 69: User-centered System Evaluation

Available instruments for measuring usability

• Questionnaire for User Interface Satisfaction (QUIS): http://lap.umd.edu/quis/
– 10-point scales for software, screen, terminology, system, etc.

• The USE questionnaire
– Usefulness, ease of use, ease of learning, satisfaction (7-point scales)

• Software Usability Measurement Inventory (SUMI): http://sumi.ucc.ie/whatis.html
– Agree, don’t know, and disagree responses for 50 items

Page 70: User-centered System Evaluation

DATA ANALYSIS

Page 71: User-centered System Evaluation

Qualitative data analyses

• The goal of most qualitative data analyses conducted in IIR is to reduce the qualitative responses into a set of categories or themes that can be used to characterize and summarize responses.

• Content analysis: it starts with a well-defined and structured classification scheme, including categories and classification rules.

• Open coding: the categories are usually developed inductively during the analysis process as the researcher analyzes the data.

Page 72: User-centered System Evaluation

Quantitative data analysis

Page 73: User-centered System Evaluation

VALIDITY AND RELIABILITY

Page 74: User-centered System Evaluation

Validity

• Internal validity: the quality of what happens during the study
– e.g., whether the selected instrument yields poor or inaccurate data

• External validity: the extent to which the results of a study can be generalized to the real world

• Lab studies are generally less valid, but more reliable, than naturalistic studies

• Use instruments with established reliability