User-centered System Evaluation


Post on 02-Jan-2016









Methods for Evaluating Interactive Information Retrieval Systems with Users

User-centered System Evaluation

Slide 1: Reference
- Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012

Slide 2: Introduction

Slide 3: Interactive Information Retrieval (IIR)
- Traditional IR evaluations abstract users out of the evaluation process.
- IIR focuses on users' behaviors and experiences: physical, cognitive, and affective.
  - Interactions between users and systems
  - Interactions between users and information

Slide 4: Different evaluation questions
- Classic IR evaluation (non-user-centric): does this system retrieve relevant documents?
- IIR evaluation (user-centric): can people use this system to retrieve relevant documents?
- Therefore, IIR is viewed as a sub-area of HCI.

Slide 5: Relevance Feedback
- The same information need can produce different queries, different search results, and different relevance feedback.
- Dealing with users is difficult because the causes and consequences of interactions cannot be observed easily (they are in the user's head).
- Available observations: issuing a query, saving a document, providing relevance feedback.
- Everything else must be inferred from these observations.

Slide 6: Difficulties
- Each individual user has a different cognitive composition and behavioral disposition.
- Some interactions are neither easily observable nor easily measurable:
  - Motivation
  - How much the user knows about the topic
  - Expectations

Slide 7: IIR
- Using users to evaluate IR
- Different approaches:
  - Using users to evaluate the research results of a system (users are treated as black boxes)
  - Search log analysis (queries, search results, and click-through behavior)
  - TREC Interactive Track evaluation model (evaluating a system or interface)
  - General information search behavior in electronic environments (observing and documenting users' natural search behaviors and interactions)

Slide 8: Approaches

Slide 9: Research goals
- Set up a clear research goal:
  - Exploration: used when little is known about the subject. The focus is on learning about the subject rather than making predictions; research questions are open-ended and hypotheses are uncommon.
  - Description: documenting and describing a subject (e.g., query log or query behavior analysis) to provide benchmark descriptions and classifications. Results can be used to inform other studies.
  - Explanation: examining the relationship between two or more variables with the goal of prediction and explanation; seeks to establish causality.

Slide 10: Approaches
- Evaluations vs. experiments
  - Evaluation: assesses the goodness of a system, interface, or interaction technique.
  - Experiment: aims to understand behavior (similar to experiments in psychology or education); compares at least two things.
- Lab and naturalistic studies
  - Lab (more control) vs. naturalistic (less control)
- Longitudinal studies
  - Take place over an extended period of time, with measurements taken at fixed intervals.

Slide 11: Approaches
- Case studies
  - The intensive study of a small number of cases
  - A case may be a user, a system, or an organization.
  - Usually take place in naturalistic settings and involve some longitudinal elements.
  - The goal is not to generalize but to gain an in-depth view of a particular case.
- Wizard of Oz studies and simulations
  - Test a non-real or simulated system
  - Used for proof of concept
  - Provide an indication of what might happen in ideal circumstances
  - Wizard of Oz studies are simulations
  - Simulated users can represent different actions or steps a real user might take while interacting with an IR system

Slide 12: Research basics

Slide 13: Problems and Questions
- Identify and describe problems
- Provide a roadmap for research
- Examples of research questions:
  - Exploratory: How do people re-find information on the Web?
  - Descriptive: What Web browser functionalities are currently being used during Web-based information-seeking tasks?
  - Explanatory: What are the differences between written and spoken queries in terms of their retrieval characteristics and performance outcomes? What is the relationship between query box size and query length? What is the relationship between query length and performance?

Slide 14: Theory
- A theory is a system of logical principles that attempts to explain relations among natural, observable phenomena.
- A theory is abstract and general, and can generate more specific hypotheses.

Slide 15: Hypotheses
- Hypotheses state expected relationships between two or more variables.
- Alternative hypotheses vs. null hypotheses
  - Specific relationship vs. no relationship
- Hypotheses can be directional or non-directional.

Slide 16: Variables and measurement
- Variables represent concepts
- To analyze concepts:
  - Conceptualization
    - Defining concepts: provide a temporary definition and divide the concept into dimensions
  - Operationalization
    - How to measure the concept
- Direct and indirect observables
  - Directly observed: number of queries entered, amount of time spent searching
  - Indirectly observed: user satisfaction

Slide 17: Variables
- Independent: the causes
  - E.g., when examining differences in how males and females use an experimental and a baseline IIR system, sex is the independent variable.
- Dependent: the effects
  - E.g., satisfaction with, or performance of, the systems.
- Confounding variables
  - Affect the independent or dependent variable but have not been controlled by the researcher.
  - E.g., males may be more familiar with these systems than females.

Slide 18: Measurement
- Range of variation
  - Preciseness of the measure
  - E.g., the categories offered for how frequently a system is used
- Exhaustiveness
  - Complete list of choices
- Exclusiveness
  - Choices must not overlap. E.g., how would a subject differentiate "partially relevant" from "somewhat relevant" in a relevance rubric?
- Equivalence
  - Items should be of the same type and at the same level of specificity
  - Different scales: "I know details" = very familiar; "I know nothing" = very unfamiliar
- Appropriateness
  - The response scale must match the question. E.g., "How likely are you to recommend this system to others?" paired with a five-point scale running from "strongly agree" to "strongly disagree" does not match the question.

Slide 19: Level of Measurement
- Two basic levels of measurement: discrete vs. continuous
- Discrete measures: categorical responses
  - Nominal: no order
    - E.g., interface type, sex, task type
  - Ordinal: ordered
    - Rank order (from most relevant to least relevant) or Likert-type order (five-point scale with 1 = not relevant, 5 = relevant)
    - A relative measure: one subject's 2 may not represent the same thing internally as another subject's 2.
    - We cannot say that a document rated 4 is twice as relevant as a document rated 2, since the scale contains no true zero.

Slide 20: Level of Measurement
- Two basic levels of measurement: discrete vs. continuous
- Continuous measures: interval vs. ratio
  - Interval: differences between consecutive points are equal, but there is no true zero
    - E.g., Fahrenheit temperature scale, IQ test scores
    - Zero does not mean no heat or no intelligence
    - The differences between 50 and 80 and between 90 and 120 are the same
  - Ratio: the highest level of measurement, e.g., the number of occurrences
    - There is a true zero
    - E.g., time, number of pages viewed (zero is meaningful)

Slide 21: Experimental design

Slide 22
- The basic experimental design in IIR evaluation examines the relationship between two or more systems or interfaces (independent variable) and some set of outcome measures (dependent variables).

Slide 23: IIR design
- The general goal of IIR is to determine whether a particular system helps subjects find relevant documents.
- Developing a valid baseline in IIR evaluation involves identifying and blending the status quo and the experimental system.
- Random assignment can be used to increase the likelihood that subject characteristics are evenly distributed across groups.

Slide 24: Factorial Designs
- Good for studying the impact of more than one stimulus or variable
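
A factorial design crosses every level of each factor, and random assignment then spreads subjects across the resulting cells. The sketch below is only an illustration of those two ideas, assuming a hypothetical 2x2 between-subjects study. The factor names (`interface`, `task_type`), their levels, the subject IDs, and the helper functions are invented for the example and do not come from the slides.

```python
import itertools
import random


def factorial_conditions(factors):
    """Return every combination of factor levels (the cells of the design)."""
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in itertools.product(*(factors[name] for name in names))
    ]


def randomly_assign(subjects, conditions, seed=None):
    """Shuffle subjects, then deal them round-robin into the conditions,
    so that group sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = [[] for _ in conditions]
    for i, subject in enumerate(shuffled):
        groups[i % len(conditions)].append(subject)
    return groups


# Hypothetical factors and levels, chosen only for illustration.
factors = {
    "interface": ["baseline", "experimental"],
    "task_type": ["known-item", "exploratory"],
}
conditions = factorial_conditions(factors)          # 2 x 2 = 4 cells
subjects = [f"S{i:02d}" for i in range(1, 13)]      # 12 hypothetical subjects
groups = randomly_assign(subjects, conditions, seed=42)

for condition, members in zip(conditions, groups):
    print(condition, sorted(members))
```

Shuffling before dealing keeps the group sizes balanced while leaving the membership of each group to chance. This is one common way to implement random assignment, not necessarily the procedure the source has in mind.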

Slide 25: Rotation and counterbalancing
- The primary purpose of rotation and counterbalancing is to control for order effects and to increase the chance that results can be attributed to the experimental treatments and conditions.
- Rotating variables:
  - Latin square design
  - Graeco-Latin square design

Slide 26: Rotation and counterbalancing
Figure: A basic design with no rotation. Numbers in cells represent different topics.
- Cons:
  - Order effects
  - Some topics are easier than others, and some systems may do better with some topics than others.
  - Fatigue can impact the results

Slide 27: Latin Square rotation
Figure: Basic Latin Square rotation of topics
Figure: Basic Latin Square rotation of topics and randomization of columns
- Problems:
  - Interaction among topics
  - The order effects of interfaces still exist

Slide 28: Graeco-Latin Square Design
- Addresses the interface order effects that remain in the Latin square design above.
- A Graeco-Latin square is a combination of two or more Latin squares.

Slide 29: Graeco-Latin Square Design
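
The rotation schemes above can be sketched in a few lines of code. This is a minimal illustration, not the layout shown in the original figures, which did not survive extraction. It assumes a simple cyclic construction, hypothetical interface labels (`A`, `B`, `C`), and hypothetical topic numbers. Each row stands for one subject (or group of subjects) and each column for a position in the session.

```python
def latin_square(items):
    """Cyclic Latin square: row i is the item list rotated left by i positions,
    so every item appears once per row and once per column."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


def graeco_latin_square(first, second):
    """Superimpose two orthogonal Latin squares of odd order n.

    Cell (i, j) pairs first[(i + j) % n] with second[(2 * i + j) % n].
    This particular construction only works when n is odd.
    """
    n = len(first)
    if n != len(second) or n % 2 == 0:
        raise ValueError("needs two lists of the same odd length")
    return [
        [(first[(i + j) % n], second[(2 * i + j) % n]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical labels, chosen only for illustration.
topics = [1, 2, 3]
interfaces = ["A", "B", "C"]

for row in latin_square(topics):
    print(row)

for row in graeco_latin_square(interfaces, topics):
    print(row)
```

In the Graeco-Latin output every interface and every topic occurs once in each row and each column, and each interface-topic pair occurs exactly once overall, so neither variable is tied to a fixed position. Randomizing the order of the columns afterwards, as the Latin Square slide mentions, would be an additional step not shown here.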

Slide 30: Study mode
- Batch mode: multiple subjects complete the study at the same location and time.
- Single mode: subjects complete the study alone, with only the researcher present.
- The choice of mode is determined by the purpose of the study.
  - Single mode: suitable if each subject has to be interviewed, or if some interactive communication is needed between subject and researcher
  - Batch mode: self-contained and efficient (but subjects can influence each other)

Slide 31: Protocols
- A protocol is a step-by-step account of what will happen in a study.
- A protocol helps maintain the integrity of the study and ensures that subjects experience the study in similar ways.

Slide 32: Tutorials
- Provide some instruction on how to use a new IIR system
  - Printed materials
  - Verbal instructions
  - Video tutorial
- Try to avoid bias in the tutorial, such as focusing specially on one particular feature.

Slide 33: Pilot testing
- To estimate time
- To identify problems with instruments, instructions, and protocols
- To get detailed feedback from test subjects

Slide 34: Sampling

Slide 35: Sampling
- It is not possible to include all elements from a population in a study.
- The population in IIR evaluation is assumed to be all people who engage in online information search.
- The size of the sample: the more the better
- Two approaches to sampling: probability sampling and non-probability sampling

Slide 36: Probability Sampling
- Selecting a sample from a population that maintains the same variation and diversity that exists within the population.
- Representative sample: if a population is 60% male and 40% female, a representative sample would contain roughly the same ratio of males and females.
- Increases the generalizability of the results
- Assumes that all elements in t