evaluation of interactive machine learning systems › ial2019 › pdf ›...

EvaluationofInteractiveMachineLearningSystems

NadiaBoukhelifaIAL,September2019

EvaluationofInteractiveMachineLearningSystems

NadiaBoukhelifaIAL,September2019

visual

(IVMLs)

WhoamI?

3

NadiaBOUKHELIFA

Ph.D.2007—UniversityofKent,UKComputerScience,InformationVisualization

Post-doc—INRIA,TélécomParisTech,FranceVisualization,Human-ComputerInteraction

Researcher,Tenured,2016—INRA,FranceMulti-dimensionalDataVisualization,InteractiveModelling

INRA—NationalInstituteofAgriculturalResearch

4

NadiaBOUKHELIFA

INRA—NationalInstituteofAgriculturalResearch

5

NadiaBOUKHELIFA

http://www.cfsg.fr/site-de-grignon

�6

MALICESTeam:ModellingandKnowledgeIntegration

Expertiseformalisation

Uncertainty

Deterministicandstochasticmodelling

DecisionmakingOptimisation,visualisation

�7

Expertiseformalisation

Uncertainty

Deterministicandstochasticmodelling

DecisionmakingOptimisation,visualisation

NotexpertinAI

MALICESTeam:ModellingandKnowledgeIntegration

�8

Context:complexsystems,complexdatasets,complexmodels…

Azodyn-blé (Jeuffroy, Recous, Girard, Barré, Champeil, Meynard) modèle complet

Caractéristiques du sol (Clay%, CaCO3%, N%)

Données climatiques (température, pluie, Irr)

Nature du précédent

Amendements organiques (date, dose, forme)

Minéralisation nette de l�humus, des résidus et des amendements organiques (Mh, Mr, Ma)

Engrais minéral ou organique (date et dose)

Nmin dans sol

Vitesse de croissance la semaine avant apport

Indice de nutrition azotée (NNI)

Quantité de N aérien réel

Courbe de dilution maxi, Vmax

Climat (rayonnement, température)

Quantité maxi de N accumulé dans la culture

Biomasse aérienne

Rayonnement intercepté

Indice foliaire

RUE

NG/m²réel Quantité max de N dans les grains

Max Biomass in grains

Biomasse des grains = rendement

Quantité d�N dans les grains

Teneur en protéines

Acides aminés accumulés

N remobilisé

N dans organes végétatifs

Nmin dans sol à récolte climat (température)

CAU de l�engrais

NE/m²

climat avant floraison (température, rayonnement, déficit hydrique) NG/m² max

ENTREES

SORTIES

Jeuffroy, M-H., and Sylvie Recous. "Azodyn: a simple model simulating the date of nitrogen deficiency for decision support in wheat fertilization." European journal of Agronomy 10, no. 2 (1999): 129-144.

�9

Context:InteractiveModelExploration

Whyhumanintheloopinmodelling?

• tointegratevaluableexpertsknowledgethatmaybehardtoencodedirectlyintomathematicalorcomputationalmodels.

• tohelpresolveexistinguncertaintiesasaresultof,forexample,biasanderrorthatmayarisefromautomaticmachinelearning.

• tobuildtrustbymakinghumansinvolvedinthemodellingorlearningprocesses.

• tocaterforindividualhumandifferencesandsubjectiveassessmentssuchasinartandcreativeapplications.

�10

VisualisationOptimisationModel Simulations

Experimental System

Application!Multiple Computational Stages!

Multiple Actors!

Feedback

Interaction!Results !

11

Boukhelifa,Nadia,etal."AnExploratoryStudyonVisualExplorationofModelSimulationsbyMultipleTypesofExperts."Proceedingsofthe2019CHIConferenceonHumanFactorsinComputingSystems.ACM,2019.

OurapproachtoInteractiveModelExploration


Experimental System


Multiple Actors!

Feedback


12 Boukhelifa et al., CHI2019

Collaborativesetup



Experimental System


Multiple Actors!

Feedback


13

• Modelswrittenbythirdparties• Expertsspecialistsinpartsofthemodelledprocess

=>Co-locatedmultipleexpertise:domain,model,optimisation,visualization.


Collaborativesetup

Boukhelifa et al., CHI2019


Experimental System


Multiple Actors!

Feedback


Data & Models 14




Experimental System


Multiple Actors!

Feedback


Data & Models 15

Trade-offs




Experimental System


Multiple Actors!

Feedback


Data & Models Pareto Fronts 16

setofnon-dominatedcompromisepoints,wherenoobjectivecanbeimprovedwithoutsacrificingatleastoneotherobjective.

QU

AN

TITY

OF

ITEM

2

QUANTITY OF ITEM 1




Experimental System


Multiple Actors!

Feedback


Visualization SystemData & Models Pareto Fronts 17

QU

AN

TITY

OF

ITEM

2

QUANTITY OF ITEM 1


�18

http://www.seguetech.com/

a way to generate pretty images from data

Definition#1:Visualization

�19


Thepurposeofvisualizationisinsight,notpictures!


�20


Informa<onVisualisa<on

“Theuseofcomputer-supported,interactive,visualrepresentationsofabstractdatatoamplifycognition.”[Cardetal.,1999]

Thepurposeofvisualizationisinsight,notpictures!


Card,StuartK.,JockD.Mackinlay,andBenShneiderman."Usingvisiontothink."Readingsininformationvisualization.MorganKaufmannPublishersInc.,1999.Card,S.

WhyvisualiseyourdataRawDatafromAnscombe’sQuartet

�21

FrankAnscombe

StatisticalanalysisRawDatafromAnscombe’sQuartet

�22

StatisticalProperties

Meanofx 9.0

Varianceofx 11.0

Meanofy 7.5

Varianceofy 4.12

Correlationbetweenxandy 0.816

Linearregressionline y=3+0.5x

VisualrepresentationofthedataRawDatafromAnscombe’sQuartet

�23

VisualProperties

Samestats,differentgraphs!

Nevertrustsummarystatisticsalone;visualiseyourdata!

�24

DatasaurusDozenDataset,AutodeskResearchhttps://www.autodeskresearch.com/publications/samestat

Visualrepresentationofthedata

�25

Communicate visually

Explore interactively

Evaluate

InformationVisualization

�26

Definition#2:Human-ComputerInteraction

“Human-computer interaction is a discipline concerned with the design, evaluation and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them.”

Hewett,T.etal.ACMCurriculaforHuman-ComputerInteraction.1992;

“algorithms that can interact with both computational agents and human agents (in active learning: oracles) and can optimize their learning behavior through these interactions.”

�27

Definition#3:iML

“Users Are People, Not Oracles”

Takingintoaccounthumanfactors

Amershietal.,PowertothePeople:TheRoleofHumansinInteractiveMachineLearning,2014

AndreasHolzinger.Interactivemachinelearning(iml).InformatikSpektrum,39(1),2016.DanielKottke,InteractiveAdaptiveLearning,IntelligentEmbeddedSystems,UniversityofKassel,2018

�28

Definition#4:IVML

“In interactive visual machine learning (IVML), a human operator and a machine collaborate to achieve a task mediated by an interactive visual interface.”

Ourdefinition:

N.Boukhelifa,A.Bezerianos,andE.Lutton.EvaluationofInteractiveMachineLearningSystems.InHumanandMachineLearning,pp.341-360.Springer,Cham,2018.

�29

EvaluationofInteractiveSystems

✔Userperformance

✔Userexperience

✔Algorithmicperformance

Weknowhowtoassess:

�30

EvaluationofIVMLSystems

https://unsplash.com/photos/VBe9zj-JHBs

�31

OurexperiencewithEVE-EvolutionaryVisualExploration

N.Boukhelifa,A.Bezerianos,I-CTrelea,N.MPerrot,andELutton.AnExploratoryStudyonVisualExplorationofModelSimulationsbyMultipleTypesofExperts."ACMCHIConferenceonHumanFactorsinComputingSystems,2019


N.Boukhelifa,A.Bezerianos,W.CancinoandE.Lutton.EvolutionaryVisualExploration:EvaluationofanIECFrameworkforGuidedVisualSearch.EvolutionaryComputationJournal,MITpress,2017.

N.Boukhelifa,A.BezerianosandE.Lutton.AMixedApproachfortheEvaluationofaGuidedExploratoryVisualization System.EuroVisWorkshoponReproducibility,Verification,andValidationinVisualization(EuroRV3)2015.

W.Cancino,N.Boukhelifa,A.BezerianosandE.Lutton.EvolutionaryVisualExploration:ExperimentalAnalysisofAlgorithmBehaviour.GECCOworkshoponGeneticandEvolutionaryComputation(VizGEC2013).

N.Boukhelifa,W.Cancino,A.BezerianosandE.Lutton.EvolutionaryVisualExploration:EvaluationWithExpertUsers.ComputerGraphicsForum(EuroVis2013),EurographicsAssociation,2013,32(3).


N.Boukhelifa,A.Bezerianos,I-CTrelea,N.MPerrot,andELutton.AnExploratoryStudyonVisualExplorationofModelSimulationsbyMultipleTypesofExperts."ACMCHIConferenceonHumanFactorsinComputingSystems,2019


N.Boukhelifa,A.Bezerianos,W.CancinoandE.Lutton.EvolutionaryVisualExploration:EvaluationofanIECFrameworkforGuidedVisualSearch.EvolutionaryComputationJournal,MITpress,2017.

N.Boukhelifa,A.BezerianosandE.Lutton.AMixedApproachfortheEvaluationofaGuidedExploratoryVisualization System.EuroVisWorkshoponReproducibility,Verification,andValidationinVisualization(EuroRV3)2015.

W.Cancino,N.Boukhelifa,A.BezerianosandE.Lutton.EvolutionaryVisualExploration:ExperimentalAnalysisofAlgorithmBehaviour.GECCOworkshoponGeneticandEvolutionaryComputation(VizGEC2013).

�32


w. Cancino A. Bezerinaos E. Lutton

Our experience with EVE - Evolutionary Visual Exploration

�33

EvolutionaryVisualExploration

Needtoexplorecombineddimensionshttps://unsplash.com/photos/YTbFHT9_IhY

�34

Howtoexploreahugesearchspaceofpossiblecombined

dimensions?

�35

Howtoexploreahugesearchspaceofpossiblecombined

dimensions?

n-DdatasetInteresting2D projectionsEV

E IEAs

EvolutionaryVisualExploration

�36

WhyIEAs?

✔cancombineobjective&subjectivemeasures

✔supportexploration&exploitation

✔adapttouserchangeofinterest

Suitableforexploratoryvisualization:

EVEPrototype

�37

IEA

Bou

khel

ifa e

t al.,

Eur

oVis

201

3

ArtificialEvolution

NaturalEvolution EvolutionaryAlgorithms(EAs)

�39

Wikipedia

EVE:CreatingNewProjections

�40

Biscuit data (6D)

Top_heat, Bottom_heat

+,-,*,/, (.)(.), exp and log

offspring1: 0.2 * Top_heat + 0.7 * Bottom_heat

offspring2: Exp(Top_heat)

offspring3: 0.99 - (0.69 * log(Top_heat)) …

Favor linear patterns

Choose

offspringX

InteractiveEvolution(IEAs)

EvaluationofProjections

�41

User assessment ComplexitySurrogate function

Scagnostics, Wilkinson and Wills, 2008

learned

interactivecomputed

TheSearchSpace

TheSetofalldimensionsencodedastrees(GPframework,Koza92)

�42

initialdimensions,operators,constants

SubspaceExploration

�43

SubspaceExploration

�44

KeyfeaturesofEVE

�45

Intuitive

Adaptable

Interactive

Flexible

Isit?

EvaluatingEVE

46

Amixedapproach

�47

Algorithm centred evaluation

User centred Evaluation

Isthehuman-machinecooperationproducingtherightresults?

�48

Whenwehaveground-truth

Aim

• istheexplorationactuallysteeredtowardaninterestingareaofthesearchspace?

• aretheproposedsolutionsvaried?

Quantitativestudymethodology

• syntheticdataset

• pre-specifiedtask

�49

Evaluatinghuman-AIcollaboration

• 12participants• mean28.5years• noexperiencerequired• Syntheticdataset5D• 20minutes

Participants

�50

Gameseparatetwopointclusters

Task

�51

IEAisabletofollowtheorderofuserranking

Convergenceanalysis

�52

failure

failuresuccess

success

Interfaceevaluation-usestrategiesVisualpatternselectionstrategies

Keyfindings

Userstakedifferentsearchandevaluationstrategiesevenforasimpletask.

Onaveragethesurrogatefunctionfollowstheorderofuserrankingfairlyconsistently.

Linkbetweenuserevaluationstrategy,andoutcomeofexploration&speedofconvergence.

�55

Promisingresultsbut…

• Realworldsituationshavemorecomplexdatasetsandtasks.

• Usersarenotalwaysconsistentorgivedetailedfeedback.

• Wedonotalwayshavegroundtruth.

What happens when we do not have ground truth ?

�56

User-centredevaluation

Insightandusabilityevaluation

• areexpertsabletoconfirmoldknowledge?

• areexpertsabletogainnewinsight?

Qualitativestudymethodology

• thinkaloud,observe,interview&questionnaire

• videotapedandlogdatacapture

�57

• 5domainexperts

• mean34.2years

• owndatasets

• pre-questionnaire

• 2.5hours

Participants

�58

Training

T1:showinthetoolwhatyoualreadyknowaboutthedata

T2:explorethedatainlightofaresearchquestion

Tasks

�59

EVEResults

Hypothesisgeneration

�60

“thiscombinationmaybeanimportantfindingbecauseitinvolvesparametersthataffectonlyonepartofthe

simulationmodel...”

Cityemergencemodel

EVEResults

Hypothesisquantification

‘‘wealwaystalkaboutthisqualitatively.Thisisthefirst

timeIseeconcreteweights...’’

�61

Electricityconsumptionprofiles

EVEResults

�62

expertswereableto:• tryoutalternativescenarios• thinklaterally• quantifyaqualitativehypothesis• formulateanewhypothesisorrefineoldone• (domainvalue)

But…

�63

« On one hand, humans perform unpredictable and sophisticated reasoning; on the other, artificial solvers are technically complex and adopt solving strategies which are very different from those employed by humans.

Secondly, the environment from which the problem to be solved is drawn is usually uncontrollable and uncertain.

Together, these factors complicate the task of designing precise and effective evaluation studies. »

G. Cortellessa and A. Cesta, AAAI

EvaluationofIVMLisChallenging

Cortellessa,Gabriella,andAmedeoCesta."EvaluatingMixed-InitiativeSystems:AnExperimentalApproach."ICAPS.Vol.6.2006.

�64

EvaluationofIVMLisChallenging

Ch#1ComplexHumanFactors

Ch#2MultipleExpertise

Ch#3StochasticProcesses

Ch#4Co-adaptation

Ch#5Uncertainty

�65

Ch#1ComplexHumanFactors

“Users Are People, Not Oracles”

FrustratedBored

FatigueAnnoyed

Inconsistent

Reluctanttogivefeedback

Interruptibility

Amershi et al., Power to the People: The Role of Humans in Interactive Machine Learning , 2014

Bias

e.g.,Needbettertechniquestocaptureuserintent.

�66


ShouldIVMLhelpgroupsreachconsensus?Encouragemultiple

views?

Forus:buildcommongroundandselectbest

trade-offs

�67


ShouldIVMLhelpgroupsreachconsensus?Encouragemultiple

views?

Forus:buildcommongroundandselectbest

trade-offs

Weneedtocaterforindividualaswellascollective

exploration

Needsetupsthatcanswitchbetweenindividualand

collectivelearning

�68

• Riskofgettingstuckinlocaloptima?

�68

• Effectofstochasticityonuser’smentalmodeoftheML

Ch#3StochasticProcesses

Ch#3Co-adaptation

�69

“ Co-adaptive phenomena are defined as those in which the environment affects human behavior and at the same time, human behavior affects the environment. Such phenomena pose theoretical and methodological challenges and are difficult to study in traditional ways. ”

W. E. Mackay, Users and Customizable Software: A Co-Adaptive Phenomenon, 1990.

Ch#4Uncertainty

�70

Differentsourcesofuncertaintyarisingfrom:• Automatic

inferences• Analysts

reasoning

https://www.facebook.com/pedromics

�71

N.Boukhelifa,A.Bezerianos,andE.LuQon.Evalua<onofInterac<veMachineLearningSystems.InHumanandMachineLearning,pp.341-360.Springer,Cham,2018.

ReviewofCHI/VIZpublications2012-2017Keywordsearch:IML+Evaluation19papersfromdifferentdomains

Notanexhaustivesurvey!

IMLEvaluation-HCIStudies

�72



�73


HumanFeedback:

• implicit(7papers)

• explicit(8papers)

• mixed(4papers):• implicithumanfeedbackhelpsinferinformationtocomplementimplicithumanfeedback.


�74


SystemFeedback:

• toinformhumansaboutthestateofthemachinelearningalgorithm,andprovenanceofsystemsuggestions.

• canbevisual,progressive,andcanindicateuncertainty.

• mostsystemsgavefeedback.

• challenge:toinformwithoutoverwhelmingtheuser.


�75


Typesofstudy:

• 12/19studiesinvolvingusers(someforofcontrolledstudy)

• Difficulttoconduct:• potentialconfoundingfactors

• nogroundtruth


�76


EvaluationMetrics:

objective:• time,precision,re-call,#Insights

• iMLvs.baselineorvariantsofML;with/outsystemfeedback;explicitvs.implicit;impactofuserfeedback

• difficultyseparatingusabilityissuesfromtaskresults

•


�77


EvaluationMetrics:

subjective:• aspectsofuserexperience,e.g.,reported,easiness,speed,taskload,trust,andconfidence.

•


�78


EvaluationMetrics:

subjective:• aspectsofuserexperience,e.g.,reported,easiness,speed,taskload,trust,andconfidence.

•


“One sign of success of iML systems is when humans forget that they are feeding information to an algorithm, and rather focus on

synthesising information relevant to their task”. Endertetal.,2012[38]

GuidelinesforIVMLevaluation

�79

Weknowhowtoevaluateinterfaces&interactivesystems,butveryfewguidelinesexistspecificallyforiMLsystems!

JakobNielsen's10UsabilityHeuristicsforUserInterfaceDesign

G1:VisibilityofsystemstatusG2:MatchbetweensystemandtherealworldG3:UsercontrolandfreedomG4:ConsistencyandstandardsG5:ErrorpreventionG6:RecognitionratherthanrecallG7:FlexibilityandefficiencyofuseG8:AestheticandminimalistdesignG9:Helpusersrecognize,diagnose,&recoverfromerrorsG10:Helpanddocumentation


�80

G1Developingsignificantvalue-addedautomation.

G2Consideringuncertaintyaboutauser’sgoals.

G3Consideringthestatusofauser’sattentioninthetimingofservices.

G4Inferringidealactioninlightofcosts,benefits,anduncertainties.

G5Employingdialogtoresolvekeyuncertainties.

G6Allowingefficientdirectinvocationandtermination.

G7Minimizingthecostofpoorguessesaboutactionandtiming.

G8Scopingprecisionofservicetomatchuncertainty,variationingoals.

G9Providingmechanismsforefficientagent−usercollaborationtorefineresults.

G10Employingsociallyappropriatebehaviorsforagent−userinteraction.

G11Maintainingworkingmemoryofrecentinteractions.

G12Continuingtolearnbyobserving.

PrinciplesofMixed-InitiativeUserInterfaces-EricHorvitz,1999


�81

G1Makeclearwhatthesystemcando.

G2Makeclearhowwellthesystemcandowhatitcando.G3Timeservicesbasedoncontext.G4Showcontextuallyrelevantinformation.G5Matchrelevantsocialnorms.G6Mitigatesocialbiases.

G7Supportefcientinvocation.

G8Supportefcientdismissal.

G9Supportefcientcorrection.

G10Scopeserviceswhenindoubt.

G11Makeclearwhythesystemdidwhatitdid.G12Rememberrecentinteractions.G13Learnfromuserbehavior.G14Updateandadaptcautiously.G15Encouragegranularfeedback.

G16Conveytheconsequencesofuseractions.

G17Provideglobalcontrols.

G18Notifyusersaboutchanges.

Newguidelinesproposed,e.g.,Amershietal.,2019.

Noconclusions-researchquestions!

�82

Q1WhataspectsorcomponentsoftheIVMLsystemaremostimportanttoevaluate?

Q2Whattypesoftaskscanbedelegatedtomachinelearningandwhicharebestlefttohumans?

Q3WhoarethetargetusersofIVMLsystems?

Q4Whatistheroleofexpertiseinthiscontext?

Q5ShouldwehavedomainexpertstrainIVMLsystems?

Q6Whataretherisksandbenefitsofintroducinghumanexpertiseintothe(machine)learningprocess?

Q7Whatmetricsshouldweusetoevaluate?

Q8Canweestablishapplicationordomain-independentmetrics?

Q9DoweestablishdifferentevaluationsmeasuresfortheunderstandingofIVMLsystemsandfortheirperformance?

Q10Canweestablishbenchmarkdatasetsandtestuse-casestohelpevaluateIVMLsystems?

Q11Shouldweseekreplicationinthiscontext,andifso,howdowesupportreplicationofresultsinaco-learningoradaptiveenvironment?

Q12Howdowecommunicatetheevaluationresultstootherdisciplines?

eviva-ml.github.io

iML-eval:On-goingResearchTopic

�83

21 October, 2019

eviva-ml.github.io

�84

eviva-ml.github.ioEarly registration deadline 20/09

Organizing Committee

• Nadia Boukhelifa (INRA, FR)• Anastasia Bezerianos (Univ. Paris-Sud and INRIA, FR)• Enrico Bertini (NYU Tandon School of Engineering, USA)• Christopher Collins (Uni. of Ontario Institute of Technology, CA)• Steven Drucker (Microsoft Research, USA)• Alex Endert (Georgia Tech, USA)• Jessica Hullman (Northwestern University, USA)• Michael Sedlmair (University of Stuttgart, DE) • Remco Chang (Tufts University, USA)• Chris North (Virginia Tech, USA)

http://pfl.grignon.inra.fr/nb/

https://www.lri.fr/~anab/

http://enrico.bertini.io/

http://vialab.science.uoit.ca/portfolio/christopher-m-collins

https://www.microsoft.com/en-us/research/people/sdrucker/

http://va.gatech.edu/endert/

http://users.eecs.northwestern.edu/~jhullman/

https://homepage.univie.ac.at/michael.sedlmair/

http://www.cs.tufts.edu/~remco/

http://people.cs.vt.edu/north/

Noconclusions-researchquestions!

�85

Q1WhataspectsorcomponentsoftheIVMLsystemaremostimportanttoevaluate?

Q2Whattypesoftaskscanbedelegatedtomachinelearningandwhicharebestlefttohumans?

Q3WhoarethetargetusersofIVMLsystems?

Q4Whatistheroleofexpertiseinthiscontext?

Q5ShouldwehavedomainexpertstrainIVMLsystems?

Q6Whataretherisksandbenefitsofintroducinghumanexpertiseintothe(machine)learningprocess?

Q7Whatmetricsshouldweusetoevaluate?

Q8Canweestablishapplicationordomain-independentmetrics?

Q9DoweestablishdifferentevaluationsmeasuresfortheunderstandingofIVMLsystemsandfortheirperformance?

Q10Canweestablishbenchmarkdatasetsandtestuse-casestohelpevaluateIVMLsystems?

Q11Shouldweseekreplicationinthiscontext,andifso,howdowesupportreplicationofresultsinaco-learningoradaptiveenvironment?

Q12Howdowecommunicatetheevaluationresultstootherdisciplines?

eviva-ml.github.io

Danke!

evaluation of interactive machine learning systems › ial2019 › pdf ›...

Documents