While the hypothesis that it is possible to decode the observer's task from eye movements (Yarbus, 1967) has received some support (e.g., DeAngelus & Pelz, 2009; Henderson, Shinkareva, Wang, Luke, & Olejarczyk, 2013; Iqbal & Bailey, 2004), Greene, Liu, and Wolfe (2012) argued against it by reporting a failure.
In this study, we perform a more systematic investigation of this problem, probing a larger number of experimental factors than previously. Our main goal is to determine the informativeness of eye movements for task and mental state decoding.
Predicting observers' task from their scanpaths on natural scenes
Over all data
Over single images
Ali Borji and Laurent Itti {borji,itti}@usc.edu
University of Southern California, Computer Science, Psychology, and Neuroscience
(1) Introduction & Motivation
Supported by the National Science Foundation, the Army Research Office, and the U.S. Army
http://ilab.usc.edu/publications/doc/Borji_Itti14vss.pdf
(3) Experiment 1
(4) Experiment 2
(2) Related works
Conclusions & Discussion
1) Successful task decoding results provide further evidence that fixations convey diagnostic information regarding the observer's mental state and task, consistent with the cognitive relevance theory of attention (see Hayhoe & Ballard, 2005).
2) It is possible to reliably infer the observer's task from Greene et al.'s (2012) data using stronger classifiers. Classification was better when we treated images individually.
3) Is it always possible to decode task from eye movements? We argue that there is no general answer to this type of pattern-recognition question. Answers depend on the stimuli, observers, and questions used.
4) Here we followed the procedure by Greene et al. (2012) in which: (a) no observer viewed the same image twice and (b) the same scene was shown under multiple questions. The first rule aims to eliminate memory biases. The second rule ensures that the final result is not due to differences in stimuli.
5) Recently, Kanan et al. (2014) were also able to decode the task from eye movement patterns on Greene et al.'s data, with about the same performance as ours (> 33%).
6) Beyond its scientific value, decoding task from eye movements has practical applications such as wearable visual technologies (e.g., Google Glass) and patient diagnosis (e.g., ASD, ADHD, FASD, Parkinson's, and Alzheimer's).
[1] DeAngelus, M., & Pelz, J. B. (2009). Top-down control of eye movements: Yarbus revisited. Visual Cognition, 17, 790–811.
[2] Kanan, C., Ray, N., Bseiso, D. N. F., Hsiao, J. H., & Cottrell, G. W. (2014). Predicting an observer's task using multi-fixation pattern analysis. In Proceedings of the ACM Symposium on Eye Tracking Research and Applications (ETRA 2014).
[3] Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum.
[4] Iqbal, S., & Bailey, B. (2004). Using eye gaze patterns to identify user tasks. Proceedings of the Grace Hopper Celebration of Women in Computing, October 6–9, 2004.
[5] Henderson, J., Shinkareva, S., Wang, J., Luke, S., & Olejarczyk, J. (2013). Predicting cognitive state from eye movements. PLoS ONE, 8(5), e64937.
[6] Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1254–1259.
[7] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2010). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 40, 185–197.
[8] Borji, A., & Itti, L. (2014). Defending Yarbus: Eye movements reveal observers' task. Journal of Vision, 14(3):29.
[9] Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9, 188–194.
[Figure: Experiment 1 stimuli ordered from easiest to hardest, with per-image decoding accuracies: image 1 - 0.4906 [0.2500], 2 - 0.3741 [0.3088], 3 - 0.3518 [0.2537], 4 - 0.2812 [0.2279], 5 - 0.3094 [0.5331], 6 - 0.3565 [0.2610], 7 - 0.2694 [0.3492], 8 - 0.5118 [0.2941], 9 - 0.4835 [0.1728], 10 - 0.2800 [0.2206], 11 - 0.3977 [0.1765], 12 - 0.3965 [0.5845], 13 - 0.1859 [0.2280], 14 - 0.3623 [0.5956], 15 - 0.3765 [0.4118], 16 - 0.3600 [0.3382], 17 - 0.2259 [0.1765], 18 - 0.3270 [0.4154], 19 - 0.1318 [0.4118], 20 - 0.3670 [0.3235]; ITTI98 and GBVS saliency maps shown alongside]
[Figure: TASK and OBSERVER decoding accuracy; mean = 0.3420, chance level = 0.25]
[Figure: decoding accuracy vs. image number]
[Figure: classification accuracy for Feature Types 1-4 and All, using kNN and RUSBoost; chance level marked; * = significant]
[Figure: easiest and hardest stimuli in Experiment 2 (Tasks 1-7), with per-image decoding accuracies: image 1 - 0.1648 [0.1810], 2 - 0.3733 [0.3556], 3 - 0.3152 [0.3492], 4 - 0.2352 [0.2318], 5 - 0.2619 [0.2635], 6 - 0.2771 [0.2635], 7 - 0.1019 [0.1143], 8 - 0.2724 [0.2698], 9 - 0.2828 [0.2825], 10 - 0.4162 [0.3905], 11 - 0.1514 [0.1492], 12 - 0.3867 [0.3810], 13 - 0.2600 [0.2698], 14 - 0.3648 [0.3619], 15 - 0.2505 [0.2222]]
Yarbus’ experiment on Repin's painting
Greene et al.'s results
(5) Experiment 2 (cont'd)
TASKS:
1. Memorize the picture (memory).
2. Determine the decade in which the picture was taken (decade).
3. Determine how well the people in the picture know each other (people).
4. Determine the wealth of the people in the picture (wealth).
Decoding Accuracy:
- viewed image: 33% correct vs. chance = 1.5% (64 images)
- participant: 26% vs. chance = 6.3% (16 participants)
- task: 27.1% correct vs. chance = 25% (4 tasks) ?! n.s.
Decoding Accuracy [over all data, with Bonferroni correction]:
Using kNN and Feature Type 3, we achieved accuracy of 0.3118, which is above Greene et al.'s (2012) result and significantly better than chance (k = 8; p = 0.014).
Using RUSBoost and Feature Type 3, we achieved accuracy of 0.3412 (p = 1.07e-04).
Decoding Accuracy [over all data, with Bonferroni correction]:
We report results using a leave-one-out procedure. We have 21 observers, each viewing 15 images (five images under each of three questions per observer), resulting in 315 scanpaths. Using Feature Type 1, we achieved average accuracy of 0.2421, which is significantly above chance (binomial test, p = 2.4535e-06). Using Feature Type 2 (i.e., NSS histograms of nine saliency models, as in Experiment 1) results in accuracy of 0.2254 (p = 5.6044e-05).
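The binomial test above can be reproduced in a few lines; a minimal sketch (the success count is derived here from the reported average accuracy, so the exact tail probability may differ slightly from the poster's value):

```python
from math import comb

def binom_sf(k, n, p):
    """Exact one-sided binomial tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 315                  # 21 observers x 15 images, as above
chance = 1 / 7           # 7 possible tasks in Experiment 2
k = round(0.2421 * n)    # correct decodings implied by the reported accuracy
p_value = binom_sf(k, n, chance)
print(f"{k}/{n} correct, one-sided binomial p = {p_value:.3g}")
```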
A) Confusion matrix averaged over 15 images and 50 RUSBoost runs using Feature Type 1. We observe high accuracies for Task 3 (estimating age), Task 5 (remembering clothes), and Task 7 (estimating how long the visitor has been away), but low accuracy for the free-viewing task. There is high confusion between Task 2 and Tasks 6 and 1, and also between Task 1 and Task 7.
B) The easiest and hardest stimuli using Feature Type 1.
C) RUSBoost classification using Feature Type 2 in Experiment 2.
Decoding Accuracy [over single images, with Bonferroni correction]:
Each image was viewed under each question by three observers (no observer seeing an image under more than one question), resulting in 21 data points per image (i.e., 3 observers × 7 questions). Note that each set of three observers was assigned the same question. The RUSBoost classifier with Feature Type 1 yields average accuracy of 0.2724 over 50 runs and 15 images. Using the first two feature types (a 10,000 + 23 × 70 = 11,610-D vector) yields average performance of 0.2743. Over all runs (i.e., table rows), the minimum accuracy (averaged over all 15 images) is 0.2540 and the maximum is 0.3079. Note that our accuracies are almost two times the 14.29% chance level (i.e., 1/7).
Example stimuli for the A) scene memorization task, B) reading task, C) scene search task, and D) pseudo-reading task.
Decoding Accuracy: > 64% (chance = 25%) using multivariate pattern classification.
The features were the mean and standard deviation of fixation duration, the mean and standard deviation of saccade amplitude, the number of fixations per trial, and the three parameters μ, σ, and τ of an ex-Gaussian fit quantifying the shape of the fixation duration distribution.
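The μ, σ, and τ parameters can be recovered from a sample of fixation durations; a minimal method-of-moments sketch (Henderson et al. may fit the ex-Gaussian differently, e.g., by maximum likelihood; the function name is illustrative):

```python
import random
from statistics import mean

def exgauss_moments(durations):
    """Method-of-moments estimate of ex-Gaussian parameters (mu, sigma, tau)
    from fixation durations: tau from the third central moment, then
    sigma and mu by subtracting the exponential component."""
    m = mean(durations)
    n = len(durations)
    var = sum((d - m) ** 2 for d in durations) / n
    m3 = sum((d - m) ** 3 for d in durations) / n
    tau = max(m3 / 2, 0.0) ** (1 / 3)        # exponential (tail) component
    sigma = max(var - tau ** 2, 0.0) ** 0.5  # Gaussian spread
    mu = m - tau                             # Gaussian mean
    return mu, sigma, tau

# Synthetic check: durations (ms) from a known ex-Gaussian, mu=200, sigma=30, tau=100
random.seed(0)
sample = [random.gauss(200, 30) + random.expovariate(1 / 100) for _ in range(20000)]
mu, sigma, tau = exgauss_moments(sample)
```

On a large synthetic sample the estimates land close to the generating parameters, though σ is the least stable of the three (it is the residual after removing τ²).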
The features were: (1) number of fixations, (2) mean fixation duration, (3) mean saccade amplitude, (4) percent of image covered by fixations assuming a 1° fovea, and dwell-time on various regions of interest: (5) faces, (6) human bodies, and (7) objects.
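Features (1)-(4) can be computed directly from a fixation list; a hedged sketch (the function name, the fixation format, and the reading of "1° fovea" as a 1°-diameter disk are our assumptions; the ROI features 5-7 would additionally need face/body/object masks):

```python
import math

def scanpath_features(fixations, img_w_deg, img_h_deg, cells_per_deg=4):
    """First four Greene-style features from a non-empty list of fixations.
    Each fixation is (x_deg, y_deg, duration_ms). Coverage assumes a
    1-degree-diameter fovea stamped around each fixation."""
    n_fix = len(fixations)
    mean_dur = sum(f[2] for f in fixations) / n_fix
    amps = [math.hypot(b[0] - a[0], b[1] - a[1])
            for a, b in zip(fixations, fixations[1:])]
    mean_amp = sum(amps) / len(amps) if amps else 0.0
    # Percent of image covered: rasterize at cells_per_deg and mark fovea disks.
    covered = set()
    r = 0.5  # fovea radius in degrees
    for x, y, _ in fixations:
        for i in range(int((x - r) * cells_per_deg), int((x + r) * cells_per_deg) + 1):
            for j in range(int((y - r) * cells_per_deg), int((y + r) * cells_per_deg) + 1):
                cx, cy = (i + 0.5) / cells_per_deg, (j + 0.5) / cells_per_deg
                if 0 <= cx < img_w_deg and 0 <= cy < img_h_deg and math.hypot(cx - x, cy - y) <= r:
                    covered.add((i, j))
    coverage = len(covered) / (img_w_deg * img_h_deg * cells_per_deg ** 2)
    return n_fix, mean_dur, mean_amp, coverage

# Toy scanpath on a 43 x 25 degree display
fix = [(5.0, 5.0, 250), (8.0, 5.0, 180), (8.0, 9.0, 300)]
feats = scanpath_features(fix, img_w_deg=43, img_h_deg=25)
```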
Observers: 21 USC students (10 male, 11 female)
Visual angle: 43 × 25 degrees
Stimuli: 15 paintings (13 oil on canvas, some by Repin)
Viewing time: 30 seconds
Features:
Type 1: the smoothed fixation map, downsampled to 100 × 100 (10,000 D)
Type 2: histograms of normalized scanpath saliency (NSS) (9 × 70 = 630 D)
Type 3: the first 4 features of Greene et al. (4 D)
Type 4: <x, y> locations of the first five fixations (a 10 D vector)
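Feature Types 1 and 4 are straightforward to construct; a minimal sketch (simple binning stands in for the smoothing step of Type 1, and the function names are illustrative):

```python
def type1_fixation_map(fixations, img_w, img_h, grid=100):
    """Feature Type 1 (our reading): bin fixation (x, y) locations into a
    grid x grid map and flatten to a grid*grid-D vector, normalized to sum 1."""
    m = [0.0] * (grid * grid)
    for x, y in fixations:
        gx = min(int(x / img_w * grid), grid - 1)
        gy = min(int(y / img_h * grid), grid - 1)
        m[gy * grid + gx] += 1.0
    total = sum(m)
    return [v / total for v in m] if total else m

def type4_first_fixations(fixations, k=5):
    """Feature Type 4: <x, y> of the first k fixations, zero-padded (10-D for k=5)."""
    pts = [c for f in fixations[:k] for c in f]
    return pts + [0.0] * (2 * k - len(pts))

# Toy scanpath on an 800 x 600 pixel image
fixs = [(120, 80), (400, 300), (410, 310)]
v1 = type1_fixation_map(fixs, img_w=800, img_h=600)
v4 = type4_first_fixations(fixs)
```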
Classifiers:
- k-nearest-neighbor (kNN)
- RUSBoost (random undersampling boosting) algorithm (Seiffert, Khoshgoftaar, Van Hulse, & Napolitano, 2010)
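The kNN side of the decoding pipeline can be sketched with a leave-one-out loop (RUSBoost is not reproduced here; the function name and toy data are illustrative):

```python
import math
from collections import Counter

def knn_loo_accuracy(X, y, k=3):
    """Leave-one-out kNN decoding: each scanpath's feature vector is
    classified by majority vote of its k nearest neighbors (Euclidean
    distance) among all remaining scanpaths."""
    correct = 0
    for i, xi in enumerate(X):
        dists = sorted(
            (math.dist(xi, xj), y[j])
            for j, xj in enumerate(X) if j != i
        )
        vote = Counter(label for _, label in dists[:k]).most_common(1)[0][0]
        correct += (vote == y[i])
    return correct / len(X)

# Toy check: two well-separated "tasks" should decode perfectly
X = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
y = ["memory", "memory", "memory", "wealth", "wealth", "wealth"]
acc = knn_loo_accuracy(X, y, k=2)
```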
Henderson et al., PLOS ONE 2013
Greene et al., Vis. Res. 2012
[Figure: TASK and OBSERVER decoding panels over observers 1-21]
Confusion matrix of the RUSBoost classifier in Exp II (rows: ground truth; columns: predicted):

           T1     T2     T3     T4     T5     T6     T7
Task 1:   0.24   0.29   0.13   0.04   0.07   0.07   0.16
Task 2:   0.18   0.33   0.20   0.11   0.09   0.04   0.04
Task 3:   0.00   0.22   0.40   0.09   0.09   0.04   0.16
Task 4:   0.11   0.18   0.22   0.20   0.09   0.07   0.13
Task 5:   0.02   0.04   0.24   0.11   0.40   0.09   0.09
Task 6:   0.18   0.18   0.09   0.13   0.18   0.20   0.04
Task 7:   0.09   0.11   0.33   0.07   0.11   0.04   0.24
[Figure C: Feature importance (RUSBoost classification over all data in Exp II): classification accuracy per saliency model (AIM, AWS, GBVS, HouCVPR, HouNIPS, ITTI, ITTI-C, ITTI-CIO, ITTI-CIOLTXE, ITTI-E, ITTI-Entrop, ITTI-I, ITTI-L, ITTI-O, ITTI-OLTXE, ITTI-Scorr, ITTI-T, ITTI-Variance, ITTI-X, ITTI98, PQFT, SEO, SUN); chance level = 0.1429 (1/7)]