Modeling visual search in a thousand scenes (kehinger.com/pdf/vss2009_ehinger_slides.pdf)
TRANSCRIPT
Modeling visual search in a thousand scenes: The roles of saliency, target features, and scene context
Krista Ehinger, Barbara Hidalgo‐Sotelo, Antonio Torralba, & Aude Oliva
VSS 2009, 5/10/09 Modeling Visual Search in 1000 Scenes
Overview of experiment
• 14 participants, 912 images = 45,144 fixations
• Person present/absent?
Figure: Fixations 1-3 on target-absent scenes
Overview of Model
Local and global image features
Combination of sources of guidance
Scene context
Target features
Saliency
Human Agreement
• Inter‐observer agreement = upper bound for model performance
• Cross‐image control = lower bound for model performance
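The slides do not spell out how the two bounds are computed. A minimal sketch, assuming the upper bound uses a smoothed map of the other observers' fixations on the same image and the lower bound uses a map of fixations made on other images, might look like this. The function names, the dictionary layout, the smoothing width `sigma`, and the assumption that all images share one size are illustrative, not taken from the talk.

```python
import numpy as np

def fixation_map(fixations, shape, sigma=2.0):
    """Gaussian-smoothed fixation count map (sigma in pixels is an assumption)."""
    counts = np.zeros(shape, dtype=float)
    for r, c in fixations:
        counts[int(r), int(c)] += 1.0
    # Keep the kernel no longer than the smallest image dimension.
    radius = min(int(3 * sigma), (min(shape) - 1) // 2)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x.astype(float) ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    # Separable blur: along columns first, then along rows.
    blur = lambda v: np.convolve(v, kernel, mode="same")
    out = np.apply_along_axis(blur, 0, counts)
    return np.apply_along_axis(blur, 1, out)

def leave_one_out_map(fix_by_observer, held_out, shape):
    """Map from every observer except `held_out` (human-agreement style bound)."""
    others = [f for obs, fixes in fix_by_observer.items()
              if obs != held_out for f in fixes]
    return fixation_map(others, shape)

def cross_image_map(fix_by_image, target_image, shape):
    """Map from fixations on all other images (cross-image style control)."""
    others = [f for img, fixes in fix_by_image.items()
              if img != target_image for f in fixes]
    return fixation_map(others, shape)
```

Each map would then be scored against the held-out fixations with the same ROC analysis used for the models, so that bounds and models are directly comparable.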
The ROC curve
Cog Lunch 10/21/2008
Model
Selected image regions
ROC curve
AUC
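The ROC analysis outlined on these slides can be sketched in code. This is a minimal sketch, assuming the model output is a 2-D map of predicted fixation likelihood, that "selected image regions" are the top fraction of pixels by map value, and that the false alarm rate is the fraction of the image selected. The names `roc_curve` and `auc` and the threshold grid are illustrative, not taken from the talk.

```python
import numpy as np

def roc_curve(pred_map, fixations, n_steps=101):
    """Return (false_alarm, detection) arrays for one image.

    pred_map  : 2-D array, higher values = more likely to be fixated.
    fixations : sequence of (row, col) pixel coordinates of human fixations.
    """
    pred_map = np.asarray(pred_map, dtype=float)
    sorted_vals = np.sort(pred_map.ravel())
    n = sorted_vals.size
    fix = np.asarray(fixations, dtype=int)
    fix_vals = pred_map[fix[:, 0], fix[:, 1]]
    # Fraction of the image that must be selected before each fixation is
    # covered; pixels tied with the fixation's value count half (mid-rank).
    n_le = np.searchsorted(sorted_vals, fix_vals, side="right")
    n_lt = np.searchsorted(sorted_vals, fix_vals, side="left")
    needed = ((n - n_le) + 0.5 * (n_le - n_lt)) / n
    false_alarm = np.linspace(0.0, 1.0, n_steps)
    detection = np.array([(needed <= f).mean() for f in false_alarm])
    return false_alarm, detection

def auc(false_alarm, detection):
    """Area under the ROC curve by the trapezoid rule."""
    widths = np.diff(false_alarm)
    heights = (detection[1:] + detection[:-1]) / 2.0
    return float(np.sum(widths * heights))
```

Under these assumptions a map that ranks fixated pixels highest gives an AUC near 1, and a flat map gives an AUC near 0.5.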
Human Agreement
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
Human agreement examples
High inter-observer agreement
Low inter-observer agreement
Guidance by Saliency
Saliency Model
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
• Saliency Model: AUC = 0.77
Saliency Model: Examples
Best performance
AUC = 0.94
Worst performance
AUC = 0.36
Pedestrian Detector
• Histograms of Oriented Gradients (HOG) detector by Dalal & Triggs (Dalal & Triggs, 2005 CVPR)
Figure panels: average gradient, positive features, negative features
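To illustrate the kind of feature a HOG detector builds on, here is a simplified sketch of per-cell gradient orientation histograms. It assumes 8x8-pixel cells and 9 unsigned orientation bins, uses hard binning, and leaves out block normalization and the classifier that a full detector needs. The function name is hypothetical and this is not the detector used in the talk.

```python
import numpy as np

def cell_orientation_histograms(image, cell=8, bins=9):
    """Magnitude-weighted histograms of gradient orientation, one per cell."""
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)            # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees.
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_idx = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)
    n_rows, n_cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((n_rows, n_cols, bins))
    for i in range(n_rows):
        for j in range(n_cols):
            window = (slice(i * cell, (i + 1) * cell),
                      slice(j * cell, (j + 1) * cell))
            hist[i, j] = np.bincount(bin_idx[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    return hist
```

A detector would slide a window over these histograms and score each position, which yields a map of where person-like features appear in the image.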
Guidance by Target Features
Target Features Model
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
• Target Features Model: AUC = 0.78
Target Features Model: Examples
Best performance
AUC = 0.95
Worst performance
AUC = 0.50
What is the context region for pedestrians?
Training image (contains a pedestrian)
Scene Context Model
Oliva & Torralba, 2001
Orientations at various spatial scales
Scene “gist” + position of pedestrian
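One way to picture learning a context region from scene gist plus pedestrian position is a regression from a gist vector to the expected vertical position of a person. The sketch below uses plain ridge regression as a stand-in, since the slides do not name the learner. It assumes gist vectors are already computed, positions are normalized to [0, 1] from top to bottom, and the context region is a horizontal band. All names and parameters are hypothetical.

```python
import numpy as np

def fit_context_regressor(gist, target_y, ridge=1e-3):
    """Fit weights mapping gist vectors (n x d) to normalized vertical position."""
    X = np.hstack([gist, np.ones((gist.shape[0], 1))])   # append a bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ target_y)

def context_map(weights, gist_vec, shape, band_sigma=0.1):
    """Horizontal band centered on the predicted vertical position."""
    y_hat = float(np.append(gist_vec, 1.0) @ weights)
    rows = (np.arange(shape[0]) + 0.5) / shape[0]        # row centers in [0, 1]
    band = np.exp(-(rows - y_hat) ** 2 / (2 * band_sigma ** 2))
    return np.tile(band[:, None], (1, shape[1]))
```

The resulting map is constant along each row, which reflects the assumption that context mainly constrains the height at which people appear in a scene.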
Guidance by Scene Context
Scene Context Model
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
• Scene Context Model: AUC = 0.85
Scene Context Model: Examples
Best performance
AUC = 0.95
Worst performance
AUC = 0.27
Combined Sources of Guidance
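The slides do not state the rule used to combine the three maps. As one plausible illustration, the sketch below assumes a weighted product of the saliency, target-feature, and context maps, computed in log space. The function name, the rescaling step, and the default weights are hypothetical and are not the values used in the talk.

```python
import numpy as np

def combine_maps(saliency, target, context, weights=(1.0, 1.0, 1.0), eps=1e-12):
    """Weighted product of three guidance maps, normalized to sum to 1."""
    maps = [np.asarray(m, dtype=float) for m in (saliency, target, context)]
    log_sum = np.zeros_like(maps[0])
    for m, w in zip(maps, weights):
        m = (m - m.min()) / (m.max() - m.min() + eps)   # rescale to [0, 1]
        log_sum += w * np.log(m + eps)                  # w acts as an exponent
    combined = np.exp(log_sum)
    return combined / combined.sum()
```

With a product rule, a location needs support from every weighted source to score highly. Setting a weight to zero removes that source from the combination.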
Combined Model
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
• Combined Model: AUC = 0.88
Combined Model: Examples
Best performance
AUC = 0.94
Worst performance
AUC = 0.36
Target Absent vs. Target Present
ROC plots (x-axis: false alarm rate; y-axis: fixation detection rate), one panel each for Target Absent Scenes and Target Present Scenes
Summary of results
• Combined model accounts for 94% of human agreement in search fixations
• Scene context gives the best prediction of human search fixations in this task
• How to get that last 6%?
Overview of Model
Local and global image features
Combination of sources of guidance
Scene context
Target features
Saliency
Context “oracle”
“Context Oracle” Implementation
Context Oracle, AUC = 0.90
Context Model, AUC = 0.67
Context Oracle
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
• Combined Model: AUC = 0.88
• Context Oracle: AUC = 0.88
Combined Model with Oracle
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
• Combined Model (with oracle): AUC = 0.89
• Combined Model (computational): AUC = 0.88
Summary of results
• Combined model accounts for 94% of human agreement in search fixations
• Context predicts human fixations better than saliency or target features in this search task
• How to get that last 6%?
  – Context “oracle”?
    • Improves performance to 95% of human agreement
  – Something else?
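The percentages in the summary can be related to the AUC values shown on the ROC slides. The sketch below assumes that "percent of human agreement" means the model AUC divided by the human-agreement AUC. The slides do not state this definition, and the quoted percentages may come from unrounded AUC values.

```python
def fraction_of_human_agreement(model_auc, human_auc=0.93):
    """Model AUC as a fraction of the inter-observer AUC (assumed definition)."""
    return model_auc / human_auc

print(f"{fraction_of_human_agreement(0.88):.3f}")  # combined model: 0.946
print(f"{fraction_of_human_agreement(0.89):.3f}")  # with context oracle: 0.957
```

With the rounded AUCs from the slides, the ratios come out slightly above the quoted 94% and 95%, which is consistent with those figures being truncated or computed from more precise values.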
What’s next?
ROC plot (x-axis: false alarm rate; y-axis: fixation detection rate)
• Human Agreement: AUC = 0.93
• Cross-Image Control: AUC = 0.68
• Combined Model: AUC = 0.88
Acknowledgements
Ehinger, K. A., Hidalgo‐Sotelo, B., Torralba, A. & Oliva, A. Modeling Search for People in 900 Scenes: A combined source model of eye guidance. Visual Cognition, in press.
Barbara Hidalgo‐Sotelo, Aude Oliva, Antonio Torralba
Funded by a Singleton graduate research fellowship to KE, an NSF graduate research fellowship to BHS, and NSF CAREER awards to AT and AO.