Can you trust what you see? The magic of visual perception
TRANSCRIPT
![Page 1: Can you trust what you see? The magic of visual perception](https://reader031.vdocuments.mx/reader031/viewer/2022030304/587854de1a28ab68198b7095/html5/thumbnails/1.jpg)
Can you trust what you see? The magic of visual perception
Oge Marques, PhD, Professor
College of Engineering and Computer Science, Florida Atlantic University – Boca Raton, FL (USA)
The Distinguished Speakers Program is made possible by
For additional information, please visit http://dsp.acm.org/
About ACM
ACM, the Association for Computing Machinery, is the world’s largest educational and scientific computing society, uniting educators, researchers, and professionals to inspire dialogue, share resources, and address the field’s challenges.
ACM strengthens the computing profession’s collective voice through strong leadership, promotion of the highest standards, and recognition of technical
excellence.
ACM supports the professional growth of its members by providing opportunities for life-long learning, career development, and professional
networking.
With over 100,000 members from over 100 countries, ACM works to advance computing as a science and a profession. www.acm.org
A man enters a room…
Source: https://www.youtube.com/watch?v=zNbF006Y5x4
Surprised?
• The video is called “Assumptions”
• The author (and actor) is British professional magician and the only Professor in the Public Understanding of Psychology (University of Hertfordshire), Richard Wiseman
• For more: http://richardwiseman.wordpress.com/
My background
• Oge Marques, PhD – Professor of Engineering and Computer Science at FAU
– Research focus: intelligent processing of visual information (a blend of image processing, computer vision, human vision, artificial intelligence, and machine learning).
– Ten years ago, I decided to study human vision and actively interact with researchers in the field.
– Here are some of the things I’ve learned along the way…
Facebook: https://www.facebook.com/ProfessorOgeMarques
Goals of this talk
• To explore together several visual perception phenomena that challenge our common knowledge of how well we make decisions based on the information that arrives at our brain through our eyes.
• To examine possible applications of human vision knowledge to the solution of computer vision research questions.
Visual illusions
• Serious vision research – “Errors of perception (phenomena of illusions) can be due to knowledge being inappropriate or being misapplied. So illusions are important for investigating cognitive processes of vision.” (Richard Gregory)
• Fun (party tricks) – “Tricks work only because magicians know, at an intuitive level, how we look at the world. […] Magicians were taking advantage of these cognitive illusions long before any scientist identified them.” (Stephen Macknik and Susana Martinez-Conde)
Speaking of fun tricks…
Source: https://www.youtube.com/watch?v=r6h02WuxmVY
Warm up
• What do you see?
Source: Frisby and Stone (2012)
Warm up
• Which circle is bigger?
Source: https://en.wikipedia.org/wiki/Ebbinghaus_illusion#/media/File:Mond-vergleich.svg
Warm up
• Which line is longer?
Source: https://s-media-cache-ak0.pinimg.com/originals/5a/a5/34/5aa534b42bf7c6cd61e1b710a360d056.gif
Warm up
• Which line is longer?
Warm up
• Which line is longer?
Source: http://ww2.justanswer.com/uploads/fael/2011-08-06_032529_ponzoillusionapplet.gif
Warm up
• Duck or rabbit?
Source: Frisby and Stone (2012)
Warm up
• Which dot is different from all the others?
What do we know about visual perception?
Not much compared to what we don’t know
Source: Barenholtz (2009)
Ignorance
Knowledge
State of the art
Things we DO know
• Visible light
Things we DO know
• Eye (retinal image)
Things we DO know
• Eye-to-brain path
Things we DO know
• Vision for ACTION vs. vision for RECOGNITION
What?
Where?
Example of what we DON’T know (yet)
• The moon seems larger when it is near the horizon than when it is high in the sky. Why?
• It fools the human brain, but cannot be captured in a photo.
• Many competing theories, no consensus.
Source: https://freethoughtblogs.com/singham/files/2014/02/moonrise-timelapse-over-la.jpg
The moon illusion
How scientists learn about human vision
• Patients with brain damage or eye conditions
• Direct access to the brain
– Single-cell recording
– Modern brain imaging and activity recording devices
• Controlled experiments
– Calibrated monitors and rooms
– Eye-tracking devices
– Psychophysics
Can you trust your brain?
• “Our brains are brilliant instruments, able to reason, synthesize, remember and imagine at an extraordinary pitch and rate. We trust them immediately and innately – and have reasons to be deeply proud of them too.
• However, these brains […] are also very subtly and dangerously flawed machines, flawed in ways that typically don’t announce themselves to us and therefore give us few clues as to how on guard we should be about our mental processes.”
(Alain de Botton, “The faulty walnut”)
Source: http://www.thebookoflife.org/the-faulty-walnut/
Sometimes we see what is not there
Source: Palmer (1999)
Count the black dots…
Source: http://www.slideshare.net/mrg3515/optical-illusions-8167051/3
Sometimes we only see half of the story
Source: Wikimedia Commons
Sometimes we must make a ‘best guess’
Source: http://www.slideshare.net/mrg3515/optical-illusions-8167051/3
Sometimes we even combine two or more illusions
Source: Goldstein (2002)
Sometimes we have trouble with (relative) brightness and contrast
Source: Wikimedia Commons
Sometimes we have trouble with (relative) brightness and contrast
Source: Wikimedia Commons
Sometimes we have trouble with color (constancy)
Source: http://www.lottolab.org/
“The dress”
• On Feb 26, 2015, this dress “broke the Internet”
– #whiteandgold
or
– #blackandblue?
“The dress”: a simplified explanation
• Most of the time, our visual system does a remarkable job of inferring the ambient lighting conditions at any given time and discounting their contribution to color computations.
• But in this image, the cues to the lighting conditions are particularly ambiguous.
• Is the light illuminating the dress bright and yellowish, or is it dim and blueish? Your brain has to make a guess.
Source: http://web.mit.edu/bcs/nklab/what_color_is_the_dress.shtml
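In computational terms, the “discounting the illuminant” step this slide describes is a white-balance problem. A minimal sketch of one classic heuristic, the gray-world assumption, is below; it is not from the talk, and the function name and toy data are hypothetical:

```python
import numpy as np

def gray_world_balance(image):
    """Estimate the illuminant as the per-channel mean (gray-world
    assumption) and divide it out, keeping overall brightness."""
    image = image.astype(np.float64)
    illuminant = image.reshape(-1, 3).mean(axis=0)   # per-channel estimate
    gain = illuminant.mean() / illuminant            # neutralize the color cast
    return np.clip(image * gain, 0, 255)

# A scene viewed under a bluish cast: blue channel uniformly inflated.
rng = np.random.default_rng(0)
scene = rng.uniform(50, 200, size=(8, 8, 3))
bluish = scene * np.array([0.8, 0.9, 1.3])
corrected = gray_world_balance(bluish)
```

After correction the three channel means coincide, i.e. the cast is gone; when the true illumination is ambiguous (as with the dress photo), such heuristics, like the brain, must simply guess.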
“The dress” meets the color cube
• An experiment by Rosa Lafer-Sousa (Kanwisher Lab, MIT) combined the dress with Beau Lotto’s color cube. Here are the results:
Source: http://web.mit.edu/bcs/nklab/what_color_is_the_dress.shtml
“The dress”
• But what color is it?
• Think the controversy is over? Think again!
Sometimes we miss seeing things…
…because they happen too fast
Sometimes we miss seeing things…
…because they happen too slowly
Source: O’Regan
Sometimes we miss seeing things…
…because something (else) flashes/flickers
Source: Skoda (https://www.youtube.com/watch?v=qpPYdMs97eE)
And sometimes interpretation changes with viewing distance
Source: Torralba & Oliva (2006)
Sometimes our prior knowledge gets in the way…
Source: Adelson (1995)
Sometimes our prior knowledge gets in the way…
Source: Adelson (1995)
Sometimes…
…our interpretation of an image depends on whether we are looking at its parts or taking it as a whole
Sometimes…
…the things that we struggle to see for the first time become surprisingly easy from the second time on
Source: Goldstein (2002)
Source: Goldstein (2002)
Sometimes we know we’re being fooled…
Source: YouTube
Ames room: explanation
Source: Goldstein (2002)
Sometimes we know that what we’re seeing is not what is there…
…but we still can’t help it.
Source: Gregory (2006)
Applications to multimedia research
• Computational modeling of visual attention
– Image retrieval
– Object detection
• Face recognition
– Game: Guess That Face
Visual Attention
We can only pay attention to part of the visual scene
Which part?
Source: Yarbus (1967)
We can only pay attention to part of the visual scene
Which part?
Source: Yarbus (1967)
We can only pay attention to part of the visual scene
• Contemporary computer models
Source: http://www.saliencytoolbox.net/
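Contemporary saliency models such as the Saliency Toolbox build on Itti and Koch’s center-surround architecture. A heavily simplified sketch of the center-surround idea on the intensity channel alone follows (the real models also use color and orientation channels and across-scale combination; the function name and toy scene are hypothetical):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_saliency(image, center_sigma=1.0, surround_sigma=8.0):
    """Crude saliency map: per-pixel contrast between a fine ('center')
    and a coarse ('surround') Gaussian-blurred intensity map."""
    intensity = image.mean(axis=2) if image.ndim == 3 else image
    center = gaussian_filter(intensity, center_sigma)
    surround = gaussian_filter(intensity, surround_sigma)
    saliency = np.abs(center - surround)
    return saliency / (saliency.max() + 1e-12)   # normalize to [0, 1]

# A dark scene with one bright blob: the blob should dominate the map.
scene = np.zeros((64, 64))
scene[30:34, 40:44] = 1.0
smap = center_surround_saliency(scene)
peak = np.unravel_index(smap.argmax(), smap.shape)
```

The peak of the map lands on the isolated bright region, mirroring how a locally distinct item pops out to human observers.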
Our work
Visual Attention + Image Retrieval
Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2007, Article ID 43450, 17 pages. doi:10.1155/2007/43450
Research Article: An Attention-Driven Model for Grouping Similar Images with Image Retrieval Applications
Oge Marques,1 Liam M. Mayron,1 Gustavo B. Borba,2 and Humberto R. Gamba2
1 Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431-0991, USA; 2 Programa de Pos-Graduacao em Engenharia Eletrica e Informatica Industrial, Universidade Tecnologica Federal do Parana (UTFPR), Curitiba, Parana 80230-901, Brazil
Received 1 December 2005; Revised 3 August 2006; Accepted 26 August 2006
Recommended by Gloria Menegaz
Recent work in the computational modeling of visual attention has demonstrated that a purely bottom-up approach to identifying salient regions within an image can be successfully applied to diverse and practical problems from target recognition to the placement of advertisement. This paper proposes an application of a combination of computational models of visual attention to the image retrieval problem. We demonstrate that certain shortcomings of existing content-based image retrieval solutions can be addressed by implementing a biologically motivated, unsupervised way of grouping together images whose salient regions of interest (ROIs) are perceptually similar regardless of the visual contents of other (less relevant) parts of the image. We propose a model in which only the salient regions of an image are encoded as ROIs whose features are then compared against previously seen ROIs and assigned cluster membership accordingly. Experimental results show that the proposed approach works well for several combinations of feature extraction techniques and clustering algorithms, suggesting a promising avenue for future improvements, such as the addition of a top-down component and the inclusion of a relevance feedback mechanism.
Copyright © 2007 Hindawi Publishing Corporation. All rights reserved.
1. INTRODUCTION
The dramatic growth in the amount of digital images available for consumption and the popularity of inexpensive hardware and software for acquiring, storing, and distributing images have fostered considerable research activity in the field of content-based image retrieval (CBIR) [1] during the past decade [2, 3]. Simply put, in a CBIR system users search the image repository providing information about the actual contents of the image, which is often done using another image as an example. A content-based search engine translates this information in some way as to query the database (based on previously extracted and stored indexes) and retrieve the candidates that are more likely to satisfy the user’s request.
In spite of the large number of related papers, prototypes, and several commercial solutions, the CBIR problem has not been satisfactorily solved. Some of the open problems include the gap between the image features that can be extracted using image processing algorithms and the semantic concepts to which they may be related (the well-known semantic gap problem [4–6], which can often be translated as “the discrepancy between the query a user ideally would and the one it actually could submit to an information retrieval system” [7]), the lack of widely adopted testbeds and benchmarks [8, 9], and the inflexibility and poor functionality of most existing user interfaces, to name just a few.
Some of the early CBIR solutions extract global features and index an image based on them. Other approaches take into account the fact that, in many cases, users are searching for regions or objects of interest as opposed to the entire picture. This has led to a number of proposed solutions that do not treat the image as a whole, but rather deal with portions (regions or blobs) within an image, such as [10, 11], or focus on objects of interest, instead [12]. The object-based approach for the image retrieval problem has grown to become an area of research referred to as object-based image retrieval (OBIR) in the literature [12–14].
Object- and region-based approaches usually must rely on image segmentation algorithms, which leads to a number of additional problems. More specifically, they must employ strong segmentation—“a division of the image data into regions in such a way that region T contains the pixels of the silhouette of object O in the real world and nothing else” [3], which is unlikely to succeed for broad image domains. A frequently used alternative to strong segmentation is weak segmentation, in which “region T is within bounds of object
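The pipeline named in the abstract – encode only the salient regions as ROIs, extract features from them, then assign cluster membership – can be sketched as follows. The mean-color descriptor and the toy k-means here are illustrative stand-ins, not the paper’s actual feature extractors or clustering algorithms:

```python
import numpy as np

def roi_feature(image, roi):
    """Hypothetical ROI descriptor: mean color inside the salient box."""
    r0, r1, c0, c1 = roi
    return image[r0:r1, c0:c1].reshape(-1, 3).mean(axis=0)

def kmeans(features, k, iters=20, seed=0):
    """Toy k-means assigning each ROI descriptor a cluster membership."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = features[labels == j].mean(axis=0)
    return labels

# Two reddish and two greenish 'salient regions' from four toy images.
images = [np.full((16, 16, 3), c, float) for c in
          ([200, 30, 30], [190, 40, 35], [20, 180, 40], [25, 170, 50])]
rois = [(4, 12, 4, 12)] * 4
feats = np.array([roi_feature(im, roi) for im, roi in zip(images, rois)])
labels = kmeans(feats, k=2)
```

The point of the design, as the abstract notes, is that cluster membership depends only on the salient region, so visually dissimilar backgrounds do not pull perceptually similar images apart.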
Our work
Visual attention + object detection (using a game)
Ask’nSeek: a new game for object detection and labeling
Axel Carlier1, Oge Marques2, and Vincent Charvillat1
1 IRIT-ENSEEIHT, University of Toulouse, France – {Axel.Carlier, Vincent.Charvillat}@enseeiht.fr
2 Florida Atlantic University, USA [email protected]
Abstract. This paper proposes a novel approach to detect and label objects within images and describes a two-player web-based guessing game – Ask’nSeek – that supports these tasks in a fun and interactive way. Ask’nSeek asks users to guess the location of a hidden region within an image with the help of semantic and topological clues. The information collected from game logs is combined with results from content analysis algorithms and used to feed a machine learning algorithm that outputs the outline of the most relevant regions within the image and their names. Two noteworthy aspects of the proposed game are: (i) it solves two computer vision problems – object detection and labeling – in a single game; and (ii) it learns spatial relations within the image from game logs. The game has been evaluated through user studies, which confirmed that it was easy to understand, intuitive, and fun to play.
1 Introduction
There are many open problems in computer vision (e.g., object detection) for which state-of-the-art solutions still fall short of performing perfectly. The realization that many of those tasks are arduous for computers and yet relatively easy for humans has inspired many researchers to approach those problems from a ‘human computation’ viewpoint, using methods that include crowdsourcing (“a way of solving problems based on a large number of small contributions from a large number of different persons”) and games – often called, more specifically, “games with a purpose (GWAPs)” [1].
In this paper we propose a novel approach to solving a subset of computer vision problems – namely object detection and labeling³ – using games and describe Ask’nSeek, a two-player web-based guessing game targeted at the tasks of object detection and labeling. Ask’nSeek asks users to guess the location of a small rectangular region hidden within an image with the help of semantic and topological clues (e.g., “to the right of the bus”), by clicking on the image location which they believe corresponds to (one of the points of) the hidden region. Once enough games have been played using a given image, our novel machine learning algorithm combines user-provided input (coordinates of clicked points and spatial relationships between points and regions – ‘above’, ‘below’, ‘left’, ‘right’, ‘on’, ‘partially on’, or ‘none’) with results from off-the-shelf computer vision algorithms applied to the image, to produce the outline (bounding box) of the most relevant regions within the image and their associated labels. These
³ In this paper we use the phrase object labeling to refer to the process of assigning a textual label to an object’s bounding box.
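The way Ask’nSeek’s logs constrain the hidden region can be illustrated with a toy aggregator. The paper combines game logs with content-analysis results via machine learning; the rule-based intersection below is only a simplified stand-in, and every name and coordinate in it is hypothetical:

```python
def estimate_region(on_points, relations, width, height):
    """Intersect spatial-relation constraints from game logs to bound
    the hidden region, then tighten using the 'on' clicks. A crude
    stand-in for the paper's learning-based combination."""
    x0, y0, x1, y1 = 0, 0, width, height
    for rel, (px, py) in relations:
        if rel == 'right':      # region lies right of the clicked point
            x0 = max(x0, px)
        elif rel == 'left':     # region lies left of the clicked point
            x1 = min(x1, px)
        elif rel == 'below':    # region lies below it (y grows downward)
            y0 = max(y0, py)
        elif rel == 'above':
            y1 = min(y1, py)
    if on_points:               # clicks reported to land on the region
        xs, ys = zip(*on_points)
        x0, x1 = max(x0, min(xs)), min(x1, max(xs))
        y0, y1 = max(y0, min(ys)), min(y1, max(ys))
    return x0, y0, x1, y1

# Toy logs for a hidden region near (60..80, 40..50) in a 100x100 image.
box = estimate_region(
    on_points=[(62, 42), (78, 48)],
    relations=[('right', (55, 45)), ('below', (70, 35))],
    width=100, height=100)
```

Each additional game log shrinks the feasible box, which is why the paper waits until “enough games have been played” before producing an outline.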
Face Recognition
We seem to be particularly good at recognizing famous/familiar faces even when they’re blurry
even though the effective resolution in that region is very limited. Recognition performance changes only slightly after obscuring the gait or body, but is affected dramatically when the face is hidden, as illustrated in Fig. 2. This does not appear to be a skill that can be acquired through general experience; even police officers with extensive forensic experience perform poorly unless they are familiar with the target individuals. The fundamental question this finding, and others like it [49], [66], bring up is the following: How does the facial representation and matching strategy used by the visual system change with increasing familiarity, so as to yield greater tolerance to degradations? We do not yet know exactly what aspect of the increased experience with a given individual leads to an increase in the robustness of the encoding; is it the greater number of views seen or is the robustness an epiphenomenon related to some biological limitations such as slow memory consolidation rates? Notwithstanding our limited understanding, some implications for computer vision are already evident. In considering which aspects of human performance to take as benchmarks, we ought to draw a distinction between familiar and unfamiliar face recognition. The latter may end up being a much more modest goal than the former and might constitute a false goal towards which to strive. The appropriate benchmark for evaluating machine-based face recognition systems is human performance with familiar faces.
3) Result 3: High-Frequency Information by Itself Does Not Lead to Good Face Recognition Performance: We have long been enamored of edge maps as a powerful initial representation for visual inputs. The belief is that edges capture the most important aspects of images (the discontinuities) while being largely invariant to shallow shading gradients that are often the result of illumination variations. In the context of human vision as well, line drawings appear to be sufficient for recognition purposes. Caricatures and quick pen portraits are often highly recognizable. Do these observations mean that high spatial frequencies are critical, or at least sufficient, for face recognition? Several researchers have examined the contribution of different spatial frequency bands to face recognition [14], [21]. Their findings suggest that high spatial frequencies might not be too important for face perception. In the particular domain of line drawings, Graham Davies and his colleagues have reported [16] that images which contain exclusively contour information are very difficult to recognize (specifically, they found that subjects could recognize only 47% of the line drawings compared to 90% of the original photographs; see Fig. 3). How can we reconcile such findings with the observed recognizability of line drawings in everyday experience? Bruce and colleagues [6], [7] have convincingly argued that such depictions do, in fact, contain significant photometric cues and that the contours included in such a depiction by an accomplished artist correspond not just to a low-level edge map, but in
Fig. 2. Frames from video sequences used in Burton et al. [10] study. (a) Original input. (b) Body obscured. (c) Face obscured. Based on results from such manipulations, researchers concluded that recognition of familiar individuals in low-resolution video is based largely on facial information.
Fig. 1. Unlike current machine-based systems, human observers are able to handle significant degradations in face images. For instance, subjects are able to recognize more than half of all familiar faces shown to them at the resolution depicted here. Individuals shown in order are: Michael Jordan, Woody Allen, Goldie Hawn, Bill Clinton, Tom Hanks, Saddam Hussein, Elvis Presley, Jay Leno, Dustin Hoffman, Prince Charles, Cher, and Richard Nixon.
Sinha et al., “Face Recognition by Humans: Nineteen Results Researchers Should Know About,” Proceedings of the IEEE, Vol. 94, No. 11, November 2006, p. 1950.
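The spatial-frequency studies cited in this excerpt work by filtering face images into frequency bands. A minimal sketch of such a band split using a Gaussian low-pass filter is shown below; it is illustrative only (the cited studies used carefully calibrated face stimuli, not a random array):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_frequency_bands(image, sigma=3.0):
    """Split an image into a low-spatial-frequency band (what survives
    blurring) and the high-frequency residual (edge-like detail)."""
    low = gaussian_filter(image.astype(np.float64), sigma)
    high = image - low
    return low, high

rng = np.random.default_rng(1)
face_like = rng.uniform(0, 255, size=(32, 32))  # stand-in for a face image
low, high = split_frequency_bands(face_like)
```

The two bands sum back to the original, so experiments can present observers with either band in isolation; the finding above is that familiar-face recognition survives surprisingly well on the low band alone.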
Our work
Our work
Let’s play the game!
http://tinyurl.com/guessthatface
Going back to our original question…
Can you trust what you see?
Source: Torralba (MIT)
Source: Torralba (MIT)
Thank you!
• Check my Facebook page for related resources
– https://www.facebook.com/ProfessorOgeMarques