Invariant recognition drives neural representations of action sequences

Andrea Tacchetti*, Leyla Isik* and Tomaso Poggio
Center for Brains, Minds and Machines, MIT
*denotes equal contribution
Abstract

Recognizing the actions of others from visual stimuli is a crucial aspect of human visual perception that allows individuals to respond to social cues. Humans are able to identify similar behaviors and discriminate between distinct actions despite transformations, like changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding motion perception at the neural level have not always translated into precise accounts of the computational principles underlying what representation our visual cortex evolved or learned to compute. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human-level performance in complex discriminative tasks. Within this class of models, architectures that better support invariant object recognition also produce image representations that match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations remains unknown. Here we show that spatiotemporal CNNs appropriately categorize video stimuli into actions, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. Our results support our hypothesis that performance on invariant discrimination dictates the neural representations of actions computed by human visual cortex. Moreover, these results broaden the scope of the invariant recognition framework for understanding visual intelligence from perception of inanimate objects and faces in static images to the study of human perception of action sequences.
Author Summary

Recognizing the actions of others from video sequences across changes in viewpoint, actor, gait or illumination is a hallmark of human visual intelligence. A large number of studies have highlighted which areas in the human brain are involved in the processing of biological motion, and many more have described how single neurons behave in response to videos of human actions. However, little is known about the computational necessities that shaped these neural mechanisms either through evolution or experience. In this paper, we test the hypothesis that this computational goal is the discrimination of complex video stimuli according to their action content. We show that, within the class of Spatiotemporal Convolutional Neural Networks (ST-CNNs), deliberate model modifications that lead to representations of videos that better support robust action discrimination also produce representations that better match human neural data. Because modifying a naïve model to increase its performance in invariant recognition resulted in a representation similar to human neural data, these results suggest that, similarly to what is known for object recognition, a process of performance optimization for invariant discrimination, constrained to a hierarchical ST-CNN-like architecture, shaped the neural mechanisms underlying our ability to perceive the actions of others.
Introduction

Humans' ability to recognize the actions of others is a crucial aspect of visual perception. Remarkably, the accuracy with which we can finely discern what others are doing is largely unaffected by transformations that, while substantially changing the visual appearance of a given scene, do not change the semantics of what we observe (e.g. a change in viewpoint). Recognizing actions, the middle ground between action primitives and activities [1], across these transformations is a hallmark of human visual intelligence which has proven elusive to replicate in artificial systems. Because of this, invariance to transformations that are orthogonal to a learning task has been the subject of extensive theoretical and empirical investigation in both artificial and biological perception and recognition [2,3].
Recent studies have described the neural processing mechanisms underlying our ability to recognize the actions of others from visual stimuli. Specific brain areas have been implicated in the processing of biological motion and the perception of actions. For example, in humans and other primates, the Superior Temporal Sulcus, and particularly its posterior portion, is believed to participate in the processing of biological motion and actions [4–11]. In addition to studying entire brain regions that engage during action recognition, a number of studies have characterized the responses of single neurons. The preferred stimuli of neurons in visual areas V1 and MT are well approximated by moving edge-detection filters and energy-based pooling mechanisms [12,13]. Neurons in the STS region of macaque monkeys respond selectively to actions, are invariant to changes in actors and viewpoint [14], and their tuning curves are well modeled by simple snippet-matching models [15]. More broadly, the neural computations underlying action recognition in visual cortex are organized as a hierarchical succession of spatiotemporal feature detectors of increasing size and complexity [16,17]. Despite precise characterization of the organizational, regional and single-unit data involved in the processing of biological motion, little is known about what computational tasks might be relevant to explaining and recapitulating its computations.
Fueled by advances in computer vision methods for object and scene categorization, recent studies have made progress towards linking computational outcomes to neural signals through quantitatively accurate models of single neurons in Inferior Temporal cortex (IT) [18]. In addition, these studies have highlighted a correlation between performance optimization on discriminative object recognition tasks and the accuracy of neural predictions, both at the single recording site and neural representation level [18–21]. However, these results have not been extended to action perception and dynamic stimuli. What specific computational goals relate to the biological structure that underlies our ability to recognize the actions of others, and in particular to the representations of action sequences human visual cortex learned, or evolved, to compute, remains unknown. Here we test our hypothesis that invariant recognition might fill this gap.
We use artificial systems for action recognition and compare their data representations to human Magnetoencephalography (MEG) recordings [22]. We show that, within the Spatiotemporal Convolutional Neural Networks model class [16,17,23,24], deliberate modifications that result in better performing models on invariant action recognition tasks also lead to empirical dissimilarity matrices that better match those obtained from human neural recordings. Our results suggest that performance optimization on discriminative tasks, especially those that require generalization across complex transformations, alongside the constraints imposed by the hierarchical organization of motion processing in visual cortex, determined the representation of action sequences computed by visual cortex. Moreover, by highlighting the role of robustness to nuisances that are orthogonal to the discrimination task, our results extend the scope of invariant recognition as a computational framework for understanding human visual intelligence to the study of action recognition from video sequences.
Results

Action discrimination with Spatiotemporal Convolutional representations

We filmed a video dataset showing five actors performing five actions (drink, eat, jump, run and walk) at five different viewpoints [Figure 1]. We then developed four variants of feedforward hierarchical models of visual cortex and used them to extract feature representations of videos showing two different viewpoints, frontal and side. Subsequently, we trained a machine learning classifier to discriminate video sequences into different action classes based on the model output. We then evaluated the classifier accuracy in predicting the action depicted in new, unseen videos.
Figure 1: Action recognition stimulus set

Sample frames from the action recognition dataset consisting of 2 s video clips depicting five actors performing five actions (top row: drink, eat, jump, run and walk). Actions were recorded at five different viewpoints (bottom row: 0 (frontal), 45, 90 (side), 135 and 180 degrees with respect to the normal to the focal plane). They were all performed on a treadmill, and actors held a water bottle and an apple in their hand regardless of the action they performed in order to minimize low-level object/action confounds. Actors were centered in the frame and the background was held constant regardless of viewpoint.
The four models we developed to extract video representations were instances of Spatiotemporal Convolutional Neural Networks (ST-CNNs), currently the best performing artificial perception systems for action recognition [23]. These architectures are direct extensions of the Convolutional Neural Networks used to recognize objects or faces in static images [25,26] to input stimuli that extend both in space and time. ST-CNNs are hierarchical models that build selectivity to specific stimuli through template matching operations and robustness to transformations through pooling operations [Figure 2]. Qualitatively, Spatiotemporal Convolutional Neural Networks detect the presence of a certain video segment (a template) in their input stimulus; detections for various templates are then aggregated, following a hierarchical architecture, to construct video representations. Nuisances that should not be reflected in the model output, like changes in position, are discarded through a pooling mechanism [27].
Figure 2: Spatiotemporal Convolutional Neural Networks

Schematic overview of the class of models we used: Spatiotemporal Convolutional Neural Networks (ST-CNNs). ST-CNNs are hierarchical feature extraction architectures. Input videos go through layers of computation and the output of each layer serves as input to the next layer. The output of the last layer constitutes the video representation used in downstream tasks. The models we considered consisted of two convolutional-pooling layer pairs, denoted as Conv1, Pool1, Conv2 and Pool2. Convolutional layers performed template matching with a shared set of templates at all positions in space and time (spatiotemporal convolution), and pooling layers increased robustness through max-pooling operations. Convolutional layers' templates can be either fixed a priori, sampled or learned. In this example, templates in the first layer, Conv1, are fixed and depict moving Gabor-like receptive fields, while templates in the second convolutional layer, Conv2, are sampled from a set of videos containing actions and filmed at different viewpoints.
We considered a basic, purely convolutional model, and subsequently introduced modifications to its pooling mechanism and template learning rule to improve performance on invariant action recognition [26]. The first, purely convolutional model consisted of convolutional layers with fixed templates, interleaved with pooling layers that computed max-operations across contiguous regions of space. In particular, templates in the first convolutional layer contained moving Gabor filters, while templates in the second convolutional layer were sampled from a set of action sequences collected at various viewpoints. The second, Unstructured Pooling model allowed pooling layer units to span random sets of templates as well as contiguous space regions [Figure 3b]. The third, Structured Pooling model allowed pooling over contiguous regions of space as well as across templates depicting the same action at various viewpoints. The 3D orientation of each template was discarded through this pooling mechanism, similarly to how position in space is discarded in traditional CNNs [Figure 3a] [2,28]. The fourth and final model employed backpropagation, a gradient based optimization method, to learn convolutional layers' templates by iteratively maximizing performance on an action recognition task [26]. We used each model to extract feature representations of action sequences at two different viewpoints, frontal and side. We then sought to evaluate the four models based on how well they could support discrimination between the five actions in our video dataset. We trained a machine learning classifier to discriminate stimuli based on action and later assessed classification accuracy on new, unseen videos. In Experiment 1, we trained and tested the classifier using model features computed from videos captured at the same viewpoint. In Experiment 2, we trained and tested the classifier using model features computed from videos at mismatching viewpoints (e.g. if the classifier was trained using videos captured at the frontal viewpoint, then testing would be conducted using videos at the side viewpoint).
Figure 3: Structured and Unstructured Pooling

We introduced modifications to the basic ST-CNN to increase robustness to changes in 3D viewpoint. Qualitatively, Spatiotemporal Convolutional Neural Networks detect the presence of a certain video segment (a template) in their input stimulus. The 3D orientation of this template is discarded by the pooling mechanism in our structured pooling model, analogous to how position in space is discarded in a traditional CNN. a) In models with Structured Pooling, the template set for Conv2 layer cells was sampled from a set of videos containing four actors performing five actions at five different viewpoints (see Methods). All templates sampled from videos of a specific actor performing a specific action were pooled together by one Pool2 layer unit. b) Models employing Unstructured Pooling allowed Pool2 cells to pool over the entire spatial extent of their input as well as across channels. These models used the exact same templates employed by models relying on Structured Pooling and matched these models in the number of templates wired to a pooling unit. However, the assignment of templates to pooling units was randomized (uniform, without replacement) and did not reflect any semantic structure.
Experiment 1: Action discrimination - viewpoint match condition

In Experiment 1, we trained and tested the action classifier using feature representations of videos acquired at the same viewpoint, and therefore did not investigate robustness to changes in viewpoint. All models produced representations that successfully classified videos based on the action they depicted [Figure 4]. We observed a significant difference in performance between model 4, the end-to-end trainable model, and fixed template models 1, 2 and 3 (see Methods section). However, the task considered in Experiment 1 was not sufficient to rank the four types of ST-CNN models.
Figure 4: Action recognition: viewpoint match condition

We trained a supervised machine learning classifier to discriminate videos based on their action content using the feature representation computed by each of the Spatiotemporal Convolutional Neural Network models we considered. This figure shows the prediction accuracy of a machine learning classifier trained and tested using videos recorded at the same viewpoint. The classifier was trained on videos depicting four actors performing five actions at either the frontal or side view. The machine learning classifier accuracy was then assessed using new, unseen videos of a new, unseen actor performing those same five actions. No generalization across changes in 3D viewpoint was required of the feature extraction and classification system. Here we report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates outperform models with fixed templates significantly on this task. Chance is 1/5 and is indicated by a horizontal line. Horizontal lines at the top indicate significant differences between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Methods section).
Experiment 2: Action discrimination - viewpoint mismatch condition

The four ST-CNN models we developed were designed to have varying degrees of tolerance to changes in viewpoint. In Experiment 2, we investigated how well these model representations could support learning to discriminate video sequences based on their action content across changes in viewpoint. The general experimental procedure was identical to the one outlined for Experiment 1, except we used features extracted from videos acquired at mismatching viewpoints for training and testing (e.g., a classifier trained using videos captured at the frontal viewpoint would be tested on videos at the side viewpoint). All the models we considered produced representations that were, at least to a minimal degree, useful to discriminate actions invariantly to changes in viewpoint [Figure 5]. Unlike what we observed in Experiment 1, it was possible to rank the models we considered based on performance on this task. This was expected, since the various models were designed to exhibit various degrees of robustness to changes in viewpoint (see Methods section). The end-to-end trainable model (model 4) performed better than models 1, 2 and 3, which used fixed templates, on this task. Within the fixed templates group, as expected, models that employed a Structured Channel Pooling mechanism to increase robustness performed best [29].
Figure 5: Action recognition: viewpoint mismatch condition

This figure shows the prediction accuracy of a machine learning classifier trained and tested using videos recorded at opposed viewpoints. When the classifier was trained using videos at, say, the frontal viewpoint, its accuracy in discriminating new, unseen videos would be established using videos recorded at the side viewpoint. Here we report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates resulted in significantly higher accuracy in this task. Among models with fixed templates, Spatiotemporal Convolutional Neural Networks employing Structured Pooling outperformed both purely convolutional and Unstructured Pooling models. Chance is 1/5, indicated with a horizontal line. Horizontal lines at the top indicate significant differences between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Methods section).
Comparison of model representations and neural recordings

We used Representational Similarity Analysis (RSA) to assess how well each model feature representation matched the human neural data. RSA produces a measure of agreement between artificial models and brain recordings based on the correlation between empirical dissimilarity matrices constructed using either the model representation of a set of stimuli, or recordings of the neural responses these stimuli elicit [Figure 6] [21]. We used video feature representations extracted by each model from a set of new, unseen stimuli to construct model dissimilarity matrices, and Magnetoencephalography (MEG) sensor recordings of the neural activity elicited by those same stimuli to compute a neural dissimilarity matrix [22]. Finally, we constructed a dissimilarity matrix using an action categorical oracle. In this case, the dissimilarity between videos of the same action was zero and the distance across actions was one.
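As a concrete illustration of this procedure, the sketch below builds a correlation-based dissimilarity matrix from a feature matrix; the array shapes and random data are placeholders standing in for model features and MEG sensor readings, not the actual stimuli.

```python
import numpy as np

def dissimilarity_matrix(features):
    """Correlation-based representational dissimilarity matrix (RDM).

    features: (n_stimuli, n_dims) array, one row per video.
    Returns an (n_stimuli, n_stimuli) matrix of 1 - Pearson correlation,
    rescaled to lie between 0 and 1.
    """
    rdm = 1.0 - np.corrcoef(features)                   # 1 - r for every pair
    return (rdm - rdm.min()) / (rdm.max() - rdm.min())  # normalize to [0, 1]

# Placeholder data: 50 stimuli, 512-dim model features, 306 MEG sensors.
model_rdm = dissimilarity_matrix(np.random.randn(50, 512))
neural_rdm = dissimilarity_matrix(np.random.randn(50, 306))
```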
Figure 6: Feature representation empirical dissimilarity matrices

We used feature representations, extracted with the four Spatiotemporal Convolutional Neural Network models, from 50 videos depicting five actors performing five actions at two different viewpoints, frontal and side. Moreover, we obtained Magnetoencephalography (MEG) recordings of human subjects' brain activity while they were watching these same videos, and used these recordings as a proxy for the neural representation of these videos. These videos were not used to construct or learn any of the models. For each of the six representations of each video (four artificial models, a categorical oracle and one set of neural recordings) we constructed an empirical dissimilarity matrix using linear correlation and normalized it between 0 and 1. Empirical dissimilarity matrices on the same set of stimuli constructed with video representations from a) Model 1: Purely Convolutional model, b) Model 2: Unstructured Pooling model, c) Model 3: Structured Pooling model, d) Model 4: Learned templates model, e) Categorical oracle and f) Magnetoencephalography brain recordings.
We observed that the end-to-end trainable model (model 4) produced dissimilarity structures that better agreed with those constructed from neural data than models with fixed templates [Figure 7]. Within models with fixed templates, model 3, constructed using a Structured Pooling mechanism to build invariance to changes in viewpoint, produced representations that agreed better with the neural data than models employing Unstructured Pooling (model 2) and purely convolutional models (model 1).
Figure 7: Representational Similarity Analysis between Model representations and Human Neural data

We computed the Spearman Correlation Coefficient (SCC) between the lower triangular portion of the dissimilarity matrix constructed with each of the artificial models we considered and the dissimilarity matrix constructed with neural data (shown and described in Figure 6). We assessed the uncertainty of this measure by resampling the rows and columns of the matrices we constructed. The noise ceiling was assessed by computing the SCC between each individual human subject's dissimilarity matrix and the average dissimilarity matrix over the rest of the subjects. The noise floor was computed by assessing the SCC between the lower portion of the dissimilarity matrix constructed using each of the model representations and a scrambled version of the neural dissimilarity matrix. The score reported here is normalized so that the noise ceiling is 1 and the noise floor is 0. Models with learned templates agree with the neural data significantly better than models with fixed templates. Among these, models with Structured Pooling outperform both purely Convolutional and Unstructured models. Horizontal lines at the top indicate significant differences between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Methods section).
Discussion

We have shown that, within the Spatiotemporal Convolutional Neural Networks model class, and across a deliberate set of model modifications, feature representations that are more useful to discriminate actions in video sequences, in a manner that is robust to changes in viewpoint, also produce empirical dissimilarity structures that are more similar to those constructed using human neural data. Discrimination performance on a simpler task, one that does not require generalization across complex transformations, was not sufficient to fully rank the model representations. These results support our hypothesis that performance on invariant discriminative tasks shaped the neural representations of actions that are computed by our visual cortex. Moreover, dissimilarity matrices constructed with ST-CNN representations match those built with neural data better than a purely categorical dissimilarity matrix. This highlights the importance of both the computational task and the architectural constraints, described in previous accounts of the neural processing of action and motion, to build quantitatively accurate models of neural data representations [30]. Our findings are in agreement with what has been reported for the perception of objects from static images, both at the single recording site and at the whole brain level [18–20], and identify a computational task that explains and recapitulates the representations of human action in visual cortex.
We developed the four ST-CNN models using deliberate modifications to improve the models' feature representations for invariant action recognition. In so doing, we verified that structured pooling architectures and memory based learning (model 3), as previously described and theoretically motivated [2,3], can be applied to build representations of video sequences that support recognition invariant to complex, non-affine transformations. However, empirically, we found that learning model templates using gradient based methods and a fully supervised action recognition task (model 4) led to better results, both in terms of classification accuracy and agreement with neural recordings [20].
A limitation of the methods we employed is that the extent of the match between a model representation and the neural data is appraised solely based on the correlation between the empirical dissimilarity structures constructed with neural recordings and model representations. This relatively abstract comparison provides no guidance in establishing a one-to-one mapping between model units and brain regions or sub-regions and therefore cannot exclude models on the basis of biological implausibility [19]. In this work, we mitigated this limitation by constraining the model class to reflect previous accounts of neural computational units and mechanisms that are involved in the perception of motion [12,13,17,31,32].
Furthermore, the class of models we developed in our experiments is purely feedforward; however, the neural recordings we selected were acquired 470 ms after stimulus onset. This late in visual processing, it is likely that feedback signals are among the energy sources captured by the recordings. These signals are not accounted for in our models. We provide evidence that adding a feedback mechanism, through recursion, does not improve recognition performance nor correlation with the neural data [Supplementary Figure 1]. We cannot, however, exclude that this is due to the stimuli and discrimination task we designed, which only considered pre-segmented, relatively short action sequences.
Recognizing the actions of others from complex visual stimuli is a crucial aspect of human perception. We investigated the relevance of invariant action discrimination to improving model representations' agreement with neural recordings and showed that it is one of the computational principles shaping the representation of human action sequences human visual cortex evolved, or learned, to compute. Our deliberate approach to model design underlined the relevance of both supervised, gradient based performance optimization methods and memory based, structured pooling methods to the modeling of neural data representations. If and how primate visual cortex could implement gradient based optimization or acquire the necessary supervision remains, despite recent efforts, an unsettled matter [33–35]. Memory-based learning and structured pooling have been investigated extensively as biologically plausible learning algorithms [2,36–38]. Irrespective of the precise biological mechanisms that could carry out performance optimization on invariant discriminative tasks, computational studies point to its relevance to understanding neural representations of visual scenes [18–20]. Recognizing the semantic category of visual stimuli across photometric, geometric or more complex changes, in very low sample regimes, is a hallmark of human visual intelligence. By building data representations that support this kind of robust recognition, we have shown here, one obtains empirical dissimilarity structures that match those constructed using human neural data. In the wider context of the study of perception, our results strengthen the claim that the computational goal of human visual cortex is to support invariant recognition by broadening it to the study of action perception.
Materials and methods

Action Recognition Dataset

We collected a dataset of five actors performing five actions (drink, eat, jump, run and walk) on a treadmill at five different viewpoints (0, 45, 90, 135 and 180 degrees between the line across the center of the treadmill and the line normal to the focal plane of the video camera). We rotated the treadmill rather than the camera to keep the background constant across changes in viewpoint [Figure 1]. The actors were instructed to hold an apple and a bottle in their hand regardless of the action they were performing, so that objects and background would not differ between actions. Each action/actor/view was filmed for at least 52 s. Subsequently the original videos were cut into 26 clips, each 2 s long, resulting in a dataset of 3,250 video clips (5 actions x 5 actors x 5 viewpoints x 26 clips). Video clips started at random points in the action cycle (for example, a jump might start mid-air or before the actor's feet left the ground) and each 2 s clip contained a full action cycle. The authors manually identified one single spatial bounding box that contained the entire body of each actor and cropped all videos according to this bounding box.
Recognizing actions with spatiotemporal convolutional representations

General Experimental Procedure

Experiment 1 and Experiment 2 were designed to quantify the amount of action information extracted from video sequences by four computational models of primate visual cortex. In Experiment 1, we tested basic action recognition. In Experiment 2, in particular, we further quantified whether this action information could support action recognition robustly to changes in viewpoint. The motivating idea behind our design is that, if a machine learning classifier is able to discriminate unseen video sequences based on their action content, using the output of a computational model, then this model representation contains some action information. Moreover, if the classifier is able to discriminate videos based on action at new, unseen viewpoints using model outputs, then it must be that these model representations not only carry action information, but that changes in viewpoint are not reflected in the model output. This procedure is analogous to neural decoding techniques, with the important difference that the output of an artificial model is used in lieu of brain recordings [39,40].
The general experimental procedure is as follows: we constructed feedforward hierarchical spatiotemporal convolutional models and used them to extract feature representations of a number of video sequences. We then trained a machine learning classifier to predict the action label of a video sequence based on its feature representation. Finally, we quantified the performance of the classifier by measuring prediction accuracy on a set of new, unseen videos.

The procedure outlined above was performed using three separate subsets of the action recognition dataset described in the previous section. In particular, constructing a spatiotemporal convolutional model requires access to video sequences depicting actions to sample or learn convolutional layers' templates. The subset of video sequences used to learn or sample templates was called the "embedding set". Training and testing the classifier required extracting model responses from a number of video sequences; these sequences were organized in two subsets: "training set" and "test set". There was never any overlap between the "test set" and the union of "training set" and "embedding set".
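The sketch below illustrates one way to realize this three-subset split; the clip metadata and the choice of a single left-out actor are illustrative assumptions, not the authors' code.

```python
# Hypothetical clip metadata: one entry per (actor, action, viewpoint).
clips = [{"actor": a, "action": act, "view": v}
         for a in range(5)
         for act in ("drink", "eat", "jump", "run", "walk")
         for v in (0, 45, 90, 135, 180)]

test_actor = 4  # the left-out actor defining this split

# Embedding set: all videos of the remaining four actors; convolutional
# templates are sampled or learned from these.
embedding_set = [c for c in clips if c["actor"] != test_actor]

# Training set: a subset of the embedding set at a single viewpoint.
training_set = [c for c in embedding_set if c["view"] == 0]

# Test set: the left-out actor only, so it never overlaps the union of
# the embedding and training sets.
test_set = [c for c in clips if c["actor"] == test_actor and c["view"] == 0]

assert all(c["actor"] != test_actor for c in embedding_set + training_set)
```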
Experiment 1

The purpose of Experiment 1 was to assess how well the data representations produced by each of the four models supported a non-invariant action recognition task. In particular, the embedding set used to sample or learn templates contained videos showing all five actions at all five viewpoints performed by four of the five actors. The training set was a subset of the embedding set, and contained only videos at either the frontal viewpoint or the side viewpoint. Lastly, the test set contained videos of all five actions, performed by the fifth, left-out actor at either the frontal or side viewpoint. We obtained five different splits by choosing each of the five actors exactly once for test. After the templates had either been learned or sampled, we used each model to extract representations of the train and test set videos. We averaged the performance over the two possible choices of training viewpoint, frontal or side. We report the mean and standard error of the classification accuracy across the five possible choices of the test actor.
Experiment 2

Experiment 2 was designed to assess the performance of each model in producing data representations that were useful to classify videos according to their action content when generalization across changes in viewpoint was required. The experiment is identical to Experiment 1, and used the exact same models. However, when the training set contained videos recorded at the frontal viewpoint, the test set would contain videos at the side viewpoint, and vice-versa. We report the mean and standard deviation over the choice of the test actor of the average accuracy over the choice of training viewpoint.
Feedforward Spatiotemporal Convolutional Neural Networks

Feedforward Spatiotemporal Convolutional Neural Networks (ST-CNNs) are hierarchical models: input video sequences go through layers of computations and the output of each layer serves as input to the next layer [Figure 2]. These models are direct generalizations of models of the neural mechanisms that support recognizing objects in static images [27,41] to stimuli that extend in both space and time (i.e. video stimuli). Within each layer, single computational units process a portion of the input video sequence that is compact both in space and time. The outputs of each layer's units are then processed and aggregated by units in the subsequent layers to construct a final signature representation for the whole input video. The sequence of layers we adopted alternates layers of units which perform template matching (or convolutional layers) and layers of units which perform max pooling operations [15,17,23]. Units' receptive field sizes increase as the signal propagates through the hierarchy of layers.
All convolutional units within a layer share the same set of templates (filter bank) and output the dot product between each filter and their input. Qualitatively, these models work by detecting the presence of a certain video segment (a template) in the input stimulus. The exact position in space and time of the detection is discarded by the pooling mechanism. The specific models we present here consist of two convolutional-pooling layer pairs. The layers are denoted as Conv1, Pool1, Conv2 and Pool2 [Figure 2]. Convolutional layers are completely characterized by the size, content and stride of their units' receptive fields, and pooling layers are completely characterized by the operation they perform (in the cases we considered, outputting the maximum value of their input) and their pooling regions (which can extend across space, time and filters).
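A minimal numpy sketch of these two operations is given below; it is a naive, loop-based illustration of valid spatiotemporal template matching and max pooling, not the optimized implementation used in the experiments.

```python
import numpy as np

def st_convolve(video, template):
    """Dot product of a template with a video at every space-time position.

    video: (T, H, W) array; template: (t, h, w) array with t<=T, h<=H, w<=W.
    Returns the valid correlation map (stride 1, no subsampling).
    """
    T, H, W = video.shape
    t, h, w = template.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * template)
    return out

def max_pool(resp, region=(1, 4, 4), stride=(1, 2, 2)):
    """Output the maximum of each (time, space, space) pooling region."""
    rt, rh, rw = region
    st, sh, sw = stride
    T, H, W = resp.shape
    return np.array([[[resp[i:i+rt, j:j+rh, k:k+rw].max()
                       for k in range(0, W - rw + 1, sw)]
                      for j in range(0, H - rh + 1, sh)]
                     for i in range(0, T - rt + 1, st)])
```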
Model 1: Purely convolutional models with sampled templates

The purely convolutional models with fixed and sampled templates we considered were implemented using the Cortical Network Simulator package [42].

The input videos were (128 x 76 pixel) x 60 frames; the model received the original input videos alongside two scaled-down versions of it (scaling factors of ½ and ¼ in each spatial dimension, respectively).

The first layer, Conv1, consisted of convolutional units with 72 templates of size (7 x 7 pixel) x 3 frames, (9 x 9 pixel) x 4 frames and (11 x 11 pixel) x 5 frames. Convolution was carried out with a stride of 1 pixel (no spatial subsampling). Conv1 filters were obtained by letting Gabor-like receptive fields shift in space over frames (as described in previous studies of the receptive fields of V1 and MT cells [12,13,31]). The full expression for each filter was as follows:
$$f(x, y, t; \theta, v, \sigma, \lambda, \tau) = m(t)\, \exp\!\left(-\frac{x'(\theta, v, t)^2 + y'(\theta, v, t)^2}{2\sigma^2}\right) \cos\!\left(\frac{2\pi\, y'(\theta, v, t)}{\lambda}\right)$$
where x'(θ, v, t) and y'(θ, v, t) are transformed coordinates that take into account a rotation by θ and a shift by vt in the direction orthogonal to θ. The Gabor filters we considered had a spatial aperture (in both spatial directions) of σ = 0.6s, with s representing the spatial receptive field size, and a wavelength λ proportional to σ [43]. Each filter had a preferred orientation θ chosen among 8 possible orientations (0, 45, 90, 135, 180, 225, 270, 315 degrees with respect to vertical). Each template was obtained by letting the Gabor-like receptive field just described shift in the direction orthogonal to its preferred orientation (e.g. a vertical edge would move sideways) with a speed v chosen from a linear grid of 3 points between 4/3 and 4 pixels per frame (the shift in the first frame of the template was chosen so that the mean of the Gabor-like receptive field's envelope would be centered in the middle frame). Lastly, Conv1 templates had a time modulation of the standard motion-energy form

$$m(t) = (kt)^{\tau} e^{-kt}\left[\frac{1}{\tau!} - \frac{(kt)^2}{(\tau + 2)!}\right]$$

with τ = 3 and t = 0, ..., T, with T the temporal receptive field size [13,17].
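A sketch of how such a moving Gabor template can be generated follows. The parameterization (the drift centering, the Adelson-Bergen-style temporal envelope, and the example aperture and wavelength values) is an assumption reconstructed from the expressions above, not the authors' CNS code.

```python
import numpy as np
from math import factorial

def moving_gabor(s, T, theta, v, sigma, lam, tau=3, k=1.0):
    """Moving Gabor template of T frames, each s x s pixels.

    theta: preferred orientation (radians); v: drift speed in pixels/frame,
    applied orthogonally to theta; sigma, lam: Gabor aperture and wavelength;
    tau, k: parameters of the temporal envelope m(t) (assumed values).
    """
    ys, xs = np.mgrid[0:s, 0:s] - (s - 1) / 2.0
    filt = np.empty((T, s, s))
    for t in range(T):
        # Shift the envelope orthogonally to theta, centred on the middle frame.
        shift = v * (t - (T - 1) / 2.0)
        x = xs - shift * np.cos(theta + np.pi / 2)
        y = ys - shift * np.sin(theta + np.pi / 2)
        xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        m = (k * t) ** tau * np.exp(-k * t) * (
            1 / factorial(tau) - (k * t) ** 2 / factorial(tau + 2))
        filt[t] = m * np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) \
                    * np.cos(2 * np.pi * yr / lam)
    return filt

# Example: a (9 x 9 pixel) x 4 frame template at 45 degrees, 2 px/frame.
g = moving_gabor(s=9, T=4, theta=np.pi / 4, v=2.0, sigma=0.6 * 9, lam=4.0)
```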
The second layer, Pool1, performed max pooling operations on its input by simply finding and outputting the maximum value of each pooling region. Responses to each channel in the Conv1 filter bank were pooled independently, and units pooled across regions of space: (4 x 4 units in space) x 1 unit in time, with a stride of 2 units in space and 1 unit in time, and two scale channels.
A second convolutional layer, Conv2, followed Pool1. Templates in this case were sampled randomly from the Pool1 responses to videos in the embedding set. We used a model with 512 Conv2 units with sizes (9 x 9 units in space) x 3 units in time, (17 x 17 units in space) x 7 units in time and (25 x 25 units in space) x 11 units in time, and a stride of 1 in all directions.

Finally, the Pool2 layer units performed max pooling. Pooling regions extended over the entire spatial input, one temporal unit, all remaining scales, and a single Conv2 channel.
Models 2 and 3: Structured and Unstructured Pooling models with sampled templates

Unstructured and Structured Pooling models (models 2 and 3, respectively) were constructed by modifying the Pool2 layer of the purely convolutional models. Specifically, in these models Pool2 units pooled over the entire spatial input, one temporal unit, all remaining scales and 8 or 9 Conv2 channels (512 channels divided among 60 Pool2 units).

In the models employing a Structured Pooling mechanism, all templates sampled from videos of a particular actor performing a particular action, regardless of viewpoint, were pooled together [Figure 3a]. Templates of different sizes and corresponding to different scale channels were pooled independently. This resulted in 6 Pool2 units per action/actor pair, one for each receptive-field-size/scale-channel pair. The intuition behind the Structured Pooling mechanism is that the resulting Pool2 units will respond strongly to the presence of a certain template (e.g. the torso of someone running) regardless of its 3D pose [2,28,29,44–48].
The models employing an Unstructured Pooling mechanism followed a similar pattern; however, the wiring between simple and complex cells was random [Figure 3b]. The fixed template models (models 1, 2 and 3) employed the exact same set of templates (we sampled the templates from the embedding sets only once and used them in all three models) and differed only in their pooling mechanisms.
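The difference between the two wiring schemes can be made concrete with the short sketch below; the response shapes and template tags are hypothetical stand-ins for Conv2 outputs.

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical Conv2 responses (channels, time, height, width); each channel
# corresponds to one sampled template tagged with (actor, action, view).
n_views = 5
responses = rng.standard_normal((40, 10, 8, 8))
tags = [(actor, action, view)
        for actor in range(2) for action in range(4) for view in range(n_views)]

def pool_unit(resp, channels):
    """One Pool2 unit: max over a set of channels and the full spatial extent."""
    return resp[channels].max(axis=(0, 2, 3))  # keeps the time axis

# Structured pooling: group channels by (actor, action), discarding viewpoint.
groups = {}
for i, (actor, action, view) in enumerate(tags):
    groups.setdefault((actor, action), []).append(i)
structured = [pool_unit(responses, chans) for chans in groups.values()]

# Unstructured pooling: same templates, same group size, random assignment.
perm = rng.permutation(len(tags))
unstructured = [pool_unit(responses, list(perm[i:i + n_views]))
                for i in range(0, len(tags), n_views)]
```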
Model 4: Learned template models

Models with learned templates were implemented using Torch packages. These models' templates were trained to recognize actions from videos in the embedding set using a cross-entropy loss function, full supervision and backpropagation [49]. The models' general architecture was similar to the one we used for models with structured and unstructured pooling. Specifically, during template learning we used two stacked Convolution-BatchNorm-MaxPooling-BatchNorm modules [50] followed by two Linear-ReLU-BatchNorm modules (ReLU units are half-rectifiers) and a final Log-Soft-Max layer. During feature extraction, the Linear and LogSoftMax layers were discarded. Input videos were resized to (128 x 76 pixel) x 60 frames, like in the fixed-template models. The first convolutional layer's filter bank comprised 72 filters of size (9 x 9 pixel) x 9 frames, and convolution was applied with a stride of 2 in all directions. The first max-pooling layer used pooling regions of size (4 x 4 units in space) x 1 unit in time, applied with a stride of 2 units in both spatial directions and 1 unit in time. The second convolutional layer's filter bank was made up of 60 templates of size (17 x 17 units in space) x 3 units in time; responses were computed with a stride of 2 units in time and 1 unit in all spatial directions. The second max-pooling layer's units pooled over the full extent of both spatial dimensions, 1 unit in time and 5 channels. Lastly, the Linear layers had 256 and 128 units respectively and free bias terms. Model training was carried out using Stochastic Gradient Descent [49] and mini-batches of 10 videos.
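A modern PyTorch sketch of this architecture is shown below. It is an approximation: the original was written in Torch, the spatial padding on the second convolution is added so that the 17 x 17 kernel fits the pooled feature map, and the channel pooling of the second max-pooling layer is replaced by a purely spatial pool.

```python
import torch
import torch.nn as nn

class LearnedTemplateSTCNN(nn.Module):
    """Approximate re-implementation of the learned-template ST-CNN."""
    def __init__(self, n_actions=5):
        super().__init__()
        # Conv3d expects (batch, channels, time, height, width).
        self.features = nn.Sequential(
            nn.Conv3d(1, 72, kernel_size=(9, 9, 9), stride=2),
            nn.BatchNorm3d(72),
            nn.MaxPool3d(kernel_size=(1, 4, 4), stride=(1, 2, 2)),
            nn.BatchNorm3d(72),
            nn.Conv3d(72, 60, kernel_size=(3, 17, 17),
                      stride=(2, 1, 1), padding=(0, 8, 8)),
            nn.BatchNorm3d(60),
            nn.AdaptiveMaxPool3d((None, 1, 1)),  # pool the full spatial extent
        )
        # Discarded at feature-extraction time, used only during training.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(), nn.BatchNorm1d(256),
            nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128),
            nn.Linear(128, n_actions), nn.LogSoftmax(dim=1),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Log-softmax outputs pair with nn.NLLLoss and SGD, mini-batches of 10 videos.
model = LearnedTemplateSTCNN()
out = model(torch.randn(2, 1, 60, 128, 76))  # (batch, 1, frames, H, W)
```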
Machine learning classifier

We used the GURLS package [51] to train and test a Regularized Least Squares Gaussian-kernel classifier using features extracted from the training and test sets, respectively, and the corresponding action labels. The aperture of the Gaussian kernel as well as the l2 regularization parameter were chosen with a leave-one-out cross-validation procedure on the training set. Accuracy was evaluated separately for each class and then averaged over classes.
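For readers without access to GURLS, the following is a rough Python stand-in using scikit-learn's kernel ridge regression; the one-vs-all encoding, parameter grids and random data are illustrative assumptions, not the GURLS configuration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.preprocessing import label_binarize

# Placeholder features: 50 training / 25 test videos, 5 action classes.
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((50, 128)), rng.integers(0, 5, 50)
X_test, y_test = rng.standard_normal((25, 128)), rng.integers(0, 5, 25)

# One-vs-all regularized least squares: regress one-hot labels with a
# Gaussian (RBF) kernel; kernel width and l2 penalty picked by leave-one-out.
Y_train = label_binarize(y_train, classes=list(range(5)))
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"alpha": np.logspace(-3, 1, 5), "gamma": np.logspace(-4, 0, 5)},
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)
search.fit(X_train, Y_train)

pred = search.predict(X_test).argmax(axis=1)           # best-scoring class
per_class = [np.mean(pred[y_test == c] == c) for c in np.unique(y_test)]
accuracy = float(np.mean(per_class))                   # class-averaged
```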
Significance testing: Model accuracy

We used a group one-way ANOVA to assess the significance of the difference in performance between all the fixed-template methods and the models with learned templates. We then used a paired-sample t-test with Bonferroni correction to assess the significance level of the difference between the performance of individual models. Differences were deemed significant at p < 0.05.
Quantifying agreement between model representations and neural recordings

Neural recordings

The brain activity of 8 human participants with normal or corrected-to-normal vision was recorded with an Elekta Neuromag Triux Magnetoencephalography (MEG) scanner while they watched 50 videos (five actors, five actions, two viewpoints: front and side) acquired with the same procedure outlined above, but not included in the dataset used for model template sampling or training. The MEG data were first presented in [22] (the reference also details all acquisition, preprocessing and decoding methods). In the original neural recording study, MEG recordings were used to train a pattern classifier to discriminate video stimuli on the basis of the neural response they elicited. The performance of the pattern classifier was then assessed on a separate set of recordings from the same subjects. This train/test decoding procedure was repeated every 10 ms and individually for each subject, both in a non-invariant (train and test at the same viewpoint) and an invariant (train at one viewpoint and test at the different viewpoint) case. It was possible to discriminate videos based on their action content from the neural response they elicited [22].

We used the filtered MEG recordings (all 306 sensors) elicited by each of the 50 videos mentioned above, averaged across subjects and averaged over a 100 ms window centered around 470 ms after stimulus onset (the time of maximum accuracy for action decoding), as a proxy for the neural representation of each video.
Representational Similarity Analysis

We computed the pairwise correlation-based dissimilarity matrix for each of the model representations of the 50 videos that were shown to human subjects in the MEG. Likewise, we computed the empirical dissimilarity matrix using the MEG neural recordings. We then performed 50 rounds of bootstrap; in each round we randomly sampled 30 videos out of the original 50 (corresponding to 30 rows and columns of the dissimilarity matrices). For each 30-video sample, we assessed the level of agreement of the dissimilarity matrix induced by each model representation with the one computed using neural data by calculating the Spearman Correlation Coefficient (SCC) between the lower triangular portions of the two matrices.
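A compact sketch of this resampling procedure, assuming the RDMs from the earlier snippet, might read:

```python
import numpy as np
from scipy.stats import spearmanr

def lower_tri(rdm):
    """Vectorize the strictly lower-triangular part of a dissimilarity matrix."""
    return rdm[np.tril_indices_from(rdm, k=-1)]

def bootstrap_scc(model_rdm, neural_rdm, n_rounds=50, n_sample=30, seed=0):
    """SCC between model and neural RDMs over resampled stimulus subsets."""
    rng = np.random.default_rng(seed)
    n = model_rdm.shape[0]
    sccs = []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n_sample, replace=False)  # 30 of 50 videos
        sub_m = model_rdm[np.ix_(idx, idx)]
        sub_n = neural_rdm[np.ix_(idx, idx)]
        sccs.append(spearmanr(lower_tri(sub_m), lower_tri(sub_n)).correlation)
    return np.array(sccs)
```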
We computed an estimate for the noise ceiling in the neural data by repeating the bootstrap procedure outlined above to assess the level of agreement between an individual human subject and the average of the rest. We then selected the highest possible match score across subjects and across 100 rounds of bootstrap to serve as the noise ceiling.

Similarly, we assessed a chance level for the Representational Similarity score by computing the match between each model and a scrambled version of the neural data matrix. We performed 100 rounds of bootstrap per model (reshuffling the neural dissimilarity matrix rows and columns each time) and selected the maximum score across rounds of bootstrap and models to serve as the baseline score [21].
We normalized the SCC obtained by comparing each model representation to the neural recordings by re-scaling it to fall between 0 (chance level) and 1 (noise ceiling). On this normalized scale, anything positive matches the neural data better than chance with p < 0.01.
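Explicitly, the rescaling described above corresponds to

$$\mathrm{score} = \frac{\mathrm{SCC}_{\mathrm{model}} - \mathrm{SCC}_{\mathrm{floor}}}{\mathrm{SCC}_{\mathrm{ceiling}} - \mathrm{SCC}_{\mathrm{floor}}}$$

where the floor and ceiling are the chance level and noise ceiling estimates defined above.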
Significance testing: Matching neural data

We used a one-way group ANOVA to assess the difference between the Spearman Correlation Coefficients (SCC) obtained using models that employed fixed templates and models with learned templates. Subsequently, we assessed the significance of the difference between the SCC of each model by performing a paired t-test between the samples obtained through the bootstrap procedure. We deemed differences to be significant when p < 0.05 (Bonferroni corrected).
Acknowledgements

We would like to thank Georgios Evangelopoulos, Charles Frogner, Patrick Winston, Gabriel Kreiman, Martin Giese, Charles Jennings, Heuihan Jhuang, and Cheston Tan for their feedback on this work. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We gratefully acknowledge the support of NVIDIA with the donation of the Tesla K40 GPU used for this research. We are grateful to the Martinos Imaging Center at MIT, where the neural recordings used in this work were acquired, and to the McGovern Institute for Brain Research at MIT for supporting this research.
References
1. Moeslund TB, Granum E. A Survey of Computer Vision-Based Human Motion Capture. Comput Vis Image Underst. 2001;81:231–268. doi:10.1006/cviu.2000.0897
2. Anselmi F, Leibo JZ, Rosasco L, Mutch J, Tacchetti A, Poggio T. Unsupervised learning of invariant representations. Theor Comput Sci. 2015; doi:10.1016/j.tcs.2015.06.048
3. Poggio TA, Anselmi F. Visual Cortex and Deep Networks: Learning Invariant Representations. MIT Press; 2016.
4. Perrett DI, Smith PA, Mistlin AJ, Chitty AJ, Head AS, Potter DD, et al. Visual analysis of body movements by neurones in the temporal cortex of the macaque monkey: a preliminary report. Behav Brain Res. 1985;16:153–170. doi:10.1016/0166-4328(85)90089-0
5. Vangeneugden J, Pollick F, Vogels R. Functional differentiation of macaque visual temporal cortical neurons using a parametric action space. Cereb Cortex. 2009;19:593–611. doi:10.1093/cercor/bhn109
6. Grossman ED, Blake R. Brain areas active during visual perception of biological motion. Neuron. 2002;35:1167–1175. doi:10.1016/S0896-6273(02)00897-8
7. Vaina LM, Solomon J, Chowdhury S, Sinha P, Belliveau JW. Functional neuroanatomy of biological motion perception in humans. Proc Natl Acad Sci U S A. 2001;98:11656–61. doi:10.1073/pnas.191374198
8. Beauchamp MS, Lee KE, Haxby JV, Martin A. FMRI responses to video and point-light displays of moving humans and manipulable objects. J Cogn Neurosci. 2003;15:991–1001. doi:10.1162/089892903770007380
9. Peelen MV, Downing PE. Selectivity for the human body in the fusiform gyrus. J Neurophysiol. 2005;93:603–8. doi:10.1152/jn.00513.2004
10. Grossman ED, Jardine NL, Pyles JA. fMR-Adaptation Reveals Invariant Coding of Biological Motion on the Human STS. Front Hum Neurosci. 2010;4:15. doi:10.3389/neuro.09.015.2010
11. Vangeneugden J, Peelen MV, Tadin D, Battelli L. Distinct neural mechanisms for body form and body motion discriminations. J Neurosci. 2014;34:574–85. doi:10.1523/JNEUROSCI.4032-13.2014
12. Adelson EH, Bergen JR. Spatiotemporal energy models for the perception of motion. J Opt Soc Am A. 1985;2:284–299.
13. Simoncelli EP, Heeger DJ. A model of neuronal responses in visual area MT. Vision Res. 1998;38:743–761.
14. Singer JM, Sheinberg DL. Temporal cortex neurons encode articulated actions as slow sequences of integrated poses. J Neurosci. 2010;30:3133–3145.
15. Tan C, Singer JM, Serre T, Sheinberg D, Poggio T. Neural representation of action sequences: how far can a simple snippet-matching model take us? In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems 26. Curran Associates, Inc.; 2013. pp. 593–601.
16. Giese MA, Poggio T. Neural Mechanisms for the Recognition of Biological Movements. Nat Rev Neurosci. 2003;4:179–192. doi:10.1038/nrn1057
17. Jhuang H, Serre T, Wolf L, Poggio T. A Biologically Inspired System for Action Recognition. IEEE 11th International Conference on Computer Vision (ICCV 2007). 2007. pp. 1–8.
18. Yamins DLK, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc Natl Acad Sci U S A. 2014;111:8619–24. doi:10.1073/pnas.1403112111
19. Yamins DLK, DiCarlo JJ. Using goal-driven deep learning models to understand sensory cortex. Nat Neurosci. 2016;19:356–365. doi:10.1038/nn.4244
20. Khaligh-Razavi SM, Kriegeskorte N. Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLoS Comput Biol. 2014;10. doi:10.1371/journal.pcbi.1003915
21. Kriegeskorte N, Mur M, Bandettini P. Representational similarity analysis – connecting the branches of systems neuroscience. Front Syst Neurosci. 2008; doi:10.3389/neuro.06.004.2008
22. Isik L, Tacchetti A, Poggio T. Fast, invariant representation for human action in the visual system. arXiv preprint. 2016;
23. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Li FF. Large-scale video classification with convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. pp. 1725–1732. doi:10.1109/CVPR.2014.223
24. Ji S, Yang M, Yu K, Xu W. 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell. 2013;35:221–31. doi:10.1109/TPAMI.2012.59
25. Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999;2:1019–25. doi:10.1038/14819
26. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444. doi:10.1038/nature14539
27. Serre T, Oliva A, Poggio T. A feedforward architecture accounts for rapid categorization. Proc Natl Acad Sci U S A. 2007;104:6424–6429. doi:10.1073/pnas.0700622104
28. Leibo JZ, Liao Q, Anselmi F, Poggio T. The invariance hypothesis implies domain-specific regions in visual cortex. PLoS Comput Biol. Public Library of Science; 2015;11:e1004390.
29. Leibo JZ, Mutch J, Poggio T. Learning to discount transformations as the computational goal of visual cortex? Presented at FGVC/CVPR 2011, Colorado Springs, CO. 2011; Available: Nature Precedings at dx.doi.org/10.1038/npre.2011.6078.1
30. Jhuang H. A biologically inspired system for action recognition. 2008. Available: http://hdl.handle.net/1721.1/42247
31. Rust NC, Schwartz O, Movshon JA, Simoncelli EP. Spatiotemporal elements of macaque V1 receptive fields. Neuron. 2005;46:945–956.
32. Rust NC, Mante V, Simoncelli EP, Movshon JA. How MT cells analyze the motion of visual patterns. Nat Neurosci. 2006;9:1421–1431.
33. Bengio Y, Lee D-H, Bornschein J, Lin Z. Towards Biologically Plausible Deep Learning. arXiv preprint arXiv:1502.04156. 2015;
34. Liao Q, Leibo JZ, Poggio T. How Important is Weight Symmetry in Backpropagation? arXiv:1510.05067. 2015;1837–1844.
35. Mazzoni P, Andersen RA, Jordan MI. A more biologically plausible learning rule for neural networks. Proc Natl Acad Sci U S A. 1991;88:4433–4437. doi:10.1073/pnas.88.10.4433
36. Leibo JZ, Liao Q, Anselmi F, Poggio TA. The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex. PLOS Comput Biol. 2015;11:e1004390. doi:10.1371/journal.pcbi.1004390
37. Leibo JZ, Liao Q, Anselmi F, Freiwald WA, Poggio T. View-Tolerant Face Recognition and Hebbian Learning Imply Mirror-Symmetric Neural Tuning to Head Orientation. Curr Biol. 2016;1–6. doi:10.1016/j.cub.2016.10.015
38. Wiskott L, Sejnowski TJ. Slow feature analysis: unsupervised learning of invariances. Neural Comput. 2002;14:715–770. doi:10.1162/089976602317318938
39. Jacobs AL, Fridman G, Douglas RM, Alam NM, Latham PE, Prusky GT, et al. Ruling out and ruling in neural codes. Proc Natl Acad Sci U S A. 2009;106:5936–5941.
40. Isik L, Meyers EM, Leibo JZ, Poggio T. The dynamics of invariant object recognition in the human visual system. J Neurophysiol. 2014;111:91–102. doi:10.1152/jn.00394.2013
41. Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999;2:1019–1025.
42. Mutch J, Knoblich U, Poggio T. CNS: a GPU-based framework for simulating cortically-organized networks. MIT-CSAIL-TR-2010-013. 2010. Available: http://dspace.mit.edu/handle/1721.1/51839
43. Mutch J, Anselmi F, Tacchetti A, Rosasco L, Leibo JZ, Poggio T. Invariant Recognition Predicts Tuning of Neurons in Sensory Cortex. In: Zhao Q, editor. Computational and Cognitive Neuroscience of Vision. Singapore: Springer Singapore; 2017. pp. 85–104. doi:10.1007/978-981-10-0213-7_5
44. Liao Q, Leibo J, Poggio T. Unsupervised learning of clutter-resistant visual representations from natural videos. arXiv preprint arXiv:1409.3879. 2014;
45. Stringer SM, Perry G, Rolls ET, Proske JH. Learning invariant object recognition in the visual system with continuous transformations. Biol Cybern. 2006;94:128–142. doi:10.1007/s00422-005-0030-z
46. Wiskott L, Sejnowski TJ. Slow feature analysis: Unsupervised learning of invariances. Neural Comput. 2002;14:715–770.
47. Wallis G, Bülthoff HH. Effects of temporal association on recognition memory. Proc Natl Acad Sci U S A. 2001;98:4800–4804.
48. Földiák P. Learning Invariance from Transformation Sequences. Neural Comput. 1991;3:194–200. doi:10.1162/neco.1991.3.2.194
49. Kingma DP, Ba JL. Adam: a Method for Stochastic Optimization. Int Conf Learn Represent. 2015;1–13.
50. Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167. 2015;1–11.
51. Tacchetti A, Mallapragada PK, Rosasco L, Santoro M. GURLS: a least squares library for supervised learning. J Mach Learn Res. 2013;14:3201–3205.
52. Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T, et al. Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp. 2625–2634. doi:10.1109/CVPR.2015.7298878
Supplementary information

Recurrent Neural Networks

We constructed a recurrent version of our hierarchical model and assessed how well the resulting representation supports invariant and non-invariant action recognition, as well as how closely it matches neural data, in order to try and account for the feedback mechanisms discussed in the main text. The network architecture is identical to the Model with Learned Templates used for Experiment 1, Experiment 2 and Experiment 3 (see Methods section). The Linear layers, however, were replaced by a recurrent structure with a hidden state of 128 units, resulting in an architecture similar to the one described in [52]. Classification accuracy on an invariant and a non-invariant action recognition task, as well as the quality of the match between the representation and neural data, are presented in [Supplementary Figure 1]. All procedures and data splits are identical to those described in the Methods section in the main text. Recurrent neural networks do not exceed the performance of simpler feedforward models on the particular action recognition task presented in this work; however, it is hard to draw definitive conclusions about their relevance to neural modeling, given that the stimuli we used were relatively short and contained pre-segmented actions.
Supplementary Figure 1

a) Classification accuracy, within and across changes in 3D viewpoint, for a Recurrent Convolutional Neural Network. This architecture does not outperform a purely feedforward baseline. b) A Recurrent Convolutional Neural Network does not produce a dissimilarity structure that better agrees with the neural data than a purely feedforward baseline.
RSA over time

For completeness, we report here the absolute value of the normalized Spearman Correlation Coefficient between the model with learned templates and the neural data. The time series is obtained by repeating the procedure outlined in the Methods section using neural data from each time point (stimulus onset is at time 0); the time point used for the figures in the main text corresponds to the vertical black line. Results are presented in [Supplementary Figure 2].

Supplementary Figure 2

Spearman Correlation Coefficient between the dissimilarity structure constructed using the representation of 50 videos computed from the Spatiotemporal Convolutional Neural Network with learned templates and the neural data, over all possible choices of the neural data time bin. Neural data is most informative for the action content of the stimulus at the time indicated by the vertical black line [22].