Invariant recognition drives neural representations of action sequences

Andrea Tacchetti*, Leyla Isik* and Tomaso Poggio
Center for Brains, Minds and Machines, MIT
*denotes equal contribution
Abstract

Recognizing the actions of others from visual stimuli is a crucial aspect of human visual perception that allows individuals to respond to social cues. Humans are able to identify similar behaviors and discriminate between distinct actions despite transformations, like changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding motion perception at the neural level have not always translated into precise accounts of the computational principles underlying what representation our visual cortex evolved or learned to compute. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human-level performance in complex discriminative tasks. Within this class of models, architectures that better support invariant object recognition also produce image representations that match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations remains unknown. Here we show that spatiotemporal CNNs appropriately categorize video stimuli into actions, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. Our results support our hypothesis that performance on invariant discrimination dictates the neural representations of actions computed by human visual cortex. Moreover, these results broaden the scope of the invariant recognition framework for understanding visual intelligence from perception of inanimate objects and faces in static images to the study of human perception of action sequences.
Author Summary

Recognizing the actions of others from video sequences across changes in viewpoint, actor, gait or illumination is a hallmark of human visual intelligence. A large number of studies have highlighted which areas in the human brain are involved in the processing of biological motion, and many more have described how single neurons behave in response to videos of human actions. However, little is known about the computational necessities that shaped these neural mechanisms either through evolution or experience. In this paper, we test the hypothesis that this computational goal is the discrimination of complex video stimuli according to their action content. We show that, within the class of Spatiotemporal Convolutional Neural Networks (ST-CNNs), deliberate model modifications that lead to representations of videos that better support robust action discrimination also produce representations that better match human neural data. Because modifying a naïve model to increase its performance in invariant recognition resulted in a representation similar to human neural data, these results suggest that, similarly to what is known for object recognition, a process of performance optimization for invariant discrimination, constrained to a hierarchical ST-CNN-like architecture, shaped the neural mechanisms underlying our ability to perceive the actions of others.
Introduction

Humans' ability to recognize the actions of others is a crucial aspect of visual perception. Remarkably, the accuracy with which we can finely discern what others are doing is largely unaffected by transformations that, while substantially changing the visual appearance of a given scene, do not change the semantics of what we observe (e.g. a change in viewpoint). Recognizing actions, the middle ground between action primitives and activities [1], across these transformations is a hallmark of human visual intelligence which has proven elusive to replicate in artificial systems. Because of this, invariance to transformations that are orthogonal to a learning task has been the subject of extensive theoretical and empirical investigation in both artificial and biological perception and recognition [2,3].
Recent studies have described the neural processing mechanisms underlying our ability to recognize the actions of others from visual stimuli. Specific brain areas have been implicated in the processing of biological motion and the perception of actions. For example, in humans and other primates, the Superior Temporal Sulcus, and particularly its posterior portion, is believed to participate in the processing of biological motion and actions [4–11]. In addition to studying entire brain regions that engage during action recognition, a number of studies have characterized the responses of single neurons. The preferred stimuli of neurons in visual areas V1 and MT are well approximated by moving edge-detection filters and energy-based pooling mechanisms [12,13]. Neurons in the STS region of macaque monkeys respond selectively to actions, are invariant to changes in actors and viewpoint [14], and their tuning curves are well modeled by simple snippet-matching models [15]. More broadly, the neural computations underlying action recognition in visual cortex are organized as a hierarchical succession of spatiotemporal feature detectors of increasing size and complexity [16,17]. Despite precise characterization of the organizational, regional and single-unit data involved in the processing of biological motion, little is known about what computational tasks might be relevant to explaining and recapitulating its computations.
Fueled by advances in computer vision methods for object and scene categorization, recent studies have made progress towards linking computational outcomes to neural signals through quantitatively accurate models of single neurons in Inferior Temporal cortex (IT) [18]. In addition, these studies have highlighted a correlation between performance optimization on discriminative object recognition tasks and the accuracy of neural predictions, both at the single recording site and neural representation level [18–21]. However, these results have not been extended to action perception and dynamic stimuli. What specific computational goals relate to the biological structure that underlies our ability to recognize the actions of others, and in particular to the representations of action sequences human visual cortex learned, or evolved, to compute, remains unknown. Here we test our hypothesis that invariant recognition might fill this gap.
We use artificial systems for action recognition and compare their data representations to human Magnetoencephalography (MEG) recordings [22]. We show that, within the Spatiotemporal Convolutional Neural Networks model class [16,17,23,24], deliberate modifications that result in better performing models on invariant action recognition tasks also lead to empirical dissimilarity matrices that better match those obtained from human neural recordings. Our results suggest that performance optimization on discriminative tasks, especially those that require generalization across complex transformations, alongside the constraints imposed by the hierarchical organization of motion processing in visual cortex, determined the representation of action sequences computed by visual cortex. Moreover, by highlighting the role of robustness to nuisances that are orthogonal to the discrimination task, our results extend the scope of invariant recognition as a computational framework for understanding human visual intelligence to the study of action recognition from video sequences.
Results

Action discrimination with Spatiotemporal Convolutional representations

We filmed a video dataset showing five actors performing five actions (drink, eat, jump, run and walk) at five different viewpoints [Figure 1]. We then developed four variants of feedforward hierarchical models of visual cortex and used them to extract feature representations of videos showing two different viewpoints, frontal and side. Subsequently, we trained a machine learning classifier to discriminate video sequences into different action classes based on the model output. We then evaluated the classifier accuracy in predicting the action depicted in new, unseen videos.
Figure 1: Action recognition stimulus set

Sample frames from the action recognition dataset consisting of 2 s video clips depicting five actors performing five actions (top row: drink, eat, jump, run and walk). Actions were recorded at five different viewpoints (bottom row: 0 (frontal), 45, 90 (side), 135 and 180 degrees with respect to the normal to the focal plane). They were all performed on a treadmill, and actors held a water bottle and an apple in their hand regardless of the action they performed in order to minimize low-level object/action confounds. Actors were centered in the frame and the background was held constant regardless of viewpoint.
The four models we developed to extract video representations were instances of Spatiotemporal Convolutional Neural Networks (ST-CNNs), currently the best performing artificial perception systems for action recognition [23]. These architectures are direct extensions of the Convolutional Neural Networks used to recognize objects or faces in static images [25,26] to input stimuli that extend both in space and time. ST-CNNs are hierarchical models that build selectivity to specific stimuli through template matching operations and robustness to transformations through pooling operations [Figure 2]. Qualitatively, Spatiotemporal Convolutional Neural Networks detect the presence of a certain video segment (a template) in their input stimulus; detections for various templates are then aggregated, following a hierarchical architecture, to construct video representations. Nuisances that should not be reflected in the model output, like changes in position, are discarded through a pooling mechanism [27].
Figure 2: Spatiotemporal Convolutional Neural Networks

Schematic overview of the class of models we used: Spatiotemporal Convolutional Neural Networks (ST-CNNs). ST-CNNs are hierarchical feature extraction architectures. Input videos go through layers of computation and the output of each layer serves as input to the next layer. The output of the last layer constitutes the video representation used in downstream tasks. The models we considered consisted of two convolutional-pooling layer pairs, denoted as Conv1, Pool1, Conv2 and Pool2. Convolutional layers performed template matching with a shared set of templates at all positions in space and time (spatiotemporal convolution), and pooling layers increased robustness through max-pooling operations. Convolutional layers' templates can be either fixed a priori, sampled or learned. In this example, templates in the first layer, Conv1, are fixed and depict moving Gabor-like receptive fields, while templates in the second convolutional layer, Conv2, are sampled from a set of videos containing actions and filmed at different viewpoints.
We considered a basic, purely convolutional model, and subsequently introduced modifications to its pooling mechanism and template learning rule to improve performance on invariant action recognition [26]. The first, purely convolutional model consisted of convolutional layers with fixed templates, interleaved with pooling layers that computed max-operations across contiguous regions of space. In particular, templates in the first convolutional layer contained moving Gabor filters, while templates in the second convolutional layer were sampled from a set of action sequences collected at various viewpoints. The second, Unstructured Pooling model allowed pooling layer units to span random sets of templates as well as contiguous space regions [Figure 3b]. The third, Structured Pooling model allowed pooling over contiguous regions of space as well as across templates depicting the same action at various viewpoints. The 3D orientation of each template was discarded through this pooling mechanism, similarly to how position in space is discarded in traditional CNNs [Figure 3a] [2,28]. The fourth and final model employed backpropagation, a gradient based optimization method, to learn convolutional layers' templates by iteratively maximizing performance on an action recognition task [26]. We used each model to extract feature representations of action sequences at two different viewpoints, frontal and side. We then sought to evaluate the four models based on how well they could support discrimination between the five actions in our video dataset. We trained a machine learning classifier to discriminate stimuli based on action and later assessed classification accuracy on new, unseen videos. In Experiment 1, we trained and tested the classifier using model features computed from videos captured at the same viewpoint. In Experiment 2, we trained and tested the classifier using model features computed from videos at mismatching viewpoints (e.g. if the classifier was trained using videos captured at the frontal viewpoint, then testing would be conducted using videos at the side viewpoint).
Figure 3: Structured and Unstructured Pooling

We introduced modifications to the basic ST-CNN to increase robustness to changes in 3D viewpoint. Qualitatively, Spatiotemporal Convolutional Neural Networks detect the presence of a certain video segment (a template) in their input stimulus. The 3D orientation of this template is discarded by the pooling mechanism in our structured pooling model, analogous to how position in space is discarded in a traditional CNN. a) In models with Structured Pooling, the template set for Conv2 layer cells was sampled from a set of videos containing four actors performing five actions at five different viewpoints (see Methods). All templates sampled from videos of a specific actor performing a specific action were pooled together by one Pool2 layer unit. b) Models employing Unstructured Pooling allowed Pool2 cells to pool over the entire spatial extent of their input as well as across channels. These models used the exact same templates employed by models relying on Structured Pooling and matched these models in the number of templates wired to a pooling unit. However, the assignment of templates to pooling units was randomized (uniform, without replacement) and did not reflect any semantic structure.
Experiment 1: Action discrimination - viewpoint match condition

In Experiment 1, we trained and tested the action classifier using feature representations of videos acquired at the same viewpoint, and therefore did not investigate robustness to changes in viewpoint. All models produced representations that successfully classified videos based on the action they depicted [Figure 4]. We observed a significant difference in performance between model 4, the end-to-end trainable model, and fixed template models 1, 2 and 3 (see Methods section). However, the task considered in Experiment 1 was not sufficient to rank the four types of ST-CNN models.
Figure 4: Action recognition: viewpoint match condition

We trained a supervised machine learning classifier to discriminate videos based on their action content using the feature representation computed by each of the Spatiotemporal Convolutional Neural Network models we considered. This figure shows the prediction accuracy of a machine learning classifier trained and tested using videos recorded at the same viewpoint. The classifier was trained on videos depicting four actors performing five actions at either the frontal or side view. The machine learning classifier accuracy was then assessed using new, unseen videos of a new, unseen actor performing those same five actions. No generalization across changes in 3D viewpoint was required of the feature extraction and classification system. Here we report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates outperform models with fixed templates significantly on this task. Chance is 1/5 and is indicated by a horizontal line. Horizontal lines at the top indicate significant differences between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Methods section).
Experiment 2: Action discrimination - viewpoint mismatch condition

The four ST-CNN models we developed were designed to have varying degrees of tolerance to changes in viewpoint. In Experiment 2, we investigated how well these model representations could support learning to discriminate video sequences based on their action content across changes in viewpoint. The general experimental procedure was identical to the one outlined for Experiment 1, except we used features extracted from videos acquired at mismatching viewpoints for training and testing (e.g., a classifier trained using videos captured at the frontal viewpoint would be tested on videos at the side viewpoint). All the models we considered produced representations that were, at least to a minimal degree, useful to discriminate actions invariantly to changes in viewpoint [Figure 5]. Unlike what we observed in Experiment 1, it was possible to rank the models we considered based on performance on this task. This was expected, since the various models were designed to exhibit various degrees of robustness to changes in viewpoint (see Methods section). The end-to-end trainable model (model 4) performed better than models 1, 2 and 3, which used fixed templates, on this task. Within the fixed templates group, as expected, models that employed a Structured Channel Pooling mechanism to increase robustness performed best [29].
Figure 5: Action recognition: viewpoint mismatch condition

This figure shows the prediction accuracy of a machine learning classifier trained and tested using videos recorded at opposed viewpoints. When the classifier was trained using videos at, say, the frontal viewpoint, its accuracy in discriminating new, unseen videos would be established using videos recorded at the side viewpoint. Here we report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates resulted in significantly higher accuracy in this task. Among models with fixed templates, Spatiotemporal Convolutional Neural Networks employing Structured Pooling outperformed both purely convolutional and Unstructured Pooling models. Chance is 1/5, indicated with a horizontal line. Horizontal lines at the top indicate significant differences between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Methods section).
Comparison of model representations and neural recordings

We used Representational Similarity Analysis (RSA) to assess how well each model feature representation matched the human neural data. RSA produces a measure of agreement between artificial models and brain recordings based on the correlation between empirical dissimilarity matrices constructed using either the model representation of a set of stimuli, or recordings of the neural responses these stimuli elicit [Figure 6] [21]. We used video feature representations extracted by each model from a set of new, unseen stimuli to construct model dissimilarity matrices, and Magnetoencephalography (MEG) sensor recordings of the neural activity elicited by those same stimuli to compute a neural dissimilarity matrix [22]. Finally, we constructed a dissimilarity matrix using an action categorical oracle. In this case, the dissimilarity between videos of the same action was zero and the distance across actions was one.
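As a concrete illustration of this procedure, the sketch below builds a correlation-based dissimilarity matrix from a feature matrix; the array shapes and random data are placeholders standing in for model features and MEG sensor readings, not the actual stimuli.

```python
import numpy as np

def dissimilarity_matrix(features):
    """Correlation-based representational dissimilarity matrix (RDM).

    features: (n_stimuli, n_dims) array, one row per video.
    Returns an (n_stimuli, n_stimuli) matrix of 1 - Pearson correlation,
    rescaled to lie between 0 and 1.
    """
    rdm = 1.0 - np.corrcoef(features)                   # 1 - r for every pair
    return (rdm - rdm.min()) / (rdm.max() - rdm.min())  # normalize to [0, 1]

# Placeholder data: 50 stimuli, 512-dim model features, 306 MEG sensors.
model_rdm = dissimilarity_matrix(np.random.randn(50, 512))
neural_rdm = dissimilarity_matrix(np.random.randn(50, 306))
```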
Figure 6: Feature representation empirical dissimilarity matrices

We used feature representations, extracted with the four Spatiotemporal Convolutional Neural Network models, from 50 videos depicting five actors performing five actions at two different viewpoints, frontal and side. Moreover, we obtained Magnetoencephalography (MEG) recordings of human subjects' brain activity while they were watching these same videos, and used these recordings as a proxy for the neural representation of these videos. These videos were not used to construct or learn any of the models. For each of the six representations of each video (four artificial models, a categorical oracle and one set of neural recordings) we constructed an empirical dissimilarity matrix using linear correlation and normalized it between 0 and 1. Empirical dissimilarity matrices on the same set of stimuli constructed with video representations from a) Model 1: Purely Convolutional model, b) Model 2: Unstructured Pooling model, c) Model 3: Structured Pooling model, d) Model 4: Learned templates model, e) Categorical oracle and f) Magnetoencephalography brain recordings.
We observed that the end-to-end trainable model (model 4) produced dissimilarity structures that better agreed with those constructed from neural data than models with fixed templates [Figure 7]. Within models with fixed templates, model 3, constructed using a Structured Pooling mechanism to build invariance to changes in viewpoint, produced representations that agreed better with the neural data than models employing Unstructured Pooling (model 2) and purely convolutional models (model 1).
Figure 7: Representational Similarity Analysis between Model representations and Human Neural data

We computed the Spearman Correlation Coefficient (SCC) between the lower triangular portion of the dissimilarity matrix constructed with each of the artificial models we considered and the dissimilarity matrix constructed with neural data (shown and described in Figure 6). We assessed the uncertainty of this measure by resampling the rows and columns of the matrices we constructed. The noise ceiling was assessed by computing the SCC between each individual human subject's dissimilarity matrix and the average dissimilarity matrix over the rest of the subjects. The noise floor was computed by assessing the SCC between the lower portion of the dissimilarity matrix constructed using each of the model representations and a scrambled version of the neural dissimilarity matrix. The score reported here is normalized so that the noise ceiling is 1 and the noise floor is 0. Models with learned templates agree with the neural data significantly better than models with fixed templates. Among these, models with Structured Pooling outperform both purely Convolutional and Unstructured models. Horizontal lines at the top indicate significant differences between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Methods section).
Discussion

We have shown that, within the Spatiotemporal Convolutional Neural Networks model class, and across a deliberate set of model modifications, feature representations that are more useful to discriminate actions in video sequences, in a manner that is robust to changes in viewpoint, also produce empirical dissimilarity structures that are more similar to those constructed using human neural data. Discrimination performance on a simpler task, one that does not require generalization across complex transformations, was not sufficient to fully rank the model representations. These results support our hypothesis that performance on invariant discriminative tasks shaped the neural representations of actions that are computed by our visual cortex. Moreover, dissimilarity matrices constructed with ST-CNN representations match those built with neural data better than a purely categorical dissimilarity matrix. This highlights the importance of both the computational task and the architectural constraints, described in previous accounts of the neural processing of action and motion, to build quantitatively accurate models of neural data representations [30]. Our findings are in agreement with what has been reported for the perception of objects from static images, both at the single recording site and at the whole brain level [18–20], and identify a computational task that explains and recapitulates the representations of human action in visual cortex.
We developed the four ST-CNN models using deliberate modifications to improve the models' feature representations for invariant action recognition. In so doing, we verified that structured pooling architectures and memory based learning (model 3), as previously described and theoretically motivated [2,3], can be applied to build representations of video sequences that support recognition invariant to complex, non-affine transformations. However, empirically, we found that learning model templates using gradient based methods and a fully supervised action recognition task (model 4) led to better results, both in terms of classification accuracy and agreement with neural recordings [20].
A limitation of the methods we employed is that the extent of the match between a model representation and the neural data is appraised solely based on the correlation between the empirical dissimilarity structures constructed with neural recordings and model representations. This relatively abstract comparison provides no guidance in establishing a one-to-one mapping between model units and brain regions or sub-regions and therefore cannot exclude models on the basis of biological implausibility [19]. In this work, we mitigated this limitation by constraining the model class to reflect previous accounts of neural computational units and mechanisms that are involved in the perception of motion [12,13,17,31,32].
Furthermore, the class of models we developed in our experiments is purely feedforward; however, the neural recordings we selected were acquired 470 ms after stimulus onset. This late in visual processing, it is likely that feedback signals are among the energy sources captured by the recordings. These signals are not accounted for in our models. We provide evidence that adding a feedback mechanism, through recursion, does not improve recognition performance nor correlation with the neural data [Supplementary Figure 1]. We cannot, however, exclude that this is due to the stimuli and discrimination task we designed, which only considered pre-segmented, relatively short action sequences.
Recognizing the actions of others from complex visual stimuli is a crucial aspect of human perception. We investigated the relevance of invariant action discrimination to improving model representations' agreement with neural recordings and showed that it is one of the computational principles shaping the representation of human action sequences human visual cortex evolved, or learned, to compute. Our deliberate approach to model design underlined the relevance of both supervised, gradient based performance optimization methods and memory based, structured pooling methods to the modeling of neural data representations. If and how primate visual cortex could implement gradient based optimization or acquire the necessary supervision remains, despite recent efforts, an unsettled matter [33–35]. Memory-based learning and structured pooling have been investigated extensively as biologically plausible learning algorithms [2,36–38]. Irrespective of the precise biological mechanisms that could carry out performance optimization on invariant discriminative tasks, computational studies point to its relevance to understanding neural representations of visual scenes [18–20]. Recognizing the semantic category of visual stimuli across photometric, geometric or more complex changes, in very low sample regimes, is a hallmark of human visual intelligence. By building data representations that support this kind of robust recognition, we have shown here, one obtains empirical dissimilarity structures that match those constructed using human neural data. In the wider context of the study of perception, our results strengthen the claim that the computational goal of human visual cortex is to support invariant recognition by broadening it to the study of action perception.
Materials and methods

Action Recognition Dataset

We collected a dataset of five actors performing five actions (drink, eat, jump, run and walk) on a treadmill at five different viewpoints (0, 45, 90, 135 and 180 degrees between the line across the center of the treadmill and the line normal to the focal plane of the video camera). We rotated the treadmill rather than the camera to keep the background constant across changes in viewpoint [Figure 1]. The actors were instructed to hold an apple and a bottle in their hand regardless of the action they were performing, so that objects and background would not differ between actions. Each action/actor/view was filmed for at least 52 s. Subsequently the original videos were cut into 26 clips, each 2 s long, resulting in a dataset of 3,250 video clips (5 actions x 5 actors x 5 viewpoints x 26 clips). Video clips started at random points in the action cycle (for example, a jump might start mid-air or before the actor's feet left the ground) and each 2 s clip contained a full action cycle. The authors manually identified one single spatial bounding box that contained the entire body of each actor and cropped all videos according to this bounding box.
Recognizing actions with spatiotemporal convolutional representations

General Experimental Procedure

Experiment 1 and Experiment 2 were designed to quantify the amount of action information extracted from video sequences by four computational models of primate visual cortex. In Experiment 1, we tested basic action recognition. In Experiment 2, in particular, we further quantified whether this action information could support action recognition robustly to changes in viewpoint. The motivating idea behind our design is that, if a machine learning classifier is able to discriminate unseen video sequences based on their action content, using the output of a computational model, then this model representation contains some action information. Moreover, if the classifier is able to discriminate videos based on action at new, unseen viewpoints using model outputs, then it must be that these model representations not only carry action information, but that changes in viewpoint are not reflected in the model output. This procedure is analogous to neural decoding techniques, with the important difference that the output of an artificial model is used in lieu of brain recordings [39,40].
The general experimental procedure is as follows: we constructed feedforward hierarchical spatiotemporal convolutional models and used them to extract feature representations of a number of video sequences. We then trained a machine learning classifier to predict the action label of a video sequence based on its feature representation. Finally, we quantified the performance of the classifier by measuring prediction accuracy on a set of new, unseen videos.

The procedure outlined above was performed using three separate subsets of the action recognition dataset described in the previous section. In particular, constructing a spatiotemporal convolutional model requires access to video sequences depicting actions to sample or learn convolutional layers' templates. The subset of video sequences used to learn or sample templates was called the "embedding set". Training and testing the classifier required extracting model responses from a number of video sequences; these sequences were organized in two subsets: "training set" and "test set". There was never any overlap between the "test set" and the union of "training set" and "embedding set".
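The sketch below illustrates one way to realize this three-subset split; the clip metadata and the choice of a single left-out actor are illustrative assumptions, not the authors' code.

```python
# Hypothetical clip metadata: one entry per (actor, action, viewpoint).
clips = [{"actor": a, "action": act, "view": v}
         for a in range(5)
         for act in ("drink", "eat", "jump", "run", "walk")
         for v in (0, 45, 90, 135, 180)]

test_actor = 4  # the left-out actor defining this split

# Embedding set: all videos of the remaining four actors; convolutional
# templates are sampled or learned from these.
embedding_set = [c for c in clips if c["actor"] != test_actor]

# Training set: a subset of the embedding set at a single viewpoint.
training_set = [c for c in embedding_set if c["view"] == 0]

# Test set: the left-out actor only, so it never overlaps the union of
# the embedding and training sets.
test_set = [c for c in clips if c["actor"] == test_actor and c["view"] == 0]

assert all(c["actor"] != test_actor for c in embedding_set + training_set)
```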
Experiment 1

The purpose of Experiment 1 was to assess how well the data representations produced by each of the four models supported a non-invariant action recognition task. In particular, the embedding set used to sample or learn templates contained videos showing all five actions at all five viewpoints performed by four of the five actors. The training set was a subset of the embedding set, and contained only videos at either the frontal viewpoint or the side viewpoint. Lastly, the test set contained videos of all five actions, performed by the fifth, left-out actor at either the frontal or side viewpoint. We obtained five different splits by choosing each of the five actors exactly once for test. After the templates had either been learned or sampled, we used each model to extract representations of the train and test set videos. We averaged the performance over the two possible choices of training viewpoint, frontal or side. We report the mean and standard error of the classification accuracy across the five possible choices of the test actor.
Experiment 2

Experiment 2 was designed to assess the performance of each model in producing data representations that were useful to classify videos according to their action content when generalization across changes in viewpoint was required. The experiment is identical to Experiment 1, and used the exact same models. However, when the training set contained videos recorded at the frontal viewpoint, the test set would contain videos at the side viewpoint, and vice-versa. We report the mean and standard deviation over the choice of the test actor of the average accuracy over the choice of training viewpoint.
Feedforward Spatiotemporal Convolutional Neural Networks

Feedforward Spatiotemporal Convolutional Neural Networks (ST-CNNs) are hierarchical models: input video sequences go through layers of computations and the output of each layer serves as input to the next layer [Figure 2]. These models are direct generalizations of models of the neural mechanisms that support recognizing objects in static images [27,41] to stimuli that extend in both space and time (i.e. video stimuli). Within each layer, single computational units process a portion of the input video sequence that is compact both in space and time. The outputs of each layer's units are then processed and aggregated by units in the subsequent layers to construct a final signature representation for the whole input video. The sequence of layers we adopted alternates layers of units which perform template matching (or convolutional layers) and layers of units which perform max pooling operations [15,17,23]. Units' receptive field sizes increase as the signal propagates through the hierarchy of layers.
All convolutional units within a layer share the same set of templates (filter bank) and output the dot product between each filter and their input. Qualitatively, these models work by detecting the presence of a certain video segment (a template) in the input stimulus. The exact position in space and time of the detection is discarded by the pooling mechanism. The specific models we present here consist of two convolutional-pooling layer pairs. The layers are denoted as Conv1, Pool1, Conv2 and Pool2 [Figure 2]. Convolutional layers are completely characterized by the size, content and stride of their units' receptive fields, and pooling layers are completely characterized by the operation they perform (in the cases we considered, outputting the maximum value of their input) and their pooling regions (which can extend across space, time and filters).
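A minimal numpy sketch of these two operations is given below; it is a naive, loop-based illustration of valid spatiotemporal template matching and max pooling, not the optimized implementation used in the experiments.

```python
import numpy as np

def st_convolve(video, template):
    """Dot product of a template with a video at every space-time position.

    video: (T, H, W) array; template: (t, h, w) array with t<=T, h<=H, w<=W.
    Returns the valid correlation map (stride 1, no subsampling).
    """
    T, H, W = video.shape
    t, h, w = template.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * template)
    return out

def max_pool(resp, region=(1, 4, 4), stride=(1, 2, 2)):
    """Output the maximum of each (time, space, space) pooling region."""
    rt, rh, rw = region
    st, sh, sw = stride
    T, H, W = resp.shape
    return np.array([[[resp[i:i+rt, j:j+rh, k:k+rw].max()
                       for k in range(0, W - rw + 1, sw)]
                      for j in range(0, H - rh + 1, sh)]
                     for i in range(0, T - rt + 1, st)])
```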
Model 1: Purely convolutional models with sampled templates

The purely convolutional models with fixed and sampled templates we considered were implemented using the Cortical Network Simulator package [42].

The input videos were (128 x 76 pixel) x 60 frames; the model received the original input videos alongside two scaled-down versions of it (scaling factors of ½ and ¼ in each spatial dimension, respectively).

The first layer, Conv1, consisted of convolutional units with 72 templates of size (7 x 7 pixel) x 3 frames, (9 x 9 pixel) x 4 frames and (11 x 11 pixel) x 5 frames. Convolution was carried out with a stride of 1 pixel (no spatial subsampling). Conv1 filters were obtained by letting Gabor-like receptive fields shift in space over frames (as described in previous studies of the receptive fields of V1 and MT cells [12,13,31]). The full expression for each filter was as follows:
$$f(x, y, t; \theta, v, \sigma, \lambda, \tau) = m(t)\, \exp\!\left(-\frac{x'(\theta, v, t)^2 + y'(\theta, v, t)^2}{2\sigma^2}\right) \cos\!\left(\frac{2\pi\, y'(\theta, v, t)}{\lambda}\right)$$
where x'(θ, v, t) and y'(θ, v, t) are transformed coordinates that take into account a rotation by θ and a shift by vt in the direction orthogonal to θ. The Gabor filters we considered had a spatial aperture (in both spatial directions) of σ = 0.6s, with s representing the spatial receptive field size, and a wavelength λ proportional to σ [43]. Each filter had a preferred orientation θ chosen among 8 possible orientations (0, 45, 90, 135, 180, 225, 270, 315 degrees with respect to vertical). Each template was obtained by letting the Gabor-like receptive field just described shift in the direction orthogonal to its preferred orientation (e.g. a vertical edge would move sideways) with a speed v chosen from a linear grid of 3 points between 4/3 and 4 pixels per frame (the shift in the first frame of the template was chosen so that the mean of the Gabor-like receptive field's envelope would be centered in the middle frame). Lastly, Conv1 templates had a time modulation of the standard motion-energy form

$$m(t) = (kt)^{\tau} e^{-kt}\left[\frac{1}{\tau!} - \frac{(kt)^2}{(\tau + 2)!}\right]$$

with τ = 3 and t = 0, ..., T, with T the temporal receptive field size [13,17].
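A sketch of how such a moving Gabor template can be generated follows. The parameterization (the drift centering, the Adelson-Bergen-style temporal envelope, and the example aperture and wavelength values) is an assumption reconstructed from the expressions above, not the authors' CNS code.

```python
import numpy as np
from math import factorial

def moving_gabor(s, T, theta, v, sigma, lam, tau=3, k=1.0):
    """Moving Gabor template of T frames, each s x s pixels.

    theta: preferred orientation (radians); v: drift speed in pixels/frame,
    applied orthogonally to theta; sigma, lam: Gabor aperture and wavelength;
    tau, k: parameters of the temporal envelope m(t) (assumed values).
    """
    ys, xs = np.mgrid[0:s, 0:s] - (s - 1) / 2.0
    filt = np.empty((T, s, s))
    for t in range(T):
        # Shift the envelope orthogonally to theta, centred on the middle frame.
        shift = v * (t - (T - 1) / 2.0)
        x = xs - shift * np.cos(theta + np.pi / 2)
        y = ys - shift * np.sin(theta + np.pi / 2)
        xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        m = (k * t) ** tau * np.exp(-k * t) * (
            1 / factorial(tau) - (k * t) ** 2 / factorial(tau + 2))
        filt[t] = m * np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) \
                    * np.cos(2 * np.pi * yr / lam)
    return filt

# Example: a (9 x 9 pixel) x 4 frame template at 45 degrees, 2 px/frame.
g = moving_gabor(s=9, T=4, theta=np.pi / 4, v=2.0, sigma=0.6 * 9, lam=4.0)
```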
The second layer, Pool1, performed max pooling operations on its input by simply finding and outputting the maximum value of each pooling region. Responses to each channel in the Conv1 filter bank were pooled independently, and units pooled across regions of space: (4 x 4 units in space) x 1 unit in time, with a stride of 2 units in space and 1 unit in time, and two scale channels.
A second convolutional layer, Conv2, followed Pool1. Templates in this case were sampled randomly from the Pool1 responses to videos in the embedding set. We used a model with 512 Conv2 units with sizes (9 x 9 units in space) x 3 units in time, (17 x 17 units in space) x 7 units in time and (25 x 25 units in space) x 11 units in time, and a stride of 1 in all directions.

Finally, the Pool2 layer units performed max pooling. Pooling regions extended over the entire spatial input, one temporal unit, all remaining scales, and a single Conv2 channel.
Models 2 and 3: Structured and Unstructured Pooling models with sampled templates

Unstructured and Structured Pooling models (models 2 and 3, respectively) were constructed by modifying the Pool2 layer of the purely convolutional models. Specifically, in these models Pool2 units pooled over the entire spatial input, one temporal unit, all remaining scales and 8 or 9 Conv2 channels (512 channels divided among 60 Pool2 units).

In the models employing a Structured Pooling mechanism, all templates sampled from videos of a particular actor performing a particular action, regardless of viewpoint, were pooled together [Figure 3a]. Templates of different sizes and corresponding to different scale channels were pooled independently. This resulted in 6 Pool2 units per action/actor pair, one for each receptive-field-size/scale-channel pair. The intuition behind the Structured Pooling mechanism is that the resulting Pool2 units will respond strongly to the presence of a certain template (e.g. the torso of someone running) regardless of its 3D pose [2,28,29,44–48].
The models employing an Unstructured Pooling mechanism followed a similar pattern; however, the wiring between simple and complex cells was random [Figure 3b]. The fixed template models (models 1, 2 and 3) employed the exact same set of templates (we sampled the templates from the embedding sets only once and used them in all three models) and differed only in their pooling mechanisms.
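The difference between the two wiring schemes can be made concrete with the short sketch below; the response shapes and template tags are hypothetical stand-ins for Conv2 outputs.

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical Conv2 responses (channels, time, height, width); each channel
# corresponds to one sampled template tagged with (actor, action, view).
n_views = 5
responses = rng.standard_normal((40, 10, 8, 8))
tags = [(actor, action, view)
        for actor in range(2) for action in range(4) for view in range(n_views)]

def pool_unit(resp, channels):
    """One Pool2 unit: max over a set of channels and the full spatial extent."""
    return resp[channels].max(axis=(0, 2, 3))  # keeps the time axis

# Structured pooling: group channels by (actor, action), discarding viewpoint.
groups = {}
for i, (actor, action, view) in enumerate(tags):
    groups.setdefault((actor, action), []).append(i)
structured = [pool_unit(responses, chans) for chans in groups.values()]

# Unstructured pooling: same templates, same group size, random assignment.
perm = rng.permutation(len(tags))
unstructured = [pool_unit(responses, list(perm[i:i + n_views]))
                for i in range(0, len(tags), n_views)]
```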
Model 4: Learned template models

Models with learned templates were implemented using Torch packages. These models' templates were trained to recognize actions from videos in the embedding set using a cross-entropy loss function, full supervision and backpropagation [49]. The models' general architecture was similar to the one we used for models with structured and unstructured pooling. Specifically, during template learning we used two stacked Convolution-BatchNorm-MaxPooling-BatchNorm modules [50] followed by two Linear-ReLU-BatchNorm modules (ReLU units are half-rectifiers) and a final Log-Soft-Max layer. During feature extraction, the Linear and LogSoftMax layers were discarded. Input videos were resized to (128 x 76 pixel) x 60 frames, like in the fixed-template models. The first convolutional layer's filter bank comprised 72 filters of size (9 x 9 pixel) x 9 frames, and convolution was applied with a stride of 2 in all directions. The first max-pooling layer used pooling regions of size (4 x 4 units in space) x 1 unit in time, applied with a stride of 2 units in both spatial directions and 1 unit in time. The second convolutional layer's filter bank was made up of 60 templates of size (17 x 17 units in space) x 3 units in time; responses were computed with a stride of 2 units in time and 1 unit in all spatial directions. The second max-pooling layer's units pooled over the full extent of both spatial dimensions, 1 unit in time and 5 channels. Lastly, the Linear layers had 256 and 128 units respectively and free bias terms. Model training was carried out using Stochastic Gradient Descent [49] and mini-batches of 10 videos.
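A modern PyTorch sketch of this architecture is shown below. It is an approximation: the original was written in Torch, the spatial padding on the second convolution is added so that the 17 x 17 kernel fits the pooled feature map, and the channel pooling of the second max-pooling layer is replaced by a purely spatial pool.

```python
import torch
import torch.nn as nn

class LearnedTemplateSTCNN(nn.Module):
    """Approximate re-implementation of the learned-template ST-CNN."""
    def __init__(self, n_actions=5):
        super().__init__()
        # Conv3d expects (batch, channels, time, height, width).
        self.features = nn.Sequential(
            nn.Conv3d(1, 72, kernel_size=(9, 9, 9), stride=2),
            nn.BatchNorm3d(72),
            nn.MaxPool3d(kernel_size=(1, 4, 4), stride=(1, 2, 2)),
            nn.BatchNorm3d(72),
            nn.Conv3d(72, 60, kernel_size=(3, 17, 17),
                      stride=(2, 1, 1), padding=(0, 8, 8)),
            nn.BatchNorm3d(60),
            nn.AdaptiveMaxPool3d((None, 1, 1)),  # pool the full spatial extent
        )
        # Discarded at feature-extraction time, used only during training.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(), nn.BatchNorm1d(256),
            nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128),
            nn.Linear(128, n_actions), nn.LogSoftmax(dim=1),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Log-softmax outputs pair with nn.NLLLoss and SGD, mini-batches of 10 videos.
model = LearnedTemplateSTCNN()
out = model(torch.randn(2, 1, 60, 128, 76))  # (batch, 1, frames, H, W)
```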
Machine learning classifier

We used the GURLS package [51] to train and test a Regularized Least Squares Gaussian-kernel classifier using features extracted from the training and test sets, respectively, and the corresponding action labels. The aperture of the Gaussian kernel as well as the l2 regularization parameter were chosen with a leave-one-out cross-validation procedure on the training set. Accuracy was evaluated separately for each class and then averaged over classes.
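For readers without access to GURLS, the following is a rough Python stand-in using scikit-learn's kernel ridge regression; the one-vs-all encoding, parameter grids and random data are illustrative assumptions, not the GURLS configuration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.preprocessing import label_binarize

# Placeholder features: 50 training / 25 test videos, 5 action classes.
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((50, 128)), rng.integers(0, 5, 50)
X_test, y_test = rng.standard_normal((25, 128)), rng.integers(0, 5, 25)

# One-vs-all regularized least squares: regress one-hot labels with a
# Gaussian (RBF) kernel; kernel width and l2 penalty picked by leave-one-out.
Y_train = label_binarize(y_train, classes=list(range(5)))
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"alpha": np.logspace(-3, 1, 5), "gamma": np.logspace(-4, 0, 5)},
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)
search.fit(X_train, Y_train)

pred = search.predict(X_test).argmax(axis=1)           # best-scoring class
per_class = [np.mean(pred[y_test == c] == c) for c in np.unique(y_test)]
accuracy = float(np.mean(per_class))                   # class-averaged
```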
Significance testing: Model accuracy

We used a group one-way ANOVA to assess the significance of the difference in performance between all the fixed-template methods and the models with learned templates. We then used a paired-sample t-test with Bonferroni correction to assess the significance level of the difference between the performance of individual models. Differences were deemed significant at p < 0.05.
Quantifying agreement between model representations and neural recordings

Neural recordings

The brain activity of 8 human participants with normal or corrected-to-normal vision was recorded with an Elekta Neuromag Triux Magnetoencephalography (MEG) scanner while they watched 50 videos (five actors, five actions, two viewpoints: front and side) acquired with the same procedure outlined above, but not included in the dataset used for model template sampling or training. The MEG data were first presented in [22] (the reference also details all acquisition, preprocessing and decoding methods). In the original neural recording study, MEG recordings were used to train a pattern classifier to discriminate video stimuli on the basis of the neural response they elicited. The performance of the pattern classifier was then assessed on a separate set of recordings from the same subjects. This train/test decoding procedure was repeated every 10 ms and individually for each subject, both in a non-invariant (train and test at the same viewpoint) and an invariant (train at one viewpoint and test at the different viewpoint) case. It was possible to discriminate videos based on their action content from the neural response they elicited [22].

We used the filtered MEG recordings (all 306 sensors) elicited by each of the 50 videos mentioned above, averaged across subjects and averaged over a 100 ms window centered around 470 ms after stimulus onset (the time of maximum accuracy for action decoding), as a proxy for the neural representation of each video.
Representational Similarity Analysis

We computed the pairwise correlation-based dissimilarity matrix for each of the model representations of the 50 videos that were shown to human subjects in the MEG. Likewise, we computed the empirical dissimilarity matrix using the MEG neural recordings. We then performed 50 rounds of bootstrap; in each round we randomly sampled 30 videos out of the original 50 (corresponding to 30 rows and columns of the dissimilarity matrices). For each 30-video sample, we assessed the level of agreement of the dissimilarity matrix induced by each model representation with the one computed using neural data by calculating the Spearman Correlation Coefficient (SCC) between the lower triangular portions of the two matrices.
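A compact sketch of this resampling procedure, assuming the RDMs from the earlier snippet, might read:

```python
import numpy as np
from scipy.stats import spearmanr

def lower_tri(rdm):
    """Vectorize the strictly lower-triangular part of a dissimilarity matrix."""
    return rdm[np.tril_indices_from(rdm, k=-1)]

def bootstrap_scc(model_rdm, neural_rdm, n_rounds=50, n_sample=30, seed=0):
    """SCC between model and neural RDMs over resampled stimulus subsets."""
    rng = np.random.default_rng(seed)
    n = model_rdm.shape[0]
    sccs = []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n_sample, replace=False)  # 30 of 50 videos
        sub_m = model_rdm[np.ix_(idx, idx)]
        sub_n = neural_rdm[np.ix_(idx, idx)]
        sccs.append(spearmanr(lower_tri(sub_m), lower_tri(sub_n)).correlation)
    return np.array(sccs)
```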
We computed an estimate for the noise ceiling in the neural data by repeating the bootstrap procedure outlined above to assess the level of agreement between an individual human subject and the average of the rest. We then selected the highest possible match score across subjects and across 100 rounds of bootstrap to serve as the noise ceiling.

Similarly, we assessed a chance level for the Representational Similarity score by computing the match between each model and a scrambled version of the neural data matrix. We performed 100 rounds of bootstrap per model (reshuffling the neural dissimilarity matrix rows and columns each time) and selected the maximum score across rounds of bootstrap and models to serve as the baseline score [21].
We normalized the SCC obtained by comparing each model representation to the neural recordings by re-scaling it to fall between 0 (chance level) and 1 (noise ceiling). On this normalized scale, anything positive matches the neural data better than chance with p < 0.01.
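Explicitly, the rescaling described above corresponds to

$$\mathrm{score} = \frac{\mathrm{SCC}_{\mathrm{model}} - \mathrm{SCC}_{\mathrm{floor}}}{\mathrm{SCC}_{\mathrm{ceiling}} - \mathrm{SCC}_{\mathrm{floor}}}$$

where the floor and ceiling are the chance level and noise ceiling estimates defined above.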
Significance testing: Matching neural data

We used a one-way group ANOVA to assess the difference between the Spearman Correlation Coefficients (SCC) obtained using models that employed fixed templates and models with learned templates. Subsequently, we assessed the significance of the difference between the SCC of each model by performing a paired t-test between the samples obtained through the bootstrap procedure. We deemed differences to be significant when p < 0.05 (Bonferroni corrected).
Acknowledgements

We would like to thank Georgios Evangelopoulos, Charles Frogner, Patrick Winston, Gabriel Kreiman, Martin Giese, Charles Jennings, Heuihan Jhuang, and Cheston Tan for their feedback on this work. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We gratefully acknowledge the support of NVIDIA with the donation of the Tesla K40 GPU used for this research. We are grateful to the Martinos Imaging Center at MIT, where the neural recordings used in this work were acquired, and to the McGovern Institute for Brain Research at MIT for supporting this research.
References
1. Moeslund TB, Granum E. A Survey of Computer Vision-Based Human Motion Capture. Comput Vis Image Underst. 2001;81:231–268. doi:10.1006/cviu.2000.0897
2. Anselmi F, Leibo JZ, Rosasco L, Mutch J, Tacchetti A, Poggio T. Unsupervised learning of invariant representations. Theor Comput Sci. 2015; doi:10.1016/j.tcs.2015.06.048
3. Poggio TA, Anselmi F. Visual Cortex and Deep Networks: Learning Invariant Representations. MIT Press; 2016.
4. Perrett DI, Smith PA, Mistlin AJ, Chitty AJ, Head AS, Potter DD, et al. Visual analysis of body movements by neurones in the temporal cortex of the macaque monkey: a preliminary report. Behav Brain Res. 1985;16:153–170. doi:10.1016/0166-4328(85)90089-0
5. Vangeneugden J, Pollick F, Vogels R. Functional differentiation of macaque visual temporal cortical neurons using a parametric action space. Cereb Cortex. 2009;19:593–611. doi:10.1093/cercor/bhn109
6. Grossman ED, Blake R. Brain areas active during visual perception of biological motion. Neuron. 2002;35:1167–1175. doi:10.1016/S0896-6273(02)00897-8
7. Vaina LM, Solomon J, Chowdhury S, Sinha P, Belliveau JW. Functional neuroanatomy of biological motion perception in humans. Proc Natl Acad Sci U S A. 2001;98:11656–61. doi:10.1073/pnas.191374198
8. Beauchamp MS, Lee KE, Haxby JV, Martin A. FMRI responses to video and point-light displays of moving humans and manipulable objects. J Cogn Neurosci. 2003;15:991–1001. doi:10.1162/089892903770007380
9. Peelen MV, Downing PE. Selectivity for the human body in the fusiform gyrus. J Neurophysiol. 2005;93:603–8. doi:10.1152/jn.00513.2004
10. Grossman ED, Jardine NL, Pyles JA. fMR-Adaptation Reveals Invariant Coding of Biological Motion on the Human STS. Front Hum Neurosci. 2010;4:15. doi:10.3389/neuro.09.015.2010
11. Vangeneugden J, Peelen MV, Tadin D, Battelli L. Distinct neural mechanisms for body form and body motion discriminations. J Neurosci. 2014;34:574–85. doi:10.1523/JNEUROSCI.4032-13.2014
12. Adelson EH, Bergen JR. Spatiotemporal energy models for the perception of motion. J Opt Soc Am A. 1985;2:284–299.
13. Simoncelli EP, Heeger DJ. A model of neuronal responses in visual area MT. Vision Res. 1998;38:743–761.
14. Singer JM, Sheinberg DL. Temporal cortex neurons encode articulated actions as slow sequences of integrated poses. J Neurosci. 2010;30:3133–3145.
15. Tan C, Singer JM, Serre T, Sheinberg D, Poggio T. Neural representation of action sequences: how far can a simple snippet-matching model take us? In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems 26. Curran Associates, Inc.; 2013. pp. 593–601.
16. Giese MA, Poggio T. Neural Mechanisms for the Recognition of Biological Movements. Nat Rev Neurosci. 2003;4:179–192. doi:10.1038/nrn1057
17. Jhuang H, Serre T, Wolf L, Poggio T. A Biologically Inspired System for Action Recognition. IEEE 11th International Conference on Computer Vision (ICCV 2007). 2007. pp. 1–8.
18. Yamins DLK, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc Natl Acad Sci U S A. 2014;111:8619–24. doi:10.1073/pnas.1403112111
19. Yamins DLK, DiCarlo JJ. Using goal-driven deep learning models to understand sensory cortex. Nat Neurosci. 2016;19:356–365. doi:10.1038/nn.4244
20. Khaligh-Razavi SM, Kriegeskorte N. Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLoS Comput Biol. 2014;10. doi:10.1371/journal.pcbi.1003915
21. Kriegeskorte N, Mur M, Bandettini P. Representational similarity analysis – connecting the branches of systems neuroscience. Front Syst Neurosci. 2008; doi:10.3389/neuro.06.004.2008
22. Isik L, Tacchetti A, Poggio T. Fast, invariant representation for human action in the visual system. arXiv preprint. 2016;
23. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Li FF. Large-scale video classification with convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. pp. 1725–1732. doi:10.1109/CVPR.2014.223
24. Ji S, Yang M, Yu K, Xu W. 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell. 2013;35:221–31. doi:10.1109/TPAMI.2012.59
25. Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999;2:1019–25. doi:10.1038/14819
26. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444. doi:10.1038/nature14539
27. Serre T, Oliva A, Poggio T. A feedforward architecture accounts for rapid categorization. Proc Natl Acad Sci U S A. 2007;104:6424–6429. doi:10.1073/pnas.0700622104
28. Leibo JZ, Liao Q, Anselmi F, Poggio T. The invariance hypothesis implies domain-specific regions in visual cortex. PLoS Comput Biol. Public Library of Science; 2015;11:e1004390.
29. Leibo JZ, Mutch J, Poggio T. Learning to discount transformations as the computational goal of visual cortex? Presented at FGVC/CVPR 2011, Colorado Springs, CO. 2011; Available: Nature Precedings at dx.doi.org/10.1038/npre.2011.6078.1
30. Jhuang H. A biologically inspired system for action recognition. 2008. Available: http://hdl.handle.net/1721.1/42247
31. Rust NC, Schwartz O, Movshon JA, Simoncelli EP. Spatiotemporal elements of macaque V1 receptive fields. Neuron. 2005;46:945–956.
32. Rust NC, Mante V, Simoncelli EP, Movshon JA. How MT cells analyze the motion of visual patterns. Nat Neurosci. 2006;9:1421–1431.
33. Bengio Y, Lee D-H, Bornschein J, Lin Z. Towards Biologically Plausible Deep Learning. arXiv preprint arXiv:1502.04156. 2015;
34. Liao Q, Leibo JZ, Poggio T. How Important is Weight Symmetry in Backpropagation? arXiv:1510.05067. 2015;1837–1844.
35. Mazzoni P, Andersen RA, Jordan MI. A more biologically plausible learning rule for neural networks. Proc Natl Acad Sci U S A. 1991;88:4433–4437. doi:10.1073/pnas.88.10.4433
36. Leibo JZ, Liao Q, Anselmi F, Poggio TA. The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex. PLOS Comput Biol. 2015;11:e1004390. doi:10.1371/journal.pcbi.1004390
37. Leibo JZ, Liao Q, Anselmi F, Freiwald WA, Poggio T. View-Tolerant Face Recognition and Hebbian Learning Imply Mirror-Symmetric Neural Tuning to Head Orientation. Curr Biol. 2016;1–6. doi:10.1016/j.cub.2016.10.015
38. Wiskott L, Sejnowski TJ. Slow feature analysis: unsupervised learning of invariances. Neural Comput. 2002;14:715–770. doi:10.1162/089976602317318938
39. Jacobs AL, Fridman G, Douglas RM, Alam NM, Latham PE, Prusky GT, et al. Ruling out and ruling in neural codes. Proc Natl Acad Sci U S A. 2009;106:5936–5941.
40. Isik L, Meyers EM, Leibo JZ, Poggio T. The dynamics of invariant object recognition in the human visual system. J Neurophysiol. 2014;111:91–102. doi:10.1152/jn.00394.2013
41. Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999;2:1019–1025.
42. Mutch J, Knoblich U, Poggio T. CNS: a GPU-based framework for simulating cortically-organized networks. MIT-CSAIL-TR-2010-013. 2010. Available: http://dspace.mit.edu/handle/1721.1/51839
43. Mutch J, Anselmi F, Tacchetti A, Rosasco L, Leibo JZ, Poggio T. Invariant Recognition Predicts Tuning of Neurons in Sensory Cortex. In: Zhao Q, editor. Computational and Cognitive Neuroscience of Vision. Singapore: Springer Singapore; 2017. pp. 85–104. doi:10.1007/978-981-10-0213-7_5
44. Liao Q, Leibo J, Poggio T. Unsupervised learning of clutter-resistant visual representations from natural videos. arXiv preprint arXiv:1409.3879. 2014;
45. Stringer SM, Perry G, Rolls ET, Proske JH. Learning invariant object recognition in the visual system with continuous transformations. Biol Cybern. 2006;94:128–142. doi:10.1007/s00422-005-0030-z
46. Wiskott L, Sejnowski TJ. Slow feature analysis: Unsupervised learning of invariances. Neural Comput. 2002;14:715–770.
47. Wallis G, Bülthoff HH. Effects of temporal association on recognition memory. Proc Natl Acad Sci U S A. 2001;98:4800–4804.
48. Földiák P. Learning Invariance from Transformation Sequences. Neural Comput. 1991;3:194–200. doi:10.1162/neco.1991.3.2.194
49. Kingma DP, Ba JL. Adam: a Method for Stochastic Optimization. Int Conf Learn Represent. 2015;1–13.
50. Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167. 2015;1–11.
51. Tacchetti A, Mallapragada PK, Rosasco L, Santoro M. GURLS: a least squares library for supervised learning. J Mach Learn Res. 2013;14:3201–3205.
52. Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T, et al. Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp. 2625–2634. doi:10.1109/CVPR.2015.7298878
Supplementary information

Recurrent Neural Networks

We constructed a recurrent version of our hierarchical model and assessed how well the resulting representation supports invariant and non-invariant action recognition, as well as how closely it matches neural data, in order to try and account for the feedback mechanisms discussed in the main text. The network architecture is identical to the Model with Learned Templates used for Experiment 1, Experiment 2 and Experiment 3 (see Methods section). The Linear layers, however, were replaced by a recurrent structure with a hidden state of 128 units, resulting in an architecture similar to the one described in [52]. Classification accuracy on an invariant and a non-invariant action recognition task, as well as the quality of the match between the representation and neural data, are presented in [Supplementary Figure 1]. All procedures and data splits are identical to those described in the Methods section in the main text. Recurrent neural networks do not exceed the performance of simpler feedforward models on the particular action recognition task presented in this work; however, it is hard to draw definitive conclusions about their relevance to neural modeling, given that the stimuli we used were relatively short and contained pre-segmented actions.
Supplementary Figure 1

a) Classification accuracy, within and across changes in 3D viewpoint, for a Recurrent Convolutional Neural Network. This architecture does not outperform a purely feedforward baseline. b) A Recurrent Convolutional Neural Network does not produce a dissimilarity structure that better agrees with the neural data than a purely feedforward baseline.
RSA over time

For completeness, we report here the absolute value of the normalized Spearman Correlation Coefficient between the model with learned templates and the neural data. The time series is obtained by repeating the procedure outlined in the Methods section using neural data from each time point (stimulus onset is at time 0); the time point used for the figures in the main text corresponds to the vertical black line. Results are presented in [Supplementary Figure 2].

Supplementary Figure 2

Spearman Correlation Coefficient between the dissimilarity structure constructed using the representation of 50 videos computed from the Spatiotemporal Convolutional Neural Network with learned templates and the neural data, over all possible choices of the neural data time bin. Neural data is most informative for the action content of the stimulus at the time indicated by the vertical black line [22].