ebook img

Deep Neural Networks predict Hierarchical Spatio-temporal Cortical Dynamics of Human Visual Object Recognition PDF

2.3 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Deep Neural Networks predict Hierarchical Spatio-temporal Cortical Dynamics of Human Visual Object Recognition

Deep Neural Networks predict Hierarchical Spatio-temporal Cortical Dynamics of Human Visual Object Recognition RadoslawM.Cichy†∗ AdityaKhosla† DimitriosPantazis† AntonioTorralba† AudeOliva† †MassachusettsInstituteofTechnology ∗FreeUniversityBerlin {rmcichy, khosla, pantazis, torralba, oliva}@mit.edu 6 1 0 Abstract Herewetakeanalternativeroute,constructingandcom- 2 paring against brain signals a visual computational model n The complex multi-stage architecture of cortical visual basedondeepneuralnetworks(DNNs)[30,32],i.e.,com- a pathways provides the neural basis for efficient visual ob- puter vision models in which model neuron tuning prop- J ject recognition in humans. However, the stage-wise com- erties are set by supervised learning without manual inter- 2 putations therein remain poorly understood. Here, we vention[30,37]. DNNsarethebestperformingmodelson 1 comparedtemporal(magnetoencephalography)andspatial computer vision object recognition benchmarks and yield ] (functional MRI) visual brain representations with repre- humanperformancelevelsonobjectcategorization[38,19]. V sentationsinanartificialdeepneuralnetwork(DNN)tuned We used a tripartite strategy to reveal the spatio-temporal C tothestatisticsofreal-worldvisualrecognition.Weshowed processing cascade underlying human visual object recog- . thattheDNNcapturedthestagesofhumanvisualprocess- nitionbyDNNmodelcomparisons. s c inginbothtimeandspacefromearlyvisualareastowards First, as object recognition is a process rapidly unfold- [ the dorsal and ventral streams. Further investigation of ing over time [5, 9, 41], we compared DNN visual rep- crucial DNN parameters revealed that while model archi- 1 resentationstomillisecondresolvedmagnetoencephalogra- v tecture was important, training on real-world categoriza- phy(MEG)braindata. Ourresultsdelineate,toourknowl- 0 tionwasnecessarytoenforcespatio-temporalhierarchical edgeforthefirsttime,anorderedrelationshipbetweenthe 7 relationships with the brain. Together our results provide 9 stagesofprocessingincomputervisionmodelandthetime an algorithmically informed view on the spatio-temporal 2 coursewithwhichobjectrepresentationsemergeinthehu- dynamics of visual object recognition in the human visual 0 manbrain. . brain. 1 Second,asobjectrecognitionrecruitsamultitudeofdis- 0 6 tributed brain regions, a full account of object recognition 1 1.Introduction needstogobeyondtheanalysisofafewpre-definedbrain : regions [1, 6, 18, 21, 48], determining the relationship be- v i Visualobjectrecognitioninhumansismediatedbycom- tween DNNs and the whole brain. Using a spatially unbi- X plex multi-stage processing of visual information emerg- ased approach, we revealed a hierarchical relationship be- r ingrapidlyinadistributednetworkofcorticalregions[43, tweenDNNsandtheprocessingcascadeofboththeventral a 15, 5, 31, 23, 26, 14]. Understanding visual object recog- anddorsalvisualpathway. nition in cortex thus requires a predictive and quantitative Third, interpretation of a DNN-brain comparison de- modelthatcapturesthecomplexityoftheunderlyingspatio- pends on the factors shaping the DNN fundamentally: the temporaldynamics[35,36,34]. pre-specified model architecture, the training procedure, A major impediment in creating such a model is the and the learned task (e.g. object categorization). By com- highly nonlinear and sparse nature of neural tuning prop- paring different DNN models to brain data, we demon- ertiesinmid-andhigh-levelvisualareas[11,44,47]thatis strated the influence of each of these factors on the emer- difficulttocaptureexperimentally,andthusunknown. Pre- gence of similarity relations between DNNs and brains in vious approaches to modeling object recognition in cortex bothspaceandtime. relied on extrapolation of principles from well understood lower visual areas such as V1 [35, 36] and strong manual Together, our results provide an algorithmically in- intervention,achievingonlymodesttaskperformancecom- formedperspectiveofthespatio-temporaldynamicsunder- paredtohumans. lyingvisualobjectrecognitioninthehumanbrain. 1 2.Results viewingrandomsequencesofthesame118real-worldob- jectimagesetwhileconductinganorthogonaltask. Theex- 2.1. Construction of a DNN performing at human perimentaldesignwasadaptedtothespecificsofthemea- levelinobjectcategorization surementtechnique(Suppl.Fig.1). To be a plausible model of object recognition in cor- WecomparedfMRIandMEGbrainmeasurementswith tex,acomputationalmodelmustprovidehighperformance the DNN in a common analysis framework with represen- onvisualobjectcategorization. Latestgenerationsofcom- tationalsimilarityanalysis[28](Fig.2(b)). Thebasicidea putervisionmodels,termeddeepneuralnetworks(DNNs), isthatiftwoimagesaresimilarlyrepresentedinthebrain, have achieved extraordinary performance, thus raising the they should also be similarly represented in the DNN. To questionwhethertheiralgorithmicrepresentationsbearre- quantify, we first obtained signal measurements in tempo- semblanceoftheneuralcomputationsunderlyinghumanvi- rally specific MEG sensor activation patterns (1ms steps sion.Toinvestigatewecreatedan8-layerDNNarchitecture from−100to+1000ms), inspatiallyspecificfMRIvoxel (Fig. 1(a)) that corresponds to the best-performing model patterns, and in layer-specific model neuron activations of inobjectclassificationintheImageNetLargeScaleVisual theDNN.Tomakethedifferentsignalspaces(fMRI,MEG, Recognition Challenge 2012 [29]. Each DNN layer per- DNN) comparable, we abstracted signals to a similarity formssimpleoperationsthatareimplementableinbiologi- space. In detail, for each signal space we computed dis- calcircuits,suchasconvolution,poolingandnormalization. similarities (1−Spearman’s ρ for DNN and fMRI, percent WetrainedtheDNNtoperformobjectcategorizationonev- decodingaccuracyinpair-wiseclassificationforMEG)be- erydayobjectcategories(683categories,with 1300images tweeneverypairofconditions(images),asexemplifiedby ineachcategory)usingbackpropagation,i.e.,thenetwork images1and2inFig.2(b). Thisyielded118×118repre- learnedneuronaltuningfunctionsbyitself. Wetermedthis sentational dissimilarity matrices (RDMs) indexed in rows neural network object deep neural network (object DNN). and columns by the compared conditions. These RDMs TheobjectDNNperformedequallywellonobjectcatego- weretime-resolvedforMEG,space-resolvedforfMRI,and rizationaspreviousimplementations(Suppl.Table1). We layer-resolvedinDNN.ComparingDNNRDMswithMEG investigated the coding of visual information in the object RDMsresultedintimecourseshighlightinghowDNNpre- DNNbydeterminingthereceptivefield(RF)selectivityof dicted emerging visual representations. Comparing DNN themodelneuronsusinganeuroscience-inspiredreduction RDMswithfMRIRDMsresultedinspatialmapsindicative method[49]. ofhowtheobjectDNNpredictedbrainactivity. We found that neurons in early layers had Gabor filter 2.3. The object DNN predicted temporal dynam- orcolorpatch-likesensitivity, whilethoseofdeeperlayers ics of emerging visual representations in the hadlargerRFsandsensitivitytocomplexforms(Fig.1(b)). humanbrain Thus the object DNN learned representations in a hierar- chyofincreasingcomplexity,akintorepresentationsinthe Visual information processing in the brain is a process primate visual brain hierarchy [23, 14]. Figure 1(c) ex- thatrapidlyevolvesovertime[5,9,41],andamodelofob- emplifies the connectivity and receptive field selectivity of ject recognition in cortex should mirror this temporal evo- themoststronglyconnectedneuronsstartingfromasample lution. While the DNN used here does not model time, neuron in layer 1. An online tool offering visualization of it has a clear sequential structure: information flows from RFselectivityofallneuronsinlayers1through5isavail- one layer to the next in strict order. We thus investigated ableathttp://brainmodels.csail.mit.edu. whether the object DNN predicted emerging visual repre- sentationsinthefirstfewhundredmillisecondsofvisionin 2.2. Representational similarity analysis was used sequential order. For this we determined representational as the integrative framework for DNN-brain similarity between layer-specific DNN representations and comparison MEG data in millisecond steps from −100 to +1000ms To compare representations in the object DNN and hu- withrespecttoimageonsetandlayer-specificDNNrepre- man brains, we used a 118-image set of natural objects sentations. WefoundthatalllayersoftheobjectDNNwere on real-world backgrounds (Fig. 2(a)). Note that these representationally similar to human brain activity, indicat- 118 images were not used for training the object DNN ingthatthemodelcapturesemergingbrainvisualrepresen- to avoid circular inference. With 94% correct perfor- tations (Fig. 3(a), P < 0.05 cluster definition threshold, mance in a top-five categorization task on this 118 image P < 0.05clusterthreshold, linesabovedatacurvescolor- set, the network performed at a level comparable to hu- coded same as those indicate significant time points, for mans [38] (voting on each of the 118 images is available details see Suppl. Table 2). We next investigated whether athttp://brainmodels.csail.mit.edu). thehierarchyofthelayeredarchitectureoftheobjectDNN, We also recorded fMRI and MEG in 15 participants as characterized by an increasing size and complexity of Figure1: Deepneuralnetworkarchitectureandproperties. (a)TheDNNarchitecturecomprised8layers. Eachoflayers 1−5containedacombinationofconvolution,max-poolingandnormalizationstages,whereasthelastthreelayerswerefully connected. The DNN takes pixel values as inputs and propagates information feed-forward through the layers, activating modelneuronswithparticularactivationvaluessuccessivelyateachlayer. (b)Visualizationofmodelreceptivefields(RFs) selectivity. Each row shows the 4 images most strongly activating two exemplary model neurons for layers 1 through 5, withshadedregionshighlightingtheimageareaprimarilydrivingtheneuronresponse. (c)VisualizationofexampleDNN connectionsandneuronRFselectivity. Thethicknessofhighlightedlines(coloredtoeasevisualization)indicatestheweight ofthestrongestconnectionsgoinginandoutofneurons,startingfromasampleneuroninlayer1. Combinedvisualization ofneuronRFselectivityandconnectionsbetweenneurons,herestartingfromasampleneuroninlayer1(onlypartsofthe networkforvisualization).Neuronsinlayer1arerepresentedbytheirfilters,andinlayers2−5bygraydots.Inlaysshowthe 4imagesthatmoststronglyactivateeachneuron. Acompletevisualizationofallneuronsinlayers1through5isavailable athttp://brainmodels.csail.mit.edu. modelRFsfeatureselectivity,correspondedtothehierarchy DNN relationships on the real-world object categorization of temporal processing in the brain. That is, we examined task.Tothisgoal,wecreatedseveraldifferentDNNmodels whether low and high layers of the object DNN predicted (Fig.5(a)). Wereasonedthatacomparisonofbrainwith1) earlyandlatebrainrepresentations,respectively. Wefound anuntrainedDNNwouldrevealtheeffectofDNNarchitec- this to be the case: There was a positive hierarchical re- turealone,2)aDNNtrainedonanalternatecategorization lationship (n = 15, Spearman’s ρ = 0.35, P = 0.0007) task, scene categorization, would reveal the effect of spe- between the layer number of the object DNN and posi- cifictask, and3)aDNNtrainedonanimagesetwithran- tion in the hierarchy of the deep object network and the domunecologicalassignmentofimagestocategorylabels, peak latency of the correlation time courses between ob- oraDNNtrainedonnoiseimages, wouldrevealtheeffect ject DNN and MEG RDMs and deep object network layer ofthetrainingprocedureperse. RDMs(Fig.3(b)). To evaluate the hierarchy of temporal and spatial rela- Together these analyses established, to our knowledge tionships between the human brain and DNNs, we com- forthefirsttime,acorrespondenceinthesequenceofpro- puted layer-specific RDMs for each DNN. To allow di- cessing steps of a computational model of vision and the rectcomparisonsacrossmodels,wealsocomputedasingle timecoursewithwhichvisualrepresentationsemergeinthe summaryRDMforeachDNNmodelbasedonconcatenated humanbrain. layer-specificactivationvectors. Concerning the role of architecture, we found the un- 2.4. The object DNN predicted the hierarchical to- trained DNN significantly predicted emerging brain rep- pography of visual representations in the hu- resentations (Fig. 5(b)), but worse than the object DNN manventralanddorsalvisualstreams (Fig.5(c)). Asupplementarylayer-specificanalysisidenti- To localize visual representations common to brain and fiedeverylayerasasignificantcontributortotothispredic- theobjectDNN,weusedaspatiallyunbiasedsurface-based tion(Suppl.Fig.3a). Eventhoughtherelationshipbetween searchlightapproach. Comparisonofrepresentationalsimi- layernumberandthepeaklatencyofbrain-DNNsimilarity laritiesbetweenfMRIdataandobjectDNNRDMsyielded timeserieswashierarchical,itwasnegative(ρ = 0.6,P = 8layer-specificspatialmapsidentifyingthecorticalregions 0.0003, Suppl. Fig. 3b) and thus reversed and statistically wheretheobjectDNNpredictedbrainactivity(Fig.4,clus- differentfromtheobjectDNN(∆ρ = 0.96,P = 0.0003). ter definition threshold P < 0.05, cluster-threshold P < This shows that DNN architecture alone, independent of 0.05;differentviewinganglesavailableinSuppl.Movie1). task constraints or training procedures, induces represen- The results indicate a hierarchical correspondence be- tationalsimilaritytoemergingvisualrepresentationsinthe tweenmodelnetworklayersandthehumanvisualsystem. brain, but that constraints imposed by training on a real- For low DNN layers, similarities of visual representations world categorization task significantly increases this effect wereconfinedtotheoccipitallobe,i.e.,low-andmid-level andreversesthedirectionofthehierarchicalrelationship. visual regions, and for high DNN layers in more anterior Concerning the role of task, we found the scene DNN regionsinboththeventralanddorsalvisualstream. Asup- also predicted emerging brain representations, but worse plementary volumetric searchlight analysis (Suppl. Text 1, than the object DNN (Fig. 5(b,c); Suppl. Fig. 3c). This Suppl.Fig.2;usingafalsediscoveryratecorrectionallow- suggeststhattaskconstraintsinfluencethemodelandpos- ing voxel-wise inference reproduced these findings, yield- siblyalsobraininapartlyoverlapping,andpartlydissocia- ingcorroborativeevidenceacrossanalysismethodologies. ble manner. Further, the relationship between layer num- Theseresultssuggestthathierarchicalsystemsofvisual ber and brain-DNN similarity time series was positively representationsemergeinboththehumanventralanddor- hierarchical for the scene DNN (ρ = 0.44,P = 0.001, sal visual stream as the result of task constraints of object Suppl. Fig. 3(d)), and not different from the object DNN categorization posed in everyday life, and provide strong (∆ρ = 0.09,P = 0.41), further suggesting overlapping evidenceforobjectrepresentationsinthedorsalstreamin- neuralmechanismsforobjectandsceneperception. dependentofattentionormotorintention. Concerningtheroleofthetrainingoperation, wefound boththeunecologicalandnoiseDNNspredictedbrainrep- 2.5. Factors determining DNN’s predictability of resentations (Fig. 5(b), Suppl. Fig 3(e,g)), but worse than visualrepresentationsemergingintime the object DNN (Fig. 5(c)). Further, there was no evi- The observation of a positive and hierarchical relation- dence for a hierarchical relationship between layer num- shipbetweentheobjectDNNandbraintemporaldynamics ber and brain-DNN similarity time series for either DNN poses the fundamental question of the origin of this rela- (unecological DNN: ρ = 0.01,P = 0.94; noise DNN: tionship. ThreefundamentalfactorsshapeDNNs: architec- ρ = 0.04,P = 0.68; Suppl. Fig. 3(f,h)), and both had a ture,task,andtrainingprocedure. Determiningtheeffectof weaker hierarchical relationship than the object DNN (un- eachiscrucialtounderstandingtheemergenceofthebrain- ecological DNN: ∆ρ = 0.39,P = 0.0107; noise DNN: Figure2:StimulussetandcomparisonofbrainandDNNrepresentations.(a)Thestimulussetconsistedof118imagesof distinctobjectcategories. (b)RepresentationalsimilarityanalysisbetweenMEG,fMRIandDNNdata. Ineachsignalspace (fMRI,MEG,DNN)wesummarizedrepresentationalstructurebycalculatingthedissimilaritybetweenactivationpatternsof differentpairsofconditions(hereexemplifiedfortwoobjects: busandorange). Thisyieldedrepresentationaldissimilarity matrices (RDMs) indexed in rows and columns by the compared conditions. We calculated millisecond resolved MEG RDMs from 100ms to +1000ms with respect to image onset, layer-specific DNN RDMs (layers 1 through 8) and voxel- specificfMRIRDMsinaspatiallyunbiasedcorticalsurface-basedsearchlightprocedure. RDMsweredirectlycomparable (Spearman’sρ),facilitatingintegrationacrosssignalspaces. ComparisonofDNNwithMEGRDMsyieldedtimecoursesof similaritybetweenemergingvisualrepresentationsinthebrainandDNN.ComparisonoftheDNNwithfMRIRDMsyielded spatialmapsofvisualrepresentationscommontothehumanbrainandtheDNN. Figure3: TheobjectDNNpredictedtheorderoftemporallyemergingvisualrepresentationsinthehumanbrain. (a) Timecourseswithwhichrepresentationalsimilarityinthebrainandlayersofthedeepobjectnetworkemerged. Color-coded lines above data curves indicate significant time points (n = 15, cluster definition threshold P = 0.05, cluster threshold P =0.05;foronsetandpeaklatenciesseeSuppl.Table2). Grayverticallineindicatesimageonset. (b)Peaklatencyoftime coursesincreasedwithlayernumber(n = 15,ρ = 0.35,P = 0.0007,signpermutationtest),indicatingthatdeeperlayers predictedlaterbrainsignals. Errorbarsindicatestandarderrorofthemeandeterminedby10,000bootstrapsamplesofthe participantpool. ∆ρ = 0.36,P = 0.0052). Thusthetrainingoperationper above bars). Thus depending on cortical region the DNN se has an effect on the relationship to the brain, but only architecturealoneisenoughtoinducesimilaritybetweena training on real-world categorization increases brain-DNN DNNandthebrain,butthehierarchyabsent(V1,IPS1&2) similarityandhierarchy. orreversed(IT)withoutproperDNNtraining. In summary, we found that although architecture alone Concerning the role of task, we found the scene DNN predictedthetemporalemergenceofvisualrepresentations, had largely similar, albeit weaker, similarity to the brain training on real-world categorization was necessary for a than the object DNN for all ROIs (Fig. 6(a,c)), with a sig- hierarchical relationship to emerge. Thus, both architec- nificant hierarchical relationship in V1 (ρ = 0.68,P = tureandtrainingcruciallyinfluencethepredictionpowerof 0.002), but not in IT (ρ = 0.26,P = 0.155) or IPS1&2 DNNsoverthefirstfewhundredmillisecondsofvision. (ρ = 0.30,P = 0.08) (Fig. 6(b)). In addition, comparing results for the object and scene DNNs directly (Fig. 6(c)), 2.6. Factors determining DNN’s predictability of we found stronger effects for the object DNN in several the topography of visual representations in layers in all ROIs. Together these results corroborate the cortex conclusions of the MEG analysis, showing that task con- The observation of a positive and hierarchical relation- straints shape brain representations along both ventral vi- ship between the object DNN structure and the brain vi- sualstreamsinapartlyoverlapping,andpartlydissociable sualpathwaysmotivatesaninquiry,akintothetemporaldy- manner. namics analysis in the previous section, regarding the role Concerningtheroleofthetrainingoperation, wefound of architecture, task demands and training operation. For boththeunecologicalandnoiseDNNspredictedvisualrep- thiswesystematicallyinvestigatedthreeregions-of-interest resentations in V1 and IT, but not IPS1&2 (Fig. 6(a)), and (ROIs):theearlyvisualareaV1,andtworegionsup-stream with less predictive power than the object DNN in all re- in the ventral and dorsal stream, the inferior temporal cor- gions (Fig. 6(c)). A hierarchical relationship was present texITandaregionencompassingintraparietalsulcus1and and negative in V1 and IT, but not IPS1&2 (Fig. 6(b), 2 (IPS1&2), respectively. We examined whether DNNs unecological DNN: V1 ρ = 0.40,P = 0.001, IT ρ = predicted brain activity in these ROIs (Fig. 6(a)), and also 0.38,P =0.001,IPS1&2ρ=0.03,P =0.77;noiseDNN: whetherthispredictionwashierarchical(Fig.6,Suppl.Ta- V1ρ = 0.08,P = 0.42,ITρ = 0.29,P = 0.012,IPS1&2 ble4(a)). ρ=0.08,P =0.42). Concerning the role of architecture, we found the un- trained DNN predicted brain representations better than Therefore the training on a real-world categorization the object DNN in V1, but worse in IT and IPS1&2 task, but not the training operation per se, increases the (Fig.6(a,c)).Further,therelationshipwashierarchical(neg- brain-DNN similarity while inducing a hierarchical rela- ative) only in IT (ρ = 0.47,P = 0.002) (Fig. 6(b); stars tionship. Figure 4: Spatial maps of visual representations common to brain and object DNN. The object DNN predicted the hierarchicaltopographyofvisualrepresentationsinthehumanbrain. Lowlayershadsignificantrepresentationalsimilarities confinedtotheoccipitallobeofthebrain,i.e.low-andmid-levelvisualregions.Higherlayershadsignificantrepresentational similarities with more anterior regions in the temporal and parietal lobe, with layers 7 and 8 reaching far into the inferior temporal cortex and inferior parietal cortex (n = 15, cluster definition threshold P < 0.05, cluster-threshold P < 0.05, analysisseparateforeachhemisphere). Figure5: Architecture,task,andtrainingprocedureinfluencetheDNN’spredictabilityoftemporallyemergingbrain representations. (a)Wecreated5differentmodels:(1)amodeltrainedonobjectcategorization(objectDNN;Fig.1);(2)an untrainedmodelinitializedwithrandomweights(untrainedDNN)todeterminetheeffectofarchitecturealone;(3)amodel trainedonadifferentreal-worldtask, scenecategorization(sceneDNN)toinvestigatetheeffectoftask; and(4,5)amodel trainedonobjectcategorizationwithrandomassignmentofimagelabels(unecologicalDNN),orspatiallysmoothednoisy imageswithrandomassignmentofimagelabels(noiseDNN),todeterminetheeffectofthetrainingoperationindependent of task constraints. (b) All DNNs had significant representational similarities to human brains (layer-specific analysis in Suppl.Fig.3). (c)WecontrastedtheobjectDNNagainstallothermodels(subtractionofcorrespondingtimeseriesshownin (b)). RepresentationsintheobjectDNNweremoresimilartobrainrepresentationsthananyothermodel,thoughthescene DNNwasaclosesecond. Linesabovedatacurvessignificanttimepoints(n = 15, clusterdefinitionthresholdP = 0.05, clusterthresholdP =0.05;foronsetandpeaklatenciesseeSuppl.Table3(a,b)). Grayverticallinesindicatesimageonset. 3.Discussion toryanddiscoverypowerofthebrain-DNNcomparisonap- proach to understand the spatio-temporal neural dynamics By comparing the spatio-temporal dynamics in the hu- underlyingobjectrecognition.Theyprovidenovelevidence manbrainwithadeepneuralnetwork(DNN)modeltrained for a role of parietal cortex in visual object categorization, on object categorization, we provided a formal model of and give rise to the idea that the organization of the visual object recognition in cortex. We found a correspondence cortexmaybeinfluencedbyprocessingconstraintsimposed betweentheobjectDNNandthebraininbothspace(fMRI byvisualcategorizationthesamewaythatDNNrepresen- data)andtime(MEGdata).Bothcasesdemonstratedahier- tationswereinfluencedbyobjectcategorizationtasks. archy: inspacefromlow-tohigh-levelvisualareasinboth ventral and dorsal stream, in time over the visual process- 3.1.ObjectDNNpredictsahierarchyofbrainrep- ing stages in the first few hundred milliseconds of vision. resentationsinspaceandtime A systematic analysis of the fundamental determinants of thisDNN-brainrelationshipidentifiedthatthearchitecture A major impediment in modeling human object recog- alone induces similarity, but that training on a real-world nition in cortex is the lack of principled understanding of categorization task was necessary for a hierarchical rela- exactneuronaltuninginmid-andhigh-levelvisualcortex. tionship to emerge. Our results demonstrate the explana- Previous approaches thus extrapolated principles observed inlow-levelvisualcortex,withlimitedsuccessincapturing 3.2. Origin and implications of brain-DNN repre- neuronal variability and a much inferior to human behav- sentationsimilarities ioralperformance[35,36]. Investigating the influence of crucial parameters deter- Our approach allowed us to obviate this limitation by mining DNNs, we found an influence of both architecture relying on an object recognition model that learns neu- andtaskconstraintsinducedbytrainingtheDNNonareal- ronal tuning. By comparing representations between the world categorization task. This suggests that that simi- DNN and the human brain we found a hierarchical cor- lar architectural principles, i.e., convolution, max pooling respondence in both space and time: early layers of the and normalization govern both model and brains, concur- DNNpredictedvisualrepresentationsemergingearlyafter rentwiththeoriginofthoseprinciplebyobservationinthe stimulus onset, and in regions low in the cortical process- brain [35]. The stronger similarity with early rather than ing hierarchy, with progressively higher DNN layers pre- late brain regions might be explained by the fact that neu- dicting subsequent emerging representations in higher re- ral networks initialized with random weights that involve gions of both the dorsal and ventral visual pathway. Our aconvolution,nonlinearityandnormalizationstageexhibit results provide algorithmically informed evidence for the Gabor-likefilterssensitivetoorientededges,andthussimi- idea of visual processing as a step-wise hierarchical pro- larpropertiesanneuronsinearlyvisualareas[40]. cess in time [5, 9, 33] and along a system of cortical re- Although architecture alone induced similarity, training gions[15,14,16]. onareal-worldcategorizationtasksincreasedsimilarityand Inregardstothetemporal correspondenceinparticular, was necessary for a hierarchical relationship in processing ourresultsprovidefirstevidenceforahierarchicalrelation- stages between the brain and the DNN to emerge in space shipbetweencomputermodelsofvisionandthebrain.Peak and time. This demonstrates that learning constraints im- latencies between layers of the object DNN and emerging posed by a real-world categorization task crucially shape brain activations ranged between approximately 100 and therepresentationalspaceofaDNN[48],andsuggeststhat 160ms. While in agreement with prior findings about the the processing hierarchy in the human brain is a the result timenecessaryforcomplexobjectprocessing[42], ourre- ofcomputationalconstraintsimposedbyvisualobjectcat- sults go further by making explicit the step-wise transfor- egorization. Such constraints may originate in high-level mationsofrepresentationalformatthatmayunderlierapid visualregionssuchasITandIPS,bepropagatedbackwards complexobjectcategorizationbehavior. from high-level visual regions through the visual hierar- In regards to the spatial correspondence, previous stud- chies through abundantly present feedback connections in ies compared DNNs to the ventral visual stream only, thevisualstreamatalllevels[13]duringvisuallearning[2], mostlyusingaspatiallylimitedregion-of-interestapproach andprovidethebasisoflearningatallstagesoftheprocess- [18, 21, 48]. Here, using a spatially unbiased whole-brain inginvisualbrain[24]. approach[27],wediscoveredahierarchicalcorrespondence in the dorsal visual pathway. While previous studies have 3.3.Summarystatement documented object selective responses in dorsal stream in In sum, by comparing deep neural networks to human monkeys [20, 39] and humans [7, 22], it is still debated brains in space and time, we provide a spatio-temporally whether dorsal visual representations are better explained unbiased algorithmic account of visual object recognition by differential motor action associations or ability to en- inhumancortex. gage attention, rather than category membership or shape representation [17, 25]. Crucially, our results defy expla- 4.Method nation by attention or motor-related concepts, as neither playedanyroleintheDNNandthusbrain-DNNcorrespon- Participants: 15 healthy human volunteers (5 female, dence. Concurrentwiththeobservationthattemporallobe age: mean ± s.d. = 26.6±5.18 years, recruited from a resectionshowslimitedbehavioraleffectinobjectrecogni- subjectpoolatMassachusettsInstituteofTechnology)par- tion[4,46],ourresultsarguethatparietalcortexmightplay ticipatedintheexperiment. Thesamplesizewasbasedon astrongerroleinobjectrecognitionthanpreviouslyappre- methodologicalrecommendationsinliteratureforrandom- ciated. effectsfMRIandMEGanalyses. Writteninformedconsent Ourresultsthuschallengetheclassicdescriptionsofthe wasobtainedfromallsubjects. Thestudywasapprovedby dorsal pathway as a spatially- or action oriented ‘where’ the local ethics committee (Institutional Review Board of or ‘how’ pathway [43, 31], and suggest that current theo- the Massachusetts Institute of Technology) and conducted riesdescribingparietalcortexasrelatedtospatialworking according to the principles of the declaration of Helsinki. memory,visuallyguidedactionsandspatialnavigation[26] All methods were carried out in accordance with the ap- should be complemented with a role for the dorsal visual provedguidelines. streaminobjectcategorization[22]. Visual stimuli: The stimuli presented to humans and Figure6: Architecture,taskconstraints,andtrainingprocedureinfluencetheDNN’spredictabilityofthetopography of brain representations. (a) Comparison of fMRI representations in V1, IT and IPS1&2 with the layer-specific DNN representations of each model. Error bars indicate standard error of the mean as determined by bootstrapping (n = 15). (b) Correlations between layer number and brain-DNN representational similarities for the different models shown in (a). Non-zerocorrelationsindicatehierarchicalrelationships;positivecorrelationsindicateanincreaseinbrain-DNNsimilarities towardshigherlayers,andviceversafornegativecorrelations. Barscolor-codedasDNNs,starsabovebarsindicatesignifi- cance(sign-permutationtests,P < 0.05,FDR-corrected,fordetailsseeSuppl.Table4(a)). (c)ComparisonofobjectDNN againstallothermodels(subtractionofcorrespondingpointsshownin(a)). (d)Sameas(b),butforthecurvesshownin(c) (fordetailsseeSuppl.Table4b). computervisionmodelswere118colorphotographsofev- MEGacquisition: MEGsignalswereacquiredcontin- eryday objects, each from a different category, on natural uously from 306 channels (204 planar gradiometers, 102 backgrounds(Fig.2(b))fromtheImageNetdatabase[12]. magnetometers, ElektraNeuromagTRIUX,Elekta, Stock- holm)atasamplingrateof1000Hz,andfilteredonlinebe- 4.1.Experimentaldesignandtask tween0.03and330Hz.Wepreprocesseddatawithtemporal Participants viewed images presented at the center of sourcespaceseparation(maxfiltersoftware,Elekta,Stock- the screen (4◦ visual angle) for 0.5s and overlaid with a holm) before further analysis with Brainstorm1. We ex- lightgrayfixationcross. Thepresentationparameterswere tractedeachtrialwitha100msbaselineand1000mspost- adapted to the specific requirements of each acquisition stimulus recordings, removed baseline mean, smoothed technique(Suppl.Fig.1). datawitha30Hzlow-passfilter,andnormalizedeachchan- ForMEG,participantscompleted15runsof314sdura- nel with its baseline standard deviation. This yielded 30 tion. EachimagewaspresentedtwiceineachMEGrunin preprocessedtrialsperconditionandparticipant. random order with an inter-trial interval (ITI) of 0.9−1s. fMRIacquisition: Magneticresonanceimaging(MRI) Participants were asked to press a button and blink their was conducted on a 3T Trio scanner (Siemens, Erlangen, eyesinresponsetoapaperclipimageshownrandomlyev- Germany)witha32-channelheadcoil. Weacquiredstruc- ery3to5trials(average4). Thepaperclipimagewasnot tural images using a standard T1-weighted sequence (192 part of the image set, and paper clip trials were excluded sagittal slices, FOV = 256mm2, TR = 1900ms, TE = fromfurtheranalysis. 2.52ms,flipangle=9◦). For fMRI, each participant completed two independent ForfMRI,weconducted911runsinwhich648volumes sessionsof9−11runs(486sdurationeach)ontwosepa- were acquired for each participant (gradient-echo EPI se- rate days. Each run consisted of one presentation of each quence: TR=750ms,TE=30ms,flipangle=61◦,FOV imageinrandomorder,interspersedrandomlywith39null read = 192mm, FOV phase = 100% with a partial frac- trials(i.e.,25%ofalltrials)withnostimuluspresentation. tion of 6, through-plane acceleration factor 3, bandwidth 8 During the null trials the fixation cross turned darker for 1816Hz/Px,resolution=3mm3,slicegap20%,slices=33, 500ms. Participantsreportedchangesinfixationcrosshue ascendingacquisition). Theacquisitionvolumecoveredthe withabuttonpress. wholecortex. 1http://neuroimage.usc.edu/brainstorm/

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.