2016 ICSEE International Conference on the Science of Electrical Engineering

Learning an Attention Model in an Artificial Visual System

Alon Hazan, Yuval Harel and Ron Meir
Department of Electrical Engineering, Technion - Israel Institute of Technology, Technion City, Haifa, Israel

978-1-5090-2152-9/16/$31.00 (c)2016 IEEE

Abstract—Human visual perception of the world is of a large, fixed image that is highly detailed and sharp. However, receptor density in the retina is not uniform: a small central region called the fovea is very dense and exhibits high resolution, whereas the peripheral region around it has much lower spatial resolution. Thus, contrary to our perception, we are only able to observe a very small region around the line of sight with high resolution. The perception of a complete and stable view is aided by an attention mechanism that directs the eyes to numerous points of interest within the scene. The eyes move between these targets in quick, unconscious movements known as "saccades". Once a target is centered at the fovea, the eyes fixate for a fraction of a second while the visual system extracts the necessary information. An artificial visual system was built based on a fully recurrent neural network set within a reinforcement learning protocol, and learned to attend to regions of interest while solving a classification task. The model is consistent with several experimentally observed phenomena, and suggests novel predictions.

I. INTRODUCTION

Neuroscientists and cognitive scientists have many tools at their disposal to study the brain and neural networks in general, including Electroencephalography (EEG), Single-Photon Emission Computed Tomography (SPECT), functional Magnetic Resonance Imaging (fMRI) and Microelectrode Arrays (MEA), to name a few. However, the amount of information and level of control afforded by these tools do not remotely resemble what is available to an engineer working on an artificial neural network. The engineer can manipulate any neuron at any time, force certain excitations, intervene in ongoing processes, and collect as much data about the network as needed, at any level of detail. This wealth of information has enabled reverse-engineering research on artificial neural networks, leading to insights into the inner workings of trained networks. It suggests an indirect approach to studying the brain: training a biologically plausible neural network model to exhibit complex behavior observed in real brains, and reverse engineering the result.

In line with this approach, we designed an artificial visual system based on a fully recurrent, unlayered neural network that learns to perform saccadic eye movements. Saccadic eye movements are quick, unconscious, task-dependent [1] motions following the demand of attention [2], which direct the eye to new targets that require the high resolution of the fovea. These targets are usually detected within the peripheral visual system [3]. Once a target is centered at the fovea, the eye fixates for a fraction of a second while the visual system extracts the necessary information. Most eye movements are proactive rather than reactive: they predict actions in advance and do not merely respond to visual stimuli [4].

There is good evidence that much of the active vision in humans results from Reinforcement Learning (RL) [5], as part of an organism's attempt to maximize its performance while interacting with the environment [6]. Accordingly, we train the artificial visual system within the RL paradigm. The network was not explicitly engineered to perform a certain task, and does not contain an explicit memory component; rather, it has memory only by virtue of its recurrent topology. Learning takes place in a model-free setting using policy gradient techniques. We find that the network displays attributes of human learning such as: (a) decision making and gradual confidence increase along with accumulated evidence, (b) skill transfer, namely the ability to use a pre-learned skill in a certain task in order to improve learning on a related but more difficult task, and (c) selectively attending to information relevant for the task at hand, while ignoring irrelevant objects in the field of view.

II. THE ARTIFICIAL VISUAL SYSTEM

We designed an Artificial Visual System (AVS) with the task of learning an attention model to control saccadic eye movements, and subsequent classification of digits. We refer to this task as the attention-classification task. The AVS is similar in many ways to that presented in [7]. It is a simplified model of the human visual system, consisting of a small region in the center with high resolution, analogous to the human fovea, and two larger concentric regions which are sub-sampled to lower resolution and are analogous to the peripheral visual system in humans. The AVS was trained and tested on the classification of handwritten digits from the MNIST data set [8]. Only a small part of the image is visible to the AVS at any one time. Specifically, full resolution is only available at the fovea, which is 9-by-9 pixels, as in [7], or 5-by-5 pixels (about 69% smaller). The first peripheral region is double the size of the fovea, but sub-sampled with period 2 to match the size of the fovea in pixels. Similarly, the second peripheral region is quadruple the size of the fovea but sub-sampled with period 4. For comparison, a typical digit in the MNIST database occupies about 20-by-20 pixels of the image.
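The multi-resolution glimpse described above can be sketched in a few lines. The function and its argument names are our own illustration, not the paper's code; out-of-bounds pixels are read as black (zero), as specified in the text.

```python
import numpy as np

def extract_glimpse(image, cx, cy, f=9):
    """Multi-resolution glimpse (sketch): an f-by-f full-resolution fovea,
    plus two peripheral regions of side 2f and 4f sub-sampled with periods
    2 and 4, so all three patches are f-by-f. Locations outside the image
    boundaries are observed as black (zero) pixels."""
    def crop(size, step):
        half = size // 2
        patch = np.zeros((size, size), dtype=image.dtype)
        for i in range(size):
            for j in range(size):
                y, x = cy - half + i, cx - half + j
                if 0 <= y < image.shape[0] and 0 <= x < image.shape[1]:
                    patch[i, j] = image[y, x]
        return patch[::step, ::step]

    fovea   = crop(f, 1)        # f x f, full resolution
    periph1 = crop(2 * f, 2)    # 2f x 2f, period-2 sub-sampling -> f x f
    periph2 = crop(4 * f, 4)    # 4f x 4f, period-4 sub-sampling -> f x f
    return fovea, periph1, periph2
```

The three patches would then be concatenated into the observation vector o fed to the network; a glimpse placed entirely outside the image simply returns black patches.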
The location of the observation within the image is not available to the AVS (unlike [7]), and movements of the observation location are not constrained to image boundaries. Instead, locations outside the image boundaries are observed as black pixels.

Fig. 1. Artificial Visual System design.

The AVS consists of inputs (observations) projected upon the network via input weights W_in, a neural network consisting of N neurons connected by the recurrent weights W, and two outputs: a classifier y^ID, responsible for identifying the digit after the AVS has explored the image, and the attention model output y^Att, responsible for directing the eye to new locations based on the information represented in the network state (see Figure 1). The output y^ID consists of one neuron for each possible digit. At the end of the trial, the identity of the highest-valued neuron is interpreted as the network's classification.

The progression of a single trial follows these principal stages:
1) A random digit is selected from the MNIST training database.
2) A location across the image is randomly selected.
3) The observation (called a 'glimpse' [7]) from the current location is projected upon the network through W_in, along with any pre-existing information within the network state through the recurrent weights W.
4) The attention model output y^Att is fed back as a saccade of the eye, i.e., as the size of the movement from the current location in the horizontal and vertical axes.
5) If a predefined number of glimpses has passed (or by network decision), the classifier output y^ID is compared to the true label; otherwise the trial returns to stage 3.
6) The AVS is rewarded if the classification was correct, and the next trial begins.

The AVS is implemented by a fully recurrent neural network. Its network topology is similar to the Echo State Network (ESN) [10] in that the recurrent neural connections are drawn randomly and are not constrained to a particular topology, such as in layered feedforward networks or long short-term memory networks. The network state evolves according to

    s_{n+1} = (1 - α) s_n + α ( W tanh(s_n) + W_in o_{n+1} + ζ_n ),
    y_n = W_out tanh(s_n) + ξ_n,                                        (1)

where
- s_n ∈ R^N is the state of the network at time step n, each element representing the state of a single neuron,
- α ∈ (0,1] is the leak rate,
- W ∈ R^{N×N} is the internal connections weight matrix,
- W_in ∈ R^{N×N_in} is the input weight matrix,
- o ∈ R^{N_in} is the observation (network input),
- ζ_n ∈ R^N and ξ_n = (ξ_n^ID; ξ_n^Att) ∈ R^M are independent discrete-time Gaussian white noise processes with independent components, having variances σ_ζ^2 and σ_ξ^2 respectively,
- y_n = (y_n^ID; y_n^Att) ∈ R^M is the state of the M output neurons,
- W_out ∈ R^{M×N} is the output weight matrix (consisting of blocks W_ID, W_Att for the corresponding output components).

The gradient of the expected reward J is estimated as in [11],

    ∇̂J = ⟨ ∇ log p(τ) ( r(τ) - b ) ⟩,

where τ = (s_0, w_1, (o_n, s_n, y_n^Att, y_n^ID)_{n=1}^{N_g}) is a random trajectory of N_g glimpses, p(τ) is the probability of trajectory τ, r(τ) the observed (usually binary) reward, b a fixed baseline computed as in [11], w_1 is the random location of the first glimpse, and ⟨·⟩ indicates averaging over trajectories. Viewed as a partially observable Markov decision process (POMDP), the distributions describing the agent can be written as

    p(s_{n+1} | s_n, o_{n+1}) = α^{-1} p_ζ( α^{-1}( s_{n+1} - (1-α) s_n ) - W tanh(s_n) - W_in o_{n+1} ),
    p(y_n | s_n) = p_ξ( y_n - W_out tanh(s_n) ),                        (2)

where p_ζ, p_ξ are the probability density functions of ζ_n, ξ_n respectively. The POMDP dynamics are deterministic: the glimpse position w_n evolves as w_{n+1} = w_n + y_n^Att. For the AVS (1), the probability density of a trajectory τ is

    p(τ) = p(s_0) p(w_1) ∏_{n=1}^{N_g} p(s_{n+1} | s_n, o_{n+1}) p(y_{n+1} | s_{n+1}).

Here only the output probabilities p(y_{n+1} | s_{n+1}) depend on W_out, and, using (2) and the Gaussian distribution of the noise, we find that the log-likelihood gradient with respect to W_out takes the form

    ∇_{W_out} log p(τ) = Σ_{n=1}^{N_g} σ_ξ^{-2} ξ_n tanh(s_n)^T,

where N_g is the number of glimpses. Stochastic gradient ascent is performed only for the output weights W_out. Recurrent weights are randomly selected, with spectral radius 1, and remain fixed throughout training. The log-likelihood gradient with respect to the internal weight matrix W takes a similar form; however, the recurrent connections were not learned in our simulations.

III. RESULTS

A. Use of memory

Since information accumulated by the neural network over time is mixed into the state of the network, it is not obvious that the potential to extract useful historic information can be exploited within the attention-model solution. Training uses gradient ascent toward local maxima of the estimated expected reward, and may therefore converge to sub-optimal maxima that do not make use of the full potential of the system. In order to test the use of memory by the trained network, two similar AVS were trained on the attention-classification task. In the first AVS, recurrent weights were random, whereas the second AVS was set to 'forget' historic information by setting the recurrent weights matrix to zero.

Fig. 3. Exploitation of memory with 25 px fovea.
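The state update (1) and the policy-gradient estimate for W_out from Section II can be sketched as follows. The dimensions, noise scales, and leak rate are illustrative assumptions, not the paper's actual settings, and the gradient uses the standard Gaussian log-likelihood form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N neurons, an input of three 9x9 glimpse channels,
# and M = 12 outputs (10 classifier neurons + 2 attention neurons).
N, N_in, M = 200, 3 * 81, 12
alpha, sigma_zeta, sigma_xi = 0.5, 0.1, 0.1

# Fixed random recurrent weights, rescaled to spectral radius 1 as in the text.
W = rng.standard_normal((N, N))
W /= np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.standard_normal((N, N_in)) * 0.1
W_out = np.zeros((M, N))            # the only trained weights

def step(s, o):
    """One step of the leaky state update (1) with a noisy linear readout."""
    zeta = rng.normal(0.0, sigma_zeta, N)
    s_next = (1 - alpha) * s + alpha * (W @ np.tanh(s) + W_in @ o + zeta)
    xi = rng.normal(0.0, sigma_xi, M)
    y = W_out @ np.tanh(s_next) + xi
    return s_next, y, xi

def episode_gradient(observations, reward, baseline):
    """REINFORCE-style W_out gradient over one trajectory:
    sum_n xi_n tanh(s_n)^T / sigma_xi^2, scaled by (r - b)."""
    s = np.zeros(N)
    grad_logp = np.zeros_like(W_out)
    for o in observations:
        s, y, xi = step(s, o)
        grad_logp += np.outer(xi, np.tanh(s)) / sigma_xi**2
    return (reward - baseline) * grad_logp
```

A training loop would repeatedly run trials, evaluate the reward against the baseline, and take small stochastic-gradient-ascent steps on W_out only, leaving W fixed.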
Use of memory was found to depend on the size of the fovea. Fig. 2 shows the performance of the system across training epochs for the case of a large (9×9 pixels) fovea. Initially, the AVS with memory has the advantage: the attention model is still poor at this stage, leading to relatively uninformative glimpses, so the use of information from several glimpses results in better classification. However, as the attention mechanism improves, the last glimpse becomes highly informative, so the memoryless network, where information from the last glimpse is not corrupted by memory of previous glimpses, has the advantage. In fact, we found that information from a well-placed glimpse suffices to classify the digit with over 90% success rate in this case, driving the network to a solution of finding a single good glimpse location across the digit and classifying based on that glimpse, without regard to the rest of the trajectory.

Fig. 2. Exploitation of memory with 81 px fovea.

The situation is different with a smaller fovea (5×5 pixels), where classification from a single glimpse becomes harder. As seen in Figure 3, the AVS with memory outperforms the one without memory in the small-fovea case.

B. Gathering Information

The human visual system acts to maximize the information relevant to the task [12]. In order to assess whether our AVS behaves similarly, we have to characterize the relevant information in the context of our task. Since the network classification y^ID depends linearly on the network state in the last time step, we quantify the task-relevant information as the best linear separation of the network state between each class and the other classes. Accordingly, we use Linear Discriminant Analysis (LDA) [13], which finds the projection that minimizes the distance S_w between samples of the same cluster while at the same time maximizing the distance S_b between clusters. The distance within each class is measured by the variance of the samples belonging to that class, and S_w is taken to be the mean of these distances across all classes. The distance between classes, S_b, is defined as the variance of the set of class centers.

We trained an AVS on the attention-classification task with 5 glimpses per digit. After the AVS was trained, it was tested in two cases. In the first case, the system was run as usual and the network state vector was recorded after the last glimpse of each digit in the test set. In the second case, the location of the last glimpse was chosen randomly rather than following the learned attention model. The results are illustrated in Figure 4, where the state of the network is projected on the first two eigenvectors of S_w^{-1} S_b. Separation is significantly better with the full attention model compared to the one with the random last glimpse. We conclude that, at the very least, the attention model acts to maximize task-relevant information in the last glimpse better than a random walk.

Fig. 4. Results of Linear Discriminant Analysis of the AVS state at the last time step. Each dot corresponds to a single trial and represents the projection of the network state on the first two eigenvectors of S_w^{-1} S_b. Dots are colored according to the digit presented to the network. Left: random last glimpse. Right: full attention model.

C. Transfer learning

Biological learning often displays the ability to use a skill learned on a simple task in order to improve learning of a harder yet related task; e.g., proficiency at tennis is beneficial when learning racquetball, and even for seemingly unrelated tasks such as skiing [14]. To test whether transfer learning is possible in the AVS, we trained it to learn the attention model and classification of 3 digits (out of the 10 in the MNIST database).

Fig. 6. Fixed distracting object.
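The separation analysis in Section III-B, projection onto the leading eigenvectors of S_w^{-1} S_b, is a standard Fisher-LDA computation. A minimal sketch follows; the function name is ours, and a small ridge term is added to S_w for numerical stability.

```python
import numpy as np

def lda_projection(states, labels, k=2):
    """Project state vectors onto the top-k eigenvectors of Sw^{-1} Sb,
    following the scatter definitions in the text: Sw is the mean
    within-class scatter and Sb the scatter of the class centers."""
    classes = np.unique(labels)
    mu = states.mean(axis=0)
    d = states.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = states[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += np.cov(Xc, rowvar=False)           # within-class scatter
        Sb += np.outer(mu_c - mu, mu_c - mu)     # between-class scatter
    Sw /= len(classes)
    Sb /= len(classes)
    # Eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return states @ evecs.real[:, order[:k]]
```

Applied to the recorded last-glimpse state vectors, the two returned coordinates correspond to the scatter plots of Figure 4.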
The resulting solution served as an initial condition for learning the full task of classifying all 10 digits. As seen in Fig. 5, not only did the AVS with pre-learned attention learn much faster, it also achieved a better result at the end of training.

Fig. 5. Transferable skill.

D. Ignoring distractions

The eyes are not directed to the most visually salient points in the field of view, but rather to the ones that are most relevant for the task at hand [6]. Accordingly, we introduced a highly salient object into the training images. The object is a square, approximately the size of a digit but with maximum brightness, whereas the digits are handwritten and displayed in grayscale. The object is inserted at a fixed position relative to the digit, always on the right-hand side of the digit. The trained network successfully avoids unnecessary fixations on the salient object. In cases where the first glimpse falls upon an area where both the digit and the object are within the peripheral visual region, the object seems to be completely ignored. Perhaps more interesting is the case where only the object is visible in the first glimpse, within the peripheral view. In such a case, the AVS learned to exploit the fact that the digit is always located to the left of the object, and consistently performs saccades to the left. Thus, not only was the presence of a distracting object not harmful to performance, it was actually beneficial.

In Fig. 6, the colored squares represent the foveal view of 5-by-5 pixels at each time step, going from blue to red. The left and middle images show cases where the first glimpse only observes the distracting object within the peripheral view. The right image shows a case where the first glimpse observes both the distracting object and the digit within the peripheral view.

Next, we tested the network with a distracting object in a random position around the digit. The observed behavior was similar when the first glimpse happened to fall on a location where both the object and the digit are within the peripheral view: the AVS ignored the distraction and directed itself towards the digit. However, in the case where the first glimpse falls on a location where only the distracting object is visible in the peripheral view, the AVS failed to locate the digit.

Fig. 7. Free distracting object.

An example is seen in Fig. 7. The lines are trajectories of the AVS, each starting from a different point on a test grid and followed until the last glimpse (blue/magenta dot). Green lines are trajectories that led to a correct classification, while red lines are trajectories that led to a false classification. When the AVS happens to start at a location where the digit is not seen, it directs its gaze towards the square, which it then chooses to classify as "1", thus earning a 10% expected reward, which is better than nothing.

E. Learning aided by demonstration (guidance)

Learning by demonstration (or learning with guidance) was implemented in the AVS. Demonstration differs from supervision in two key ways. First, demonstration is not continuous: it is applied sparsely in time in order to suggest new trajectories to the system. Second, demonstration is not required to provide the best solution to the system, because the system maintains its freedom to explore and even improve upon it. Demonstration was achieved by providing the network with a sparse and naive suggestion for the attention model. For example, on 10% of trajectories, the system was directed to the center of the digit on the last glimpse. Such partial direction resulted in a significant improvement of both the speed of learning and the final success rate, as can be observed in Figure 8. Demonstration in the AVS system was made possible by manipulating the exploration noise.
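The demonstration mechanism, forcing the attention output to a desired saccade by choosing the exploration noise accordingly, can be sketched directly. Function and variable names below are our own illustration.

```python
import numpy as np

def forced_noise(y_att_target, W_att, s_n):
    """Exploration noise that makes the attention output equal a
    demonstrated saccade: noise = y_att - W_Att tanh(s_n). In the AVS
    this is applied only on a sparse subset of glimpses, so the
    Gaussian-white-noise assumption is not broken in practice."""
    return y_att_target - W_att @ np.tanh(s_n)

def attention_output(W_att, s_n, noise):
    """Noisy attention readout of the network state."""
    return W_att @ np.tanh(s_n) + noise
```

Because the forced value is just one realization of the Gaussian exploration noise, it enters the log-likelihood gradient like any other sample, so the system also learns from the demonstrated step.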
The exploration noise is Gaussian white noise, and as such assigns nonzero probability to any value. Since the output of the system at any given time is a function of that noise, we can force the output to a specific value by setting the exploration noise in that particular time step to ξ̃_n = y_n^Att - W_Att tanh(s_n), where y_n^Att is now the demonstrated output of the attention model and ξ̃_n is the chosen exploration noise that brings the system to the desired output. As long as the demonstration is kept sparse enough, it does not in practice break the assumption that the noise is Gaussian white noise. The noise in the system is an essential part of the log-likelihood gradient, and therefore the system does not merely arrive at the desired output at that particular time step, but also learns from that experience.

Fig. 8. Demonstration.

IV. CONCLUSION

We have shown that a simple artificial visual system, implemented through a recurrent neural network using policy gradient reinforcement learning, can be trained to perform classification of objects that are much larger than its central region of high visual acuity. While receiving only classification-based reward, the system develops an active vision solution which directs attention towards relevant parts of the image in a task-dependent way. Importantly, the internal network memory plays an essential role in maintaining information across saccades, so that the final classification is achieved by combining information from the current visual input and from previous inputs represented in the network state. Within a generic active vision system, without any specifically crafted features, we have been able to explain several features characteristic of biological vision: (i) good classification performance using reinforcement learning based on highly limited central vision and low-resolution peripheral vision, (ii) gathering task-relevant information through active search, (iii) transfer learning, (iv) ignoring task-irrelevant distractors, and (v) learning through guidance. Beyond providing a model for biological vision, our results suggest possible avenues for cost-effective image recognition in artificial vision systems.

The Matlab code will be made available at: https://alonhazan.wordpress.com/

REFERENCES

[1] Alfred L. Yarbus. Eye movements and vision. Neuropsychologia, 6(4):222, 1967.
[2] G. T. Buswell. How People Look at Pictures: A Study of the Psychology of Perception in Art. Chicago: University of Chicago Press, 1935.
[3] Jeff B. Pelz, Roxanne L. Canosa, Diane Kucharczyk, Jason Babcock, Amy Silver, and Daisei Konno. Portable eyetracking: a study of natural eye movements. Proceedings of SPIE, pages 566-582, 2000.
[4] M. F. Land and S. Furneaux. The knowledge base of the oculomotor system. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 352(1358):1231-1239, 1997.
[5] W. Schultz. Multiple reward signals in the brain. Nature Reviews Neuroscience, 1(3):199-207, 2000.
[6] Mary Hayhoe and Dana Ballard. Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4):188-194, 2005.
[7] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. Advances in Neural Information Processing Systems, pages 2204-2212, 2014.
[8] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[9] L. Ungerleider. 'What' and 'where' in the human brain. Current Opinion in Neurobiology, 4(2):157-165, 1994.
[10] H. Jaeger. The "echo state" approach to analysing and training recurrent neural networks. GMD-Forschungszentrum Informationstechnik, Report 148, 2001.
[11] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
[12] John M. Henderson. Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7(11):498-504, 2003.
[13] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, 1990.
[14] Rachael D. Seidler. Neural correlates of motor learning, transfer of learning, and learning to learn. Exercise and Sport Sciences Reviews, 38(1):3-9, 2010.
