Realistic Avatar Eye and Head Animation Using a Neurobiological Model of Visual Attention

L. Itti,1 N. Dhavale1 and F. Pighin2
1 Department of Computer Science, University of Southern California, Los Angeles, California
2 Institute for Creative Technologies, University of Southern California, Los Angeles, California

ABSTRACT

We describe a neurobiological model of visual attention and eye/head movements in primates, and its application to the automatic animation of a realistic virtual human head watching an unconstrained variety of visual inputs. The bottom-up (image-based) attention model is based on the known neurophysiology of visual processing along the occipito-parietal pathway of the primate brain, while the eye/head movement model is derived from recordings in freely behaving Rhesus monkeys. The system is successful at autonomously saccading towards and tracking salient targets in a variety of video clips, including synthetic stimuli, real outdoors scenes and gaming console outputs. The resulting virtual human eye/head animation yields realistic rendering of the simulation results, both suggesting applicability of this approach to avatar animation and reinforcing the plausibility of the neural model.

1. INTRODUCTION

Animating realistically the human face is one of computer graphics' greatest challenges. Realistic facial animation has spurred a great wealth of research, from motion capture to hair animation. Most of the research in facial animation has focused on speech,1–3 while the motions of some of the face components have been neglected, in particular the eyes and the rigid motion of the head. These two motions combined define the gaze behavior of the face. Animals, however, have developed an impressive ability to estimate where another animal's gaze is pointed. Humans, for instance, can determine gaze direction with an accuracy better than 4°.4 Furthermore, gaze estimation is an essential component of non-verbal communication and social interaction, to the point that monitoring another's eye movements is often believed to provide a window onto that other person's mental state and intentions.5 This crucial role of gaze estimation in determining intentionality and in interacting with social partners makes any inaccuracy in its rendering obvious to even the casual observer.

Animating an avatar's gaze is a challenging task. The human eye and its control systems indeed are extremely complex.6,7 Contrary to conventional video cameras which uniformly sample the visual world, the human eye only gathers detailed visual information in a small central region of the visual field, the fovea. Increasingly peripheral locations are sampled at increasingly coarser spatial resolutions, with a complex-logarithmic fall-off of resolution with eccentricity.8 This over-representation of the central visual field, with the fovea only subtending 2-5° of visual angle, may be understood through considerations of information processing complexity.
Indeed, if the entire 160×175° subtended by each retina9 were populated by photoreceptors at foveal density, the optic nerve would comprise on the order of one billion nerve fibers, compared to about one million fibers in humans.9 Rather than growing the optic nerve and subsequent cortical visual processing areas 1,000-fold or more, nature has devised an alternate strategy for coping with visual information overload: the under-representation of the visual periphery is compensated by the presence of highly mobile eyes that can almost instantaneously point the fovea towards any location in the visual world. Given this architecture, developing efficient control strategies for the deployment of gaze onto behaviorally-relevant visual targets has become a problem of pivotal importance to the survival of the organism. These complex strategies are very difficult to simulate accurately.

Recording eye motion is also a difficult task. Optical motion capture systems traditionally used to record facial motions fail to robustly track eye movements. Attempts have been made through the use of contact lenses coated with retroreflective material. However, lenses have a tendency to move on the surface of the eye, thus preventing the recording of accurate motions. Systems specific to recording eye movements typically can accurately track eye motions in a controlled environment. Unfortunately, they tend to be expensive ($20,000 for an Iscan eye tracker) and intrusive, which precludes their use in a wide range of situations. For these reasons we are following a procedural approach. Our animation technique heavily relies on research performed in the field of neuroscience.
In our approach we thus leverage studies conducted in neuroscience to build a procedural gaze animation system driven exogenously by external visual stimuli. How primates, including humans, select where to look next in the presence of various visual stimuli indeed has been one of the most active topics of neuroscience research over the past century. Key concepts have emerged, including that of focal visual attention. Attention complements explicit (so-called "overt") gaze shifts with a rapidly shiftable "virtual spotlight"10 that covertly selects stimuli in turn for detailed processing and access to conscious representation. Thus, understanding how gaze is directed requires first an understanding of how attention may be attracted towards specific visual targets. While introspection easily reveals that attention may be guided voluntarily – or from the "top-down" – (e.g., attending to that attractive mysterious stranger while talking to this tired old friend at a cocktail party), its deployment has also been shown to be strongly influenced from the "bottom-up," by the very contents of the visual input.11–13 For instance, an object thrown at us, or the abrupt onset of a bright light, will often capture attention and elicit an almost reflex-like eye movement towards that object or light. In normal vision, both the very fast (able to scan up to 20 items/s) bottom-up and much slower (limited to about 5 items/s) top-down modes of attentional guidance cooperate, in an attempt to efficiently analyze the incoming visual information in a manner that both serves behavioral priorities and keeps the organism alert to possible unexpected dangers.

Here we explore the applicability of a neurobiological model of visual attention to the automatic realistic generation of eye and head movements given arbitrary video scenes. Technological applications of this model include interactive games, character animation in production movies, human-computer interaction and others. Because the neural basis of top-down attention remains elusive, our model primarily implements a bottom-up guidance of attention towards conspicuous visual targets, augmented by a realistic eye/head movement controller and an eye blink generator. We use this model to animate a photorealistic 3D model of a human face. The system is tested with arbitrary video clips and autonomously predicts the motion of the eyes, head, eyelids, and a number of accompanying deformations of the virtual facial tissue.

2. RELATED WORK

2.1. Modeling Visual Attention

Under the broad umbrella of attention, several key aspects of sensory processing have been defined, such as the anatomical and functional segregation in the monkey brain between localization of salient targets ("where/how" dorsal pathway) and their recognition ("what" ventral pathway).14 The very existence of this segregation has made it a reasonable enterprise to study attention separately from object recognition. There are many other aspects of attention, which we will not consider here, such as its role as an information processing bottleneck,15 its role in binding the different visual attributes of an object, such as color and form, into a unitary percept,11 and its spatially- and feature-selective modulatory effect on low-level visual processing16,17 (we refer the reader to recent reviews for further information13,18).

Development of computational models of attention started with the Feature Integration Theory of Treisman et al.,11 which proposed that only fairly simple visual features are computed in a massively parallel manner over the entire incoming visual scene. Attention is then necessary to bind those early features into a united object representation, and the selected bound representation is the only part of the visual world that passes through the attentional bottleneck.
The first explicit neural architecture for the bottom-up guidance of attention was proposed by Koch and Ullman,12 and is closely related to the feature integration theory. Their model is centered around a saliency map, that is, an explicit two-dimensional topographic map that encodes for stimulus conspicuity, or salience, at every location in the visual scene. The saliency map receives inputs from early visual processing, and provides an efficient centralized control strategy by which the focus of attention simply scans the saliency map in order of decreasing saliency. This general architecture has been further developed by Itti et al.19,20 to yield a full computer implementation, used as the basis for our work. In this model, the early stages of visual processing decompose the visual input through a set of feature-selective filtering processes endowed with contextual modulatory effects. To control a single attentional focus based on this multiplicity of visual representations, feature maps are combined into a unique scalar saliency map. Biasing attention towards the location of highest salience is then achieved by a winner-take-all neural network, which implements a neurally distributed maximum detector. Once attended to, the current location is transiently inhibited in the saliency map by an inhibition-of-return mechanism. Thus, the winner-take-all naturally converges to the next most salient location, and repeating this process generates attentional scanpaths.

Many other models have been proposed, which typically share some of the components of that just described; in particular, a saliency map is central to most of these models21–23 (see 13 for a more in-depth overview). In the present work, we extend the model of Itti et al. (available online in C++ source-code form) from the prediction of covert attentional shifts onto static scenes to that of eye and head movements onto dynamic video clips.

2.2. Modeling Eye and Head Movements

Behavioural studies of alert behaving monkeys and humans have indicated that gaze shifts are accomplished by moving both the eyes and head in the same direction.24,25 Sparks7 provides a comprehensive review on the topic. Of particular interest here, Freedman and Sparks26 specifically investigated saccadic eye movements made by Rhesus monkeys when their head was unrestrained. Remarkably, they found that the relative contribution of the head and eyes towards a given gaze shift follows simple laws. Their model is at the basis of the eye/head saccade subsystem implemented in our model, and is described in detail in a further section.

Video analysis under various conditions has revealed that the normal eye blinking rate is 11.6/min, while in the attentive state this increases slightly to 13.5/min in normal human subjects, with faster rates of 19.1/min and 19.6/min respectively observed in schizophrenic patients.27,28

2.3. Gaze Animation in Avatars

Several approaches have been proposed to endow avatars with realistic-looking eye movements. They primarily fall under three categories, for each of which we describe a representative example system below.

A first approach consists of augmenting virtual agents with eye movements that are essentially random, but whose randomness is carefully matched to the observed distributions of actual eye movements produced by humans. In a recent example of this approach, Lee et al.29 used an eye-tracking device to gather statistical data on the frequency of occurrence and spatiotemporal metric properties of human saccades. Random probability distributions derived from the data were then used to augment the pre-existing behavior of an agent with eye movements. This model makes no attempt to direct eye movements towards specific targets in space, beyond the fact that gaze direction either deterministically follows large head movements or is randomly drawn from the empirically measured distributions. Thus, Lee et al.'s remarkably realistic model addresses a fundamentally different issue from that investigated here: while we focus on automatically determining where the avatar should look, Lee et al. add very realistic-looking random eye movements to an avatar that is already fully behaving.
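To make the statistically driven flavor of this first approach concrete, here is a minimal illustrative sketch (Python) of sampling saccade parameters from fitted probability distributions. The distribution families and parameter values are placeholders chosen for illustration only; they are not the statistics reported by Lee et al.

```python
import random

def sample_random_saccade():
    """Draw one random saccade from illustrative, made-up distributions.

    In a Lee-et-al.-style system these distributions would instead be fit to
    eye-tracking recordings of real human saccades.
    """
    magnitude_deg = random.lognormvariate(1.0, 0.5)    # saccade amplitude (degrees)
    direction_deg = random.uniform(0.0, 360.0)         # isotropic direction
    inter_saccade_s = random.expovariate(1.0 / 0.3)    # waiting time before the next saccade
    return magnitude_deg, direction_deg, inter_saccade_s
```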
A second approach consists of using machine vision to estimate gaze and pose from humans and to use those estimated parameters to drive a virtual agent.30–32 Various machine vision algorithms are used to extract critical parameters from video acquired from one or more cameras pointed at a human. These parameters are then transmitted to the rendering site and used for avatar animation. By definition, this type of approach is only applicable to situations where the avatars mimic or embody humans, such as videoconferencing, spatially-distributed online gaming and human-computer interfacing. This substantially differs from our goal here, which is to autonomously animate agents.

A last approach consists of using machine vision to attempt to locate targets of interest in virtual or real scenes.33–35 For instance, Terzopoulos and Rabie36 have proposed an active vision system for animats behaving in virtual environments, including artificial fish and humans. The animats are equipped with multiscale virtual cameras, providing both foveal and wide-angle visual input. Selection of targets for eye movements is based on a color signature matching algorithm, in which objects known from their color histogram are localized in the image through backprojection. A set of behavioral routines evaluates sensory inputs against current mental state and behavioral goals, and decides on a course of motor action, possibly including saccadic eye movements to detected objects. Temporal dynamics and sharing between head and eye movements are not described in detail. This system presents limitations, in particular in its sensory abilities (e.g., it will not saccade to novel objects of unknown color signature, nor prioritize at the sensory level among multiple objects). While this approach is thus dedicated to a set of known objects with distinct color signatures present in a relatively uncluttered environment such as deep oceans, in its gross architecture it is particularly relevant to the model presented here.

Our approach directly follows the third category described here, using machine vision to determine where to look next. Our more detailed biologically-inspired modeling of the attention and eye movement subsystems allows us to extend this basic paradigm to a wide repertoire of combined eye/head movements in unconstrained environments, containing arbitrary numbers of known or unknown targets against arbitrary amounts of known or unknown clutter.

3. NEUROBIOLOGICAL MODEL OF ATTENTION

3.1. Foveation and Retinal Filtering

Our simulations so far have used pre-recorded video clips as visual inputs, either filmed using a video camera, or captured from the video output of gaming consoles. Interlaced video was digitized using a consumer-electronics framegrabber board (WinTV Go, Hauppauge Inc.) at a resolution of 640×480 pixels and 30 frames/s.

Although using pre-recorded video may suggest open-loop processing (i.e., eye movements did not influence camera viewpoint), our simulations actually operated in a spatially-limited closed loop due to the first step of processing: a gaze-contingent and eccentricity-dependent blur aimed at coarsely reproducing the non-uniform distribution of photoreceptors on the human retina. Efficient implementation of the foveation filter was achieved through interpolation across levels of a Gaussian pyramid37 computed from each input frame. To determine which scale to use at any image location, we computed a 3/4-chamfer distance map,38 encoding at every pixel an approximation to the Euclidean distance between that pixel's location and a disc of 5° diameter centered at the model's current eye position. Thus, pixels close to the fovea were interpolated from high-resolution scales in the pyramid, while more eccentric pixels were interpolated from coarser-resolution scales. Therefore, the raw video clip might be considered as a wide-field environmental view, with the foveation filter isolating a portion of that environment for input to the model. This gaze-contingent foveation ensured more realistic simulations, in which small (or high spatial frequency) visual stimuli far from fixation were unlikely to strongly attract attention.
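As an illustration of this gaze-contingent blur, the following is a minimal Python sketch (not the authors' C++ implementation): it builds a Gaussian pyramid of the frame and, for each pixel, interpolates between the two pyramid levels selected by distance from the current eye position. A plain Euclidean distance stands in for the paper's 3/4-chamfer distance map, and the foveal radius and distance-to-level mapping are illustrative placeholders.

```python
import numpy as np
import cv2

def foveate(frame, eye_xy, fovea_radius_px=32, levels=5):
    """Gaze-contingent, eccentricity-dependent blur of an H x W x 3 frame."""
    h, w = frame.shape[:2]

    # Gaussian pyramid; every level is upsampled back to full frame size so
    # that blending across levels is a simple per-pixel weighted sum.
    pyr = [frame.astype(np.float32)]
    for _ in range(1, levels):
        pyr.append(cv2.pyrDown(pyr[-1]))
    pyr_full = [cv2.resize(p, (w, h), interpolation=cv2.INTER_LINEAR) for p in pyr]

    # Distance of every pixel from the foveal disc around the eye position.
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - eye_xy[0], ys - eye_xy[1])
    dist = np.maximum(dist - fovea_radius_px, 0.0)

    # Map distance to a fractional pyramid level (0 = sharpest, levels-1 = coarsest).
    level = np.clip(dist / (0.25 * max(h, w)) * (levels - 1), 0, levels - 1)
    lo = np.floor(level).astype(int)
    hi = np.minimum(lo + 1, levels - 1)
    frac = (level - lo)[..., None]

    stack = np.stack(pyr_full)                      # shape (levels, h, w, 3)
    out = (1 - frac) * stack[lo, ys, xs] + frac * stack[hi, ys, xs]
    return out.astype(frame.dtype)
```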
3.2. Bottom-Up Guidance of Attention

We developed a computational model that predicts the spatiotemporal deployment of gaze onto any incoming visual scene (Fig. 1). Our implementation started with the model of Itti et al.,19,20 which is freely available. In addition to the foveation filter just described, our extensions to this model include the addition of a flicker feature that detects temporal change (e.g., onset and offset of lights), of a motion feature that detects objects moving in specific directions, and a number of extensions, described below, to use the model with video rather than static inputs. Additional extensions concern the generation of eye and head movements from the covert attention output of the model, and are detailed in the following section. In what follows, we briefly describe the extended implementation used in our simulations.

3.2.1. Low-Level Feature Extraction

Given a foveated input frame, the first processing step consists of decomposing it into a set of distinct "channels," using linear filters tuned to specific stimulus dimensions. The response properties of the filters were chosen according to what is known of their neuronal equivalents in primates. This decomposition is performed at nine spatial scales, to allow the model to represent smaller and larger objects in separate subdivisions of the channels. The scales are created using Gaussian pyramids37 with 9 levels (1:1 to 1:256 scaling).

With r_n, g_n and b_n being the red, green and blue channels of input frame n, an intensity image I_n is obtained as I_n = (r_n + g_n + b_n)/3. I_n is used to create a Gaussian pyramid I_n(σ), where σ ∈ [0..8] is the scale. The r_n, g_n and b_n channels are subsequently normalized by I_n to decouple hue from intensity. Four broadly-tuned color channels are then created: R_n = r_n − (g_n + b_n)/2 for red, G_n = g_n − (r_n + b_n)/2 for green, B_n = b_n − (r_n + g_n)/2 for blue, and Y_n = r_n + g_n − 2(|r_n − g_n| + b_n) for yellow (negative values are set to zero). By construction, each color channel yields maximal response to the pure hue to which it is tuned, and zero response both to black and to white. Four Gaussian pyramids R_n(σ), G_n(σ), B_n(σ) and Y_n(σ) are then created.

Local orientation information is obtained from I_n using oriented Gabor pyramids O_n(σ,θ), where σ ∈ [0..8] represents the scale and θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation. (Gabor filters, which are the product of a cosine grating and a 2D Gaussian envelope, approximate the receptive field sensitivity profile of orientation-selective neurons in primary visual cortex.39) The fast implementation proposed by Greenspan et al. is used in our model.40

Flicker is computed from the absolute difference between the luminance I_n of the current frame and that I_{n−1} of the previous frame, yielding a flicker pyramid F_n(σ).
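A minimal sketch (Python) of the intensity, color and flicker channels defined above; pyramid construction is omitted, and each returned map would normally seed one nine-level Gaussian pyramid. The darkness threshold guarding the hue normalization is an assumption rather than a value from the paper.

```python
import numpy as np

def decompose_frame(frame, prev_intensity=None):
    """Compute I, R, G, B, Y and flicker for an H x W x 3 frame scaled to [0, 1]."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]

    intensity = (r + g + b) / 3.0                     # I_n = (r_n + g_n + b_n) / 3

    # Normalize r, g, b by intensity to decouple hue from intensity,
    # skipping near-black pixels where hue is meaningless (assumed threshold).
    safe = np.where(intensity > 0.1, intensity, 1.0)
    r, g, b = r / safe, g / safe, b / safe

    # Broadly tuned color channels; negative values are rectified to zero.
    R = np.maximum(r - (g + b) / 2.0, 0.0)
    G = np.maximum(g - (r + b) / 2.0, 0.0)
    B = np.maximum(b - (r + g) / 2.0, 0.0)
    Y = np.maximum(r + g - 2.0 * (np.abs(r - g) + b), 0.0)

    # Flicker: absolute luminance difference with the previous frame.
    flicker = np.zeros_like(intensity) if prev_intensity is None \
        else np.abs(intensity - prev_intensity)

    return intensity, R, G, B, Y, flicker
```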
Finally, motion is computed from spatially-shifted differences between Gabor pyramids from the current and previous frames.41 The same four Gabor orientations as in the orientation channel are used, and only shifts of one pixel orthogonal to the Gabor orientation are considered, yielding one shifted pyramid S_n(σ,θ) for each Gabor pyramid O_n(σ,θ). Because of the pyramidal representation, this captures a wide range of object velocities (since a 1-pixel shift at scale 8 corresponds to a 256-pixel shift at base scale 0). The Reichardt model for motion computation in the fly brain42 is then used to compute a motion pyramid R_n(σ,θ):

R_n(σ,θ) = |O_n(σ,θ) ∗ S_{n−1}(σ,θ) − O_{n−1}(σ,θ) ∗ S_n(σ,θ)|    (1)

where ∗ denotes a pointwise product. Values smaller than 3.0 are set to zero to reduce output noise. A fast implementation (used here) uses the intensity rather than orientation pyramid as input (at the cost of losing selectivity for orientation).

Figure 1. Overview of the model. Input video is processed by a foveation filter followed by low-level feature extraction mechanisms. All resulting feature maps contribute to the unique saliency map, whose maximum is where the model's attention is pointed to. The resulting covert attention shifts drive the eye/head movement controller and realistic facial animation.

Overall, these filters are very crude and yield many false detections, not unlike biological neurons. As we will see below, however, the center-surround and competition for salience applied to their outputs will discard these artifacts.

3.2.2. Center-Surround Receptive Field Profiles

Each feature is computed in a center-surround structure akin to visual receptive fields. This renders the system sensitive to local feature contrast rather than raw feature amplitude.

Center-surround operations are implemented as differences between a fine and a coarse scale for a given feature: the center of the receptive field corresponds to a pixel at level c ∈ {2,3,4} in the pyramid, and the surround to the corresponding pixel at level s = c + δ, with δ ∈ {3,4}. We hence compute six feature maps for each type of feature (at scales 2-5, 2-6, 3-6, 3-7, 4-7, 4-8). Across-scale map subtraction, denoted ⊖, is obtained by interpolation to the finer scale and pointwise subtraction. Nine types of features are computed: on/off intensity contrast,39 red/green and blue/yellow double-opponency,43 four orientation contrasts,44 flicker contrast,45 and four motion contrasts42:

Intensity:    \mathcal{I}_n(c,s) = |I_n(c) ⊖ I_n(s)|    (2)
R/G:          \mathcal{RG}_n(c,s) = |(R_n(c) − G_n(c)) ⊖ (R_n(s) − G_n(s))|    (3)
B/Y:          \mathcal{BY}_n(c,s) = |(B_n(c) − Y_n(c)) ⊖ (B_n(s) − Y_n(s))|    (4)
Orientation:  \mathcal{O}_n(c,s,θ) = |O_n(c,θ) ⊖ O_n(s,θ)|    (5)
Flicker:      \mathcal{F}_n(c,s) = |F_n(c) ⊖ F_n(s)|    (6)
Motion:       \mathcal{M}_n(c,s,θ) = |R_n(c,θ) ⊖ R_n(s,θ)|    (7)

In total, 72 feature maps are computed: six for intensity, 12 for color, 24 for orientation, 6 for flicker and 24 for motion.
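A minimal sketch (Python) of the across-scale subtraction ⊖ behind Eqs. (2)-(7), assuming each feature pyramid is stored as a list of arrays ordered from fine (level 0) to coarse; the surround level is upsampled to the center level before the pointwise absolute difference is taken.

```python
import numpy as np
import cv2

def center_surround(pyramid, c, delta):
    """|center ⊖ surround| for one feature pyramid, center level c, surround level c + delta."""
    center = pyramid[c]
    surround = pyramid[c + delta]
    h, w = center.shape[:2]
    surround_up = cv2.resize(surround, (w, h), interpolation=cv2.INTER_LINEAR)
    return np.abs(center - surround_up)

def feature_maps(pyramid):
    """The six (c, s) combinations used for every feature type: 2-5, 2-6, 3-6, 3-7, 4-7, 4-8."""
    return [center_surround(pyramid, c, delta) for c in (2, 3, 4) for delta in (3, 4)]
```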
3.2.3. Combining Information Across Multiple Maps

At each spatial location, activity from the 72 feature maps drives a single scalar measure of salience. One major difficulty is to combine a priori not comparable stimulus dimensions (e.g., how much should orientation weigh against color?). The crudeness of the low-level filters precludes a simple summation of all feature maps into the saliency map, as the sum would typically be dominated by noise.20

To solve this problem, we use a simple within-feature spatial competition scheme, directly inspired by physiological and psychological studies of long-range cortico-cortical connections in V1. These connections, which can span up to 6-8 mm, are thought to mediate "non-classical" response modulation by stimuli outside the cell's receptive field, and are made by axonal arbors of excitatory (pyramidal) neurons in cortical layers III and V.46,47 Non-classical interactions result from a complex balance of excitation and inhibition between neighboring neurons, as shown by electrophysiology,48,49 optical imaging,50 and human psychophysics.51

Our model reproduces three widely observed characteristics of those interactions. First, interactions between a center location and its non-classical surround appear to be dominated by an inhibitory component from the surround to the center,52 although this effect is dependent on the relative contrast between center and surround.49 Hence our model focuses on non-classical surround inhibition. Second, inhibition is strongest from neurons tuned to the same stimulus properties as the center.47,48 As a consequence, our model implements interactions within each individual feature map rather than between maps. Third, inhibition appears strongest at a particular distance from the center,51 and weakens both with shorter and longer distances. These three remarks suggest that the structure of non-classical interactions can be coarsely modeled by a two-dimensional difference-of-Gaussians (DoG) connection pattern.

Our implementation is as follows. Each feature map \mathcal{X} is first normalized to a fixed dynamic range, in order to eliminate feature-dependent amplitude differences due to different feature extraction mechanisms. \mathcal{X} is then iteratively convolved by a large 2D DoG filter, the original map is added to the result, and negative values are set to zero after each iteration. The DoG filter yields strong local excitation, counteracted by broad inhibition from neighboring locations:

\mathcal{X} ← |\mathcal{X} + \mathcal{X} ∗ DoG − C_inh|_{≥0}    (8)

where ∗ denotes convolution, DoG is the 2D difference-of-Gaussians filter, |·|_{≥0} discards negative values, and C_inh is a constant inhibitory term (C_inh = 0.02 in our implementation, with the map initially scaled between 0 and 1). C_inh introduces a small bias towards slowly suppressing areas in which the excitation and inhibition balance almost exactly (extended uniform textures). Each feature map is subjected to 10 iterations of the process described in Eq. 8, denoted by the operator N(·) in what follows. This non-linear stage following linear filtering is one of the critical components in ensuring that the saliency map is not dominated by noise even when the input contains very strong variations in individual features.
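A minimal sketch (Python) of the iterative within-feature competition of Eq. (8), approximating convolution with the large 2D DoG as the difference between a narrow and a broad Gaussian blur; the filter widths and weights are illustrative placeholders, not the values used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_map(feature_map, iterations=10, c_inh=0.02,
                  sigma_exc=2.0, sigma_inh=25.0, w_exc=0.5, w_inh=1.5):
    """Operator N(.): scale the map to [0, 1], then iterate Eq. (8)."""
    m = feature_map.astype(np.float64)
    if m.max() > 0:
        m /= m.max()                                   # fixed dynamic range

    for _ in range(iterations):
        # Narrow self-excitation minus broad surround inhibition (DoG approximation),
        # added back to the map, shifted by the constant inhibitory term, rectified.
        excitation = w_exc * gaussian_filter(m, sigma_exc)
        inhibition = w_inh * gaussian_filter(m, sigma_inh)
        m = np.maximum(m + excitation - inhibition - c_inh, 0.0)
    return m
```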
3.2.4. The Saliency Map

After normalization by N(·), the feature maps for each feature channel are summed across scales into five separate "conspicuity maps," at the scale (σ = 4) of the saliency map. They are obtained through an across-scale map addition, ⊕, which consists of reduction to scale 4 and pointwise addition. For orientation and motion, four intermediary maps are first created by combining the six feature maps for a given θ, then combined into a single conspicuity map:

Intensity:    \bar{\mathcal{I}}_n = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(\mathcal{I}_n(c,s))    (9)
Color:        \bar{\mathcal{C}}_n = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} [N(\mathcal{RG}_n(c,s)) + N(\mathcal{BY}_n(c,s))]    (10)
Orientation:  \bar{\mathcal{O}}_n = Σ_θ N( ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(\mathcal{O}_n(c,s,θ)) )    (11)
Flicker:      \bar{\mathcal{F}}_n = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(\mathcal{F}_n(c,s))    (12)
Motion:       \bar{\mathcal{M}}_n = Σ_θ N( ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(\mathcal{M}_n(c,s,θ)) )    (13)

The motivation for the creation of five separate channels and their individual normalization is the hypothesis that similar features compete strongly for salience, while different modalities contribute independently to the saliency map. The conspicuity maps are normalized and summed into the final input \mathcal{S} to the saliency map:

\mathcal{S} = (1/3) [N(\bar{\mathcal{I}}) + N(\bar{\mathcal{C}}) + N(\bar{\mathcal{O}}) + N(\bar{\mathcal{F}}) + N(\bar{\mathcal{M}})]    (14)

At any given time, the maximum of the saliency map corresponds to the most salient stimulus, to which attention should be directed next. This maximum is selected by a winner-take-all neural network,12 which is a biological implementation of a maximum detector.

In the absence of any further control mechanism, the system described so far would, in the case of a static scene, constantly direct its attention to one location, since the same winner would always be selected. To avoid this undesirable behavior, Itti et al. establish an inhibitory feedback from the winner-take-all (WTA) array to the saliency map (inhibition-of-return53). For our purposes in evaluating dynamic scenes, however, constantly attending to the most salient location in the incoming video is appropriate (but see Discussion for possible improvements), especially since there is little evidence for inhibition-of-return across eye movements in either humans or monkeys.54 Thus, inhibition-of-return is turned off in our model.

In this work, we only make a very preliminary first pass at integrating task contingencies into this otherwise task-independent model. To this end, we follow the approach recently proposed by Navalpakkam and Itti55 and add a "task-relevance map" (TRM) between the saliency map and the winner-take-all. The TRM acts as a simple pointwise multiplicative filter, making the model more likely to attend to regions of the visual field that have higher task relevance. The product between the saliency and task-relevance maps yields the "attention guidance map" that drives the WTA. While Navalpakkam and Itti have proposed a conceptual framework for populating the TRM based on sequential attention and recognition of objects and their interrelationships, here we set a fairly simplistic task for our model: to give preference to image locations that are changing over time. Thus, the TRM in our model is just the inverse difference between the current frame and a sliding average of previous frames.

4. GENERATION OF EYE AND HEAD MOVEMENTS

The covert attention shifts generated by the bottom-up attention model provide inputs to the eye/head movement controller. Since covert attention may shift much more rapidly than the eyes, some spatiotemporal filtering is applied to the sequence of covert shifts to decide on the next saccade target. In our model, a saccade is elicited if the last 4 covert shifts have been within approximately 10° of each other and at least 7.5° away from the current overt fixation.
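A minimal sketch (Python) of this spatiotemporal filtering, with covert targets expressed directly in degrees of visual angle; whether the 7.5° criterion is applied to every recent shift or only to the latest one is an implementation choice made here for illustration.

```python
import math
from collections import deque

class SaccadeTrigger:
    """Propose a saccade once recent covert shifts cluster away from the current fixation."""

    def __init__(self, cluster_deg=10.0, min_shift_deg=7.5, history=4):
        self.cluster_deg = cluster_deg
        self.min_shift_deg = min_shift_deg
        self.targets = deque(maxlen=history)

    @staticmethod
    def _dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def update(self, covert_target, overt_fixation):
        """Feed one covert attention shift; return a saccade goal or None."""
        self.targets.append(covert_target)
        if len(self.targets) < self.targets.maxlen:
            return None
        # Last `history` covert shifts must lie within cluster_deg of each other...
        clustered = all(self._dist(p, q) <= self.cluster_deg
                        for p in self.targets for q in self.targets)
        # ...and sufficiently far from the current overt fixation.
        far_enough = all(self._dist(p, overt_fixation) >= self.min_shift_deg
                         for p in self.targets)
        return covert_target if (clustered and far_enough) else None
```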
4.1. Gaze Decomposition

When the head is unrestrained, large amplitude gaze shifts involve coordinated movements of the eyes and head.24,25 Interestingly, the total displacements of the eyes and head during free-head gaze shifts are well summarized by simple mathematical relationships. It is thus possible to fairly accurately predict eye/head contributions to a given gaze shift, if the initial eye position and desired gaze shift are known.26,56 The effect of initial eye position on head/eye movement depends upon the direction of the subsequent eye movement.

For gaze shifts G smaller than a threshold value T, head displacement H is zero. T is given by the following expression, with IEP denoting the initial eye position relative to the head and all angular displacements in degrees (the sign of IEP is positive if the eyes are initially deviated in the direction of the subsequent movement, negative otherwise):

H = 0 if −T < G < T, with T = (−IEP/2 + 20) × 0.56    (15)

For gaze amplitudes outside this zone, total head movement amplitude H and gaze shift are linearly related such that:

H = (1 − k) × T + k × G, with k = (IEP/35 + 1.1) × 0.65    (16)

Note that this computation is carried out separately for the horizontal and vertical components of the gaze shifts, rather than for the full two-dimensional shift.

4.2. Saccade Dynamics

The dynamics of combined eye-head shifts have been studied extensively.57–59 The maximum head and eye velocities (Hv and Ev below) are a function of the head and eye movement amplitudes and are given by60 [p. 551, Fig. 8]:

Ev = 473 (1 − e^{−E/7.8}) °/s  and  Hv = 13 H + 16 °/s    (17)

In our implementation, a quadratic velocity profile is computed given the maximum velocity computed above.

In the primate brain, visual input is inhibited during saccades, probably to prevent perception of large motion transients as the eyes move.61 In our model, the best place to implement such saccadic suppression is the saliency map; thus, during saccades, the saliency map is entirely inhibited. This has three effects: attention is prevented from shifting since there is no more activity in the saliency map; it will take on the order of 200 ms for the saliency map to recharge, thus enforcing some intersaccadic latency; and all memory of previous salient visual stimuli will be lost.

Figure 2. Demonstration of 1D eye/head movement dynamics using a simple synthetic input clip (top: saccade-only model; middle: saccade and smooth pursuit model; bottom: input frames; eye positions are relative to the head). A green disc appears at the center (yielding steady fixation), abruptly jumps rightwards (yielding a rightward eye/head saccade), drifts leftwards (yielding a series of saccades or a smooth pursuit), and finally stops (yielding a final fixation period).

4.3. Smooth Pursuit

The model of Freedman does not include another mode of eye movements that exists in humans and a few other animals, including non-human primates, by which the eyes can accurately track a slowly moving target using much slower, non-saccadic, smooth eye and head movements.7 A very crude approximation of this so-called "smooth pursuit" mode was added to our model, using two simple mass/spring physical models (one for the head and one for the eye): if a gaze shift is too small to trigger a saccade, it instead becomes the new anchor point of a spring of length zero linked on its other end to the eye (or head). The eye/head will thus be attracted towards that new anchor point, and the addition of fluid friction ensures smooth motion. In our implementation, the head spring is five times weaker than the eye spring, and the head friction coefficient is ten times stronger than the eye's, ensuring slower head motion (Figure 2).
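A minimal sketch (Python) of the zero-length spring with fluid friction used for pursuit, integrated per axis with explicit Euler steps; the spring and friction constants are placeholders chosen only to respect the 5x weaker head spring and 10x stronger head friction described above.

```python
def pursuit_step(pos, vel, anchor, dt, stiffness, friction):
    """One Euler step of a zero-length spring with fluid friction (per axis, in degrees)."""
    accel = stiffness * (anchor - pos) - friction * vel
    vel += accel * dt
    pos += vel * dt
    return pos, vel

# Example: eye and head both pursue a target drifting at 2 deg/s for 3 s.
eye_pos = eye_vel = head_pos = head_vel = 0.0
EYE_K, EYE_F = 50.0, 8.0                    # made-up eye spring/friction constants
HEAD_K, HEAD_F = EYE_K / 5.0, EYE_F * 10.0  # weaker spring, stronger friction for the head

for step in range(300):                     # dt = 0.01 s per step
    target = 0.02 * step
    eye_pos, eye_vel = pursuit_step(eye_pos, eye_vel, target, 0.01, EYE_K, EYE_F)
    head_pos, head_vel = pursuit_step(head_pos, head_vel, target, 0.01, HEAD_K, HEAD_F)
```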
4.4. Eye Blinks

A final addition to our model is a crude heuristic eye blink behavior. A blink is initiated during a given simulation time step (0.1 ms, i.e., 10,000 fps) if at least 1 s has passed since the last blink or saccade, and any of the following holds:

• with a uniform probability of 10% at every time step if it has been 3 s or more since the last blink or saccade, or,
• with a probability of 0.1% if it has been 2 s or more, or,
• with a probability of 0.01% if the last 3 attended locations have been close to each other (i.e., stable object), or,
• with a probability of 0.005%.

Blinks last for a fixed 150 ms, during which only the winner-take-all (WTA) is fully inhibited, so that covert attention is prevented from shifting. Thus, it is assumed here that blinks yield a shorter, weaker inhibition than saccadic suppression. Indeed, the WTA will recover very rapidly from inhibition (10 ms or less), so that a blink occurring during smooth pursuit will only minimally disturb pursuit. In contrast, testing with inhibiting the saliency map, as was done for saccadic suppression, disrupted tracking (pursuit paused after the blink while the saliency map recharged, and a catching-up saccade occurred once covert attention shifts started again). Casual experiments in our laboratory suggest that blinks do not disrupt tracking behavior in humans.

5. ANIMATION AND RENDERING

In order to convincingly illustrate our eye motion synthesis technique, we map the predicted eye/head movements onto a realistic animatable face model. The face model used in our experiments is a digital replica of an actor's face. We used the technique developed by Pighin et al.62 to estimate the geometry of an actor's face from three images. We also extracted a texture map from the images. The eyes were modeled as perfect spheres and textured using the photographs.

The model is controlled through four degrees of freedom: two for the eyes and two for the head. The head's orientation can be changed according to two Euler angles. These two angles control the elevation and azimuth of the whole head. The rotation center was estimated from facial motion capture data recorded on the same actor. A headband used as a motion stabilizer provided the rigid motion of the face. We averaged the measured position of the center of rotation over time to provide an estimate of the model's center of rotation. The eyes have similar controls, and rotate around their geometric center independently of the head's orientation.

We convert screen coordinates into Euler angles by assuming that the processed images correspond to a field of view of 90°. Also, the full head is considered the same size as the screen, and the head's center of rotation is at the same altitude as the center of the screen. Since we do not know the depth of the objects in the videos, we assume that the eyes are focused at infinity and point in the same direction.

To make our face behave in a more natural way, we built a mechanism to animate the eyebrows as a function of the direction of the eyes. This mechanism relies on the blending of two shapes. The first one has the eyebrows level, while the second one has raised eyebrows. During the animation we blend these two shapes according to the orientation of the eyes. Upward pointing eyes activate the raised eyebrow blend shape according to a linear ramp (Figure 3).
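A minimal sketch (Python) of the screen-to-Euler-angle conversion and the eyebrow ramp described above, under the stated 90° field-of-view and eyes-at-infinity assumptions; splitting each angle between eye and head (Eqs. 15-16) is omitted, and the 30° saturation point of the eyebrow ramp is a placeholder.

```python
import math

def gaze_to_angles(target_px, frame_size=(640, 480), fov_deg=90.0):
    """Convert a gaze target in screen pixels into azimuth/elevation angles (degrees)."""
    w, h = frame_size
    # Eye-to-screen distance (in pixels) implied by the horizontal field of view.
    focal_px = (w / 2.0) / math.tan(math.radians(fov_deg / 2.0))
    dx = target_px[0] - w / 2.0
    dy = h / 2.0 - target_px[1]               # screen y grows downward
    azimuth = math.degrees(math.atan2(dx, focal_px))
    elevation = math.degrees(math.atan2(dy, focal_px))
    return azimuth, elevation

def eyebrow_weight(eye_elevation_deg, max_up_deg=30.0):
    """Linear ramp: upward-pointing eyes activate the raised-eyebrow blend shape."""
    return min(max(eye_elevation_deg / max_up_deg, 0.0), 1.0)
```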
