PupilNet: Convolutional Neural Networks for Robust Pupil Detection

Wolfgang Fuhl (Perception Engineering Group, University of Tübingen, [email protected])
Thiago Santini (Perception Engineering Group, University of Tübingen, [email protected])
Gjergji Kasneci (SCHUFA Holding AG, [email protected])
Enkelejda Kasneci (Perception Engineering Group, University of Tübingen, [email protected])

Abstract

Real-time, accurate, and robust pupil detection is an essential prerequisite for pervasive video-based eye-tracking. However, automated pupil detection in real-world scenarios has proven to be an intricate challenge due to fast illumination changes, pupil occlusion, non-centered and off-axis eye recording, and physiological eye characteristics. In this paper, we propose and evaluate a method based on a novel dual convolutional neural network pipeline. In its first stage, the pipeline performs coarse pupil position identification using a convolutional neural network and subregions from a downscaled input image to decrease computational costs. Using subregions derived from a small window around the initial pupil position estimate, the second pipeline stage employs another convolutional neural network to refine this position, resulting in an increased pupil detection rate of up to 25% in comparison with the best performing state-of-the-art algorithm. Annotated data sets can be made available upon request.

1. Introduction

For over a century now, the observation and measurement of eye movements have been employed to gain a comprehensive understanding of how the human oculomotor and visual perception systems work, providing key insights about cognitive processes and behavior [32]. Eye-tracking devices are rather modern tools for the observation of eye movements. In its early stages, eye tracking was restricted to static activities, such as reading and image perception [33], due to restrictions imposed by the eye-tracking system – e.g., size, weight, cable connections, and restrictions on the subject itself. With recent developments in video-based eye-tracking technology, eye tracking has become an important instrument for cognitive behavior studies in many areas, ranging from real-time and complex applications (e.g., driving assistance based on eye-tracking input [11] and gaze-based interaction [31]) to less demanding use cases, such as usability analysis for web pages [3]. Moreover, the future seems to hold promises of pervasive and unobtrusive video-based eye tracking [14], enabling research and applications previously only imagined.

While video-based eye tracking has been shown to perform satisfactorily under laboratory conditions, many studies report difficulties and low pupil detection rates when these eye trackers are employed for tasks in natural environments, for instance driving [11, 21, 30] and shopping [13]. The main source of noise in such realistic scenarios is an unreliable pupil signal, mostly related to intricate challenges in image-based pupil detection. A variety of difficulties occurring when using such eye trackers, such as changing illumination, motion blur, and pupil occlusion due to eyelashes, are summarized in [28]. Rapidly changing illumination conditions arise primarily in tasks where the subject moves fast (e.g., while driving) or rotates relative to unequally distributed light sources, while motion blur can be caused by the image sensor capturing images during fast eye movements such as saccades. Furthermore, eyewear (e.g., spectacles and contact lenses) can result in substantial and varied forms of reflections (Figure 1a and Figure 1b), and a non-centered or off-axis eye position relative to the eye tracker can lead to pupil detection problems, e.g., when the pupil is surrounded by a dark region (Figure 1c).
Other difficulties are often posed by physiological eye characteristics, which may interfere with detection algorithms (Figure 1d). As a consequence, the data collected in such studies must be post-processed manually, which is a laborious and time-consuming procedure. Additionally, this post-processing is impossible for real-time applications that rely on pupil monitoring (e.g., driving or surgery assistance). Therefore, real-time, accurate, and robust pupil detection is an essential prerequisite for pervasive video-based eye-tracking.

Figure 1. Images of typical pupil detection challenges in real-world scenarios: (a) and (b) reflections, (c) pupil located in dark area, and (d) unexpected physiological structures.

State-of-the-art pupil detection methods range from relatively simple methods, such as combining thresholding and mass center estimation [25], to more elaborate methods that attempt to identify the presence of reflections in the eye image and apply pupil-detection methods specifically tailored to handle such challenges [7] – a comprehensive review is given in Section 2. Despite substantial improvements over earlier methods in real-world scenarios, these current algorithms still present unsatisfactory detection rates in many important realistic use cases (as low as 34% [7]). However, in this work we show that carefully designed and trained convolutional neural networks (CNNs) [4, 17], which rely on statistical learning rather than hand-crafted heuristics, are a substantial step forward in the field of automated pupil detection. CNNs have been shown to reach human-level performance on a multitude of pattern recognition tasks (e.g., digit recognition [2], image classification [16]).
These networks attempt to emulate the behavior of the visual processing system and were designed based on insights from visual perception research.

We propose a dual convolutional neural network pipeline for image-based pupil detection. The first pipeline stage employs a shallow CNN on subregions of a downscaled version of the input image to quickly infer a coarse estimate of the pupil location. This coarse estimate allows the second stage to consider only a small region of the original image, thus mitigating the impact of noise and decreasing computational costs. The second pipeline stage then samples a small window around the coarse position estimate and refines the initial estimate by evaluating subregions derived from this window using a second CNN. We have focused on robust learning strategies (batch learning) instead of more accurate ones (stochastic gradient descent) [18] due to the fact that an adaptive approach has to handle noise (e.g., illumination, occlusion, interference) effectively.

The motivation behind the proposed pipeline is (i) to reduce the noise in the coarse estimation of the pupil position, (ii) to reliably detect the exact pupil position from the initial estimate, and (iii) to provide an efficient method that can be run in real-time on hardware architectures without an accessible GPU.

In addition, we propose a method for generating training data in an online fashion, thus being applicable to the task of pupil center detection in online scenarios. We evaluated the performance of different CNN configurations both in terms of quality and efficiency, and we report considerable improvements over state-of-the-art techniques.

2. Related Work

During the last two decades, several algorithms have addressed image-based pupil detection. Pérez et al. [25] first threshold the image and compute the mass center of the resulting dark pixels. This process is iteratively repeated in an area around the previously estimated mass center to determine a new mass center until convergence. The Starburst algorithm, proposed by Li et al. [19], first removes the corneal reflection and then locates pupil edge points using an iterative feature-based approach. Based on the RANSAC algorithm [6], a best fitting ellipse is then determined, and the final ellipse parameters are obtained by applying a model-based optimization. Long et al. [22] first downsample the image and search it for an approximate pupil location. The image area around this location is further processed, and a parallelogram-based symmetric mass center algorithm is applied to locate the pupil center. In another approach, Lin et al. [20] threshold the image, remove artifacts by means of morphological operations, and apply inscribed parallelograms to determine the pupil center. Keil et al. [15] first locate corneal reflections; afterwards, the input image is thresholded, the pupil blob is searched for in the adjacency of the corneal reflection, and the centroid of pixels belonging to the blob is taken as the pupil center. San Agustin et al. [27] threshold the input image and extract points in the contour between pupil and iris, which are then fitted to an ellipse based on the RANSAC method to eliminate possible outliers. Świrski et al. [29] start with a coarse positioning using Haar-like features. The intensity histogram of the coarse position is clustered using k-means clustering, followed by a modified RANSAC-based ellipse fit. The above approaches have shown good detection rates and robustness in controlled settings, i.e., laboratory conditions.

Two recent methods, SET [10] and ExCuSe [7], explicitly address the aforementioned challenges associated with pupil detection in natural environments. SET [10] first extracts pupil pixels based on a luminance threshold. The resulting image is then segmented, and the segment borders are extracted using a convex hull method. Ellipses are fit to the segments based on their sinusoidal components, and the ellipse closest to a circle is selected as pupil. ExCuSe [7] first analyzes the input images with regard to reflections based on intensity histograms. Several processing steps based on edge detectors, morphologic operations, and the Angular Integral Projection Function are then applied to extract the pupil contour. Finally, an ellipse is fit to this line using the direct least squares method.

Although the latter two methods report substantial improvements over earlier methods, noise still remains a major issue. Thus, robust detection, which is critical in many online real-world applications, remains an open and challenging problem [7].
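To make the flavor of the simplest family of approaches above concrete, the following is a minimal sketch of the threshold-plus-iterative-mass-center scheme attributed to Pérez et al. [25]. The threshold value, window size, and convergence tolerance are illustrative assumptions, not the original parameters.

```python
# A minimal sketch of the threshold + iterative mass-center scheme described
# for Pérez et al. [25]; threshold, window size, and tolerance are
# illustrative assumptions, not the original parameters.
import numpy as np

def iterative_mass_center(image, threshold=50, win=40, tol=0.5, max_iter=20):
    center = np.array(image.shape, dtype=float) / 2  # start at the image center
    for _ in range(max_iter):
        y, x = center.astype(int)
        patch = image[max(y - win, 0):y + win, max(x - win, 0):x + win]
        dark = np.argwhere(patch < threshold)        # candidate pupil pixels
        if len(dark) == 0:
            break
        # Mass center of the dark pixels, mapped back to image coordinates.
        new_center = dark.mean(axis=0) + [max(y - win, 0), max(x - win, 0)]
        if np.linalg.norm(new_center - center) < tol:  # converged
            return new_center
        center = new_center
    return center
```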
3. Method

The overall workflow of the proposed algorithm is shown in Figure 2. In the first stage, the image is downscaled and divided into overlapping subregions. These subregions are evaluated by the first CNN, and the center of the subregion that evokes the highest CNN response is used as a coarse pupil position estimate. Afterwards, this initial estimate is fed into the second pipeline stage. In this stage, subregions surrounding the initial estimate of the pupil position in the original input image are evaluated using a second CNN. The center of the subregion that evokes the highest CNN response is chosen as the final pupil center location. This two-step approach has the advantage that the first step (i.e., coarse positioning) has to handle less noise because of the bicubic downscaling of the image and, consequently, involves less computational cost than detecting the pupil on the complete upscaled image.

Figure 2. Workflow of the proposed algorithm. First a CNN is employed to estimate a coarse pupil position based on subregions from a downscaled version of the input image. This position is then refined by a second CNN using subregions around the coarse estimate in the original input image.

In the following subsections, we delineate these pipeline stages and their CNN structures in detail, followed by the training procedure employed for each CNN.

3.1. Coarse Positioning Stage

The grayscale input images generated by the mobile eye tracker used in this work are sized 384×288 pixels. Directly employing CNNs on images of this size would demand a large amount of resources and would thus be computationally expensive, impeding their usage in state-of-the-art mobile eye trackers. Thus, one of the purposes of the first stage is to reduce computational costs by providing a coarse estimate that can in turn be used to reduce the search space for the exact pupil location. However, the main reason for this step is to reduce noise, which can be induced by different camera distances, changing sensory systems between head-mounted eye trackers [1, 5, 26], movement of the camera itself, or the usage of uncalibrated cameras (e.g., focus or white balance). To achieve this goal, the input image is first downscaled using bicubic interpolation, which employs a third-order polynomial in a two-dimensional space to evaluate the resulting values. In our implementation, we employ a downscaling factor of four, resulting in images of 96×72 pixels. Given that these images contain the entire eye, we chose a CNN input size of 24×24 pixels to guarantee that the pupil is fully contained within a subregion of the downscaled images. Subregions of the downscaled image are extracted by shifting a 24×24 pixel window with a stride of one pixel (see Figure 3a) and evaluated by the CNN, resulting in a rating within the interval [0,1] (see Figure 3b). These ratings represent the confidence of the CNN that the pupil center is within the subregion. Thus, the center of the highest rated subregion is chosen as the coarse pupil location estimate.

Figure 3. The downscaled image is divided into subregions of size 24×24 pixels with a stride of one pixel (a), which are then rated by the first stage CNN (b).

The core architecture of the first stage CNN is summarized in Figure 4. The first layer is a convolutional layer with kernel size 5×5 pixels, one pixel stride, and no padding. The convolutional layer is followed by an average pooling layer with window size 4×4 pixels and four pixel stride, which is connected to a fully connected layer with depth one. The output is then fed to a single perceptron, responsible for yielding the final rating within the interval [0,1]. We have evaluated this architecture for different amounts of filters in the convolutional layer and varying numbers of perceptrons in the fully connected layer; these values are reported in Section 5. The main idea behind the selected architecture is that the convolutional layer learns basic features, such as edges, approximating the pupil structure. The average pooling layer makes the CNN robust to small translations and blurring of these features (e.g., due to the initial downscaling of the input image). The fully connected layer incorporates deeper knowledge on how to combine the learned features for the coarse detection of the pupil position, using the logistic activation function to produce the final rating.

Figure 4. The coarse position stage CNN. The first layer consists of the shared weights or convolution masks, which are summarized by the average pooling layer. A fully connected layer then combines the features forwarded from the previous layer and delegates the final rating to a single perceptron.
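Since the architecture and the sliding-window search above fully determine the coarse stage, they can be summarized compactly in code. The paper's implementation uses MATLAB with the deep learning toolbox [24] (see Section 5); the following is a minimal PyTorch re-expression under the stated layer sizes, with all class and function names being illustrative, and the hidden-layer activation assumed to be logistic.

```python
# A minimal PyTorch sketch of the coarse stage described above; names and
# framework are illustrative assumptions, not the authors' MATLAB code [24].
import torch
import torch.nn as nn

class CoarseCNN(nn.Module):
    """Coarse stage: 24x24 input -> conv 5x5 -> avg pool 4x4/4 -> FC -> logistic unit."""
    def __init__(self, n_filters=8, n_perceptrons=8):  # e.g., the CK8P8 configuration
        super().__init__()
        self.conv = nn.Conv2d(1, n_filters, kernel_size=5, stride=1, padding=0)  # 24x24 -> 20x20
        self.pool = nn.AvgPool2d(kernel_size=4, stride=4)                        # 20x20 -> 5x5
        self.fc = nn.Linear(n_filters * 5 * 5, n_perceptrons)
        self.out = nn.Linear(n_perceptrons, 1)

    def forward(self, x):
        x = self.pool(self.conv(x))
        x = torch.sigmoid(self.fc(x.flatten(1)))  # hidden activation assumed logistic
        return torch.sigmoid(self.out(x))         # rating in [0, 1]

def coarse_estimate(model, image):
    """Slide a 24x24 window (stride 1) over a 72x96 float tensor (the downscaled
    image) and return the center of the highest rated subregion."""
    patches = image.unfold(0, 24, 1).unfold(1, 24, 1)  # (49, 73, 24, 24)
    h, w = patches.shape[:2]
    with torch.no_grad():
        ratings = model(patches.reshape(-1, 1, 24, 24)).reshape(h, w)
    y, x = divmod(torch.argmax(ratings).item(), w)
    return (y + 12, x + 12)  # window top-left offset plus half the window size
```

The second stage applies the same pattern with an 89×89 input and a ±10 pixel search around the upscaled coarse estimate, as described next.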
3.2. Fine Positioning Stage

Although the first stage yields an accurate pupil position estimate, it lacks precision due to the inherent error introduced by the downscaling step. Therefore, it is necessary to refine this estimate. This refinement could be attempted by applying methods similar to those described in Section 2 to a small window around the coarse pupil position estimate. However, since most of the previously mentioned challenges are not alleviated by using this small window, we chose to use a second CNN that evaluates subregions surrounding the coarse estimate in the original image.

The second stage CNN employs the same architecture pattern as the first stage (i.e., convolution ⇒ average pooling ⇒ fully connected ⇒ single logistic perceptron) since their motivations are analogous. Nevertheless, this CNN operates on a larger input resolution to provide increased precision. Intuitively, the input image for this CNN would be 96×96 pixels: the input size of the first CNN (24×24) multiplied by the downscaling factor (4). However, the resulting memory requirement for this size was larger than available on our test device; as a result, we utilized the closest working size possible: 89×89 pixels. The sizes of the other layers were adapted accordingly. The convolution kernels in the first layer were enlarged to 20 pixels to compensate for increased noise and motion blur. The dimension of the pooling window was increased by one pixel on each side, leading to a decreased input size for the fully connected layer and reduced runtime. This CNN uses eight convolution filters and eight perceptrons due to the increased size of the convolution filter and the input region. Subregions surrounding the coarse pupil position are extracted based on a window of size 89×89 pixels centered around the coarse estimate, which is shifted from −10 to 10 pixels (with a one pixel stride) horizontally and vertically. Analogously to the first stage, the center of the region with the highest CNN rating is selected as the fine pupil position estimate. Despite the higher computational costs of the second stage, our approach is highly efficient and can be run on today's conventional mobile computers.

3.3. CNN Training Methodology

Both CNNs were trained using supervised batch gradient descent [18] (explained in detail in the supplementary material) with a fixed learning rate of one. Unless specified otherwise, training was conducted for ten epochs with a batch size of 500. These decisions are aimed at an adaptive solution, where the training time must be relatively short (hence the small number of epochs and high learning rate) and no normalization (e.g., PCA whitening, mean extraction) can be performed. All CNNs' weights were initialized with random samples from a uniform distribution, thus accounting for symmetry breaking.

While stochastic gradient descent searches for minima in the error plane more effectively than batch learning [8, 23] when given valid examples, it is vulnerable to disastrous hops if given inadequate examples (e.g., due to poor performance of the traditional algorithm). In contrast, batch training dilutes this error. Nevertheless, we explored the impact of stochastic learning (i.e., using a batch size of one) as well as an increased number of training epochs in Section 5.

3.3.1 Coarse Positioning CNN

The coarse positioning CNN was trained on subregions extracted from the downscaled input images that fall into two different data classes: containing a valid (label = 1) or invalid (label = 0) pupil center. Training subregions were extracted by collecting all subregions whose centers are distant up to five pixels from the hand-labeled pupil center. Subregions whose centers are distant up to one pixel were labeled as valid examples, while the remaining subregions were labeled as invalid examples. As exemplified by Figure 5, this procedure results in nine valid and 32 invalid samples per hand-labeled data point; a sketch of this sampling scheme is given below.

Figure 5. Nine valid (top right) and 32 invalid (bottom) training samples for the coarse positioning CNN, extracted from a downscaled input image (top left).
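The nine valid offsets (all centers within one pixel) follow directly from the text; the layout of the 32 invalid offsets is given only graphically in Figure 5, so the ring used in the sketch below (Chebyshev radius four, 8·4 = 32 offsets) is an assumption that merely reproduces the stated counts.

```python
# A sketch of the coarse training-sample extraction (nine valid, 32 invalid
# subregions per labeled image). The invalid offset pattern is an assumption;
# boundary handling is omitted, assuming labeled pupils lie well inside the image.
import numpy as np

def extract_training_samples(image, pupil_yx, win=24):
    """image: downscaled grayscale array; pupil_yx: hand-labeled pupil center."""
    def subregion(cy, cx):
        top, left = cy - win // 2, cx - win // 2
        return image[top:top + win, left:left + win]

    valid, invalid = [], []
    # Valid: all subregions whose center lies within one pixel (3x3 = 9 samples).
    for dy in range(-1, 2):
        for dx in range(-1, 2):
            valid.append(subregion(pupil_yx[0] + dy, pupil_yx[1] + dx))
    # Invalid: 32 subregions farther from the center (Chebyshev ring of radius 4).
    r = 4
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if max(abs(dy), abs(dx)) == r:
                invalid.append(subregion(pupil_yx[0] + dy, pupil_yx[1] + dx))
    labels = [1] * len(valid) + [0] * len(invalid)
    return np.stack(valid + invalid), np.array(labels)
```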
We generated two training sets for the CNN responsible for the coarse position detection. The first training set consists of 50% of the images provided by related work from Fuhl et al. [7]. The remaining 50% of the images, as well as the additional data sets from this work (see Section 4), are used to evaluate the detection performance of our approach. The purpose of this training is to investigate how the coarse positioning CNN behaves on data it has never seen before. The second training set includes the first training set and 50% of the images of our new data sets; it is employed to evaluate the complete proposed method (i.e., coarse and fine positioning).

3.3.2 Fine Positioning CNN

The fine positioning CNN (responsible for detecting the exact pupil position) is trained similarly to the coarse positioning one. However, we extract only one valid subregion sample, centered at the hand-labeled pupil center, and eight equally spaced invalid subregion samples, centered five pixels away from the hand-labeled pupil center. This reduced amount of samples per hand-labeled data point relative to the coarse positioning training serves to constrain learning time, as well as main memory and storage consumption. We generated samples from 50% of the images from Fuhl et al. [7] and from the additional data sets. Out of these samples, we randomly selected 50% of the valid examples and 25% of the generated invalid examples for training.

4. Data Set

In this study, we used the extensive data sets introduced by Fuhl et al. [7], complemented by five additional hand-labeled data sets. Our additional data sets include 41,217 images collected during driving sessions on public roads for an experiment [12] that was not related to pupil detection; they were chosen due to the non-satisfactory performance of the proprietary pupil detection algorithm. These new data sets include fast changing and adverse illumination, spectacle reflections, and disruptive physiological eye characteristics (e.g., a dark spot on the iris); samples from these data sets are shown in Figure 6.

Figure 6. Samples from the additional data sets employed in this work. Each column belongs to a distinct data set. The top row includes non-challenging samples, which can be considered relatively similar to laboratory conditions and represent only a small fraction of each data set. The other two rows include challenging samples with artifacts caused by the natural environment.

5. Evaluation

Training and evaluation were performed on an Intel Core i5-4670 desktop computer with 8 GB RAM. This setup was chosen because it provides a performance similar to systems that are usually provided by eye-tracker vendors, thus enabling the actual eye-tracking system to perform other experiments alongside the evaluation. The algorithm was implemented using MATLAB (r2013b) combined with the deep learning toolbox [24]. During this evaluation, we employ the following naming convention: K_n and P_k signify that the CNN has n filters in the convolution layer and k perceptrons in the fully connected layer. We report our results in terms of the average pupil detection rate as a function of the pixel distance between the algorithmically established and the hand-labeled pupil center. Although the ground truth was labeled by experts in eye-tracking research, imprecision cannot be excluded. Therefore, the results are discussed for a pixel error of five (i.e., a pixel distance of five between the algorithmically established and the hand-labeled pupil center), analogously to [7, 29].
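Since this detection-rate curve is the central metric in the following subsections, a brief sketch of its computation may be useful; function and variable names are illustrative, not from the paper.

```python
# Detection rate as a function of the pixel distance between detected and
# hand-labeled pupil centers, as described above; names are illustrative.
import numpy as np

def detection_rate_curve(detected, labeled, max_error=14):
    """detected, labeled: (N, 2) arrays of pupil centers in pixels."""
    dist = np.linalg.norm(detected - labeled, axis=1)
    # Fraction of images whose detection error is within each pixel threshold.
    return [(dist <= e).mean() for e in range(max_error + 1)]

# The results in the paper are discussed at a pixel error of five, e.g.:
# rate_at_five = detection_rate_curve(detected, labeled)[5]
```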
5.1. Coarse Positioning

We start by evaluating the candidates from Table 1 for the coarse positioning CNN. All candidates follow the core architecture presented in Section 3.1, and each candidate has a specific number of filters in the convolution layer and perceptrons in the fully connected layer. Their names are prefixed with C for coarse and, as previously specified in Section 3.3, were trained for ten epochs with a batch size of 500. Moreover, since CK8P16 provided the best trade-off between performance and computational requirements, we chose this configuration to evaluate the impact of training for an extended period of time, resulting in the CNN CK8P16ext, which was trained for a hundred epochs with a batch size of 1000. Because of inferior performance, and for the sake of space, we omit the results for stochastic gradient descent learning but will make them available online.

CNN     | Conv. Kernels | Fully Conn. Perceptrons
CK4P8   | 4             | 8
CK8P8   | 8             | 8
CK8P16  | 8             | 16
CK16P32 | 16            | 32

Table 1. Evaluated configurations for the coarse positioning CNN (as described in Section 3.1).

Figure 7 shows the performance of the coarse positioning CNNs when trained using 50% of the images randomly chosen from all data sets and evaluated on all images. As can be seen in this figure, the number of filters in the first layer (compare CK4P8, CK8P8, and CK16P32) and extensive learning (see CK8P16 and CK8P16ext) have a higher impact than the number of perceptrons in the fully connected layer (compare CK8P8 to CK8P16). Moreover, these results indicate that the amount of filters in the convolutional layer has not yet been saturated (i.e., there are still high level features that can be learned to improve accuracy). However, it is important to notice that this is the most expensive parameter of the proposed CNN architecture in terms of computation time and, thus, further increments must be carefully considered.

Figure 7. Performance for the evaluated coarse CNNs trained on 50% of images from all data sets and evaluated on all images from all data sets (detection rate vs. downscaled pixel error).

To evaluate the performance of the coarse CNNs only on data they have not seen before, we additionally retrained these CNNs from scratch using only 50% of the images in the data sets provided by Fuhl et al. [7] and evaluated their performance solely on the new data sets provided by this work. These results were compared to those from the CNNs that were trained on 50% of images from all data sets and are shown in Figure 8.

Figure 8. Performance of coarse positioning CNNs on all images provided by this work (detection rate vs. downscaled pixel error). The solid lines belong to the CNNs trained on 50% of all images from all data sets. The dotted lines belong to the nl-CNNs, which were trained only on 50% of the data set provided by Fuhl et al. [7] and, thus, have not learned on examples from the evaluation data sets.
The CNNs that have not learned on the new data sets are identified by the suffix nl. All nl-CNNs exhibited a similar decrease in performance relative to their counterparts that were trained on samples from all the data sets. We hypothesize that this effect is due to the new data sets holding new information (i.e., containing challenging patterns not present in the training data); nevertheless, the CNNs generalize well enough to handle even these unseen patterns decently.

5.2. Fine Positioning

The fine positioning CNN (FK8P8) uses CK8P8 in the first stage and was trained on 50% of the data from all data sets. Evaluation was performed against four state-of-the-art algorithms, namely, ExCuSe [7], SET [10], Starburst [19], and Świrski [29]. Furthermore, we developed two additional fine positioning methods for this evaluation:

• CK8P8ray: this method employs CK8P8 to determine a coarse estimate, which is then refined by sending rays in eight equidistant directions with a maximum range of thirty pixels. The difference between every two adjacent pixels in the ray's trajectory is calculated, and the center of the opposite ray is calculated. Then, the mean of the two closest centers is used as the fine position estimate. This method serves as a reference for hybrid methods combining CNNs and traditional pupil detection methods.

• SK8P8: this method uses only a single CNN similar to the ones used in the coarse positioning stage, trained in an analogous fashion. However, this CNN uses an input size of 25×25 pixels to obtain an even center. This method serves as a reference for a parsimonious (although costlier than the original coarse positioning CNNs) single stage CNN approach and was designed to be employed on systems that cannot handle the second stage CNN.

For reference, the coarse positioning CNN used in the first stage of FK8P8 and CK8P8ray (i.e., CK8P8) is also shown.

All CNNs in this evaluation were trained on 50% of the images randomly selected from all data sets. To avoid biasing the evaluation towards the data sets introduced by this work, we considered two different evaluation scenarios. First, we evaluate the selected approaches only on images from the data sets introduced by Fuhl et al. [7]; in a second step, we perform the evaluation on all images from all data sets. The outcomes are shown in Figures 9a and 9b, respectively. Finally, we evaluated the performance on all images not used for training from all data sets. This provides a realistic evaluation strategy for the aforementioned adaptive solution. The results are shown in Figure 9c.

Figure 9. All CNNs were trained on 50% of images from all data sets. Performance for the selected approaches (detection rate vs. pixel error) on (a) all images from the data sets from [7], (b) all images from all data sets, and (c) all images not used for training from all data sets.

In all cases, both FK8P8 and SK8P8 surpass the best performing state-of-the-art algorithm, by approximately 25% and 15%, respectively. Moreover, even with the penalty due to the upscaling error, the evaluated coarse positioning approaches mostly exhibit an improvement relative to the state-of-the-art algorithms. The hybrid method CK8P8ray did not display a significant improvement relative to the coarse positioning; as previously discussed, this behavior is expected, as the traditional pupil detection methods are still afflicted by the aforementioned challenges, regardless of the availability of the coarse pupil position estimate. Although the proposed method (FK8P8) exhibits the best pupil detection rate, it is worth highlighting the performance of the SK8P8 method combined with its reduced computational costs. Without accounting for the downscaling operation, SK8P8 has an operating cost of eight convolutions ((6×6)∗(20×20)∗8 = 115200 FLOPS) plus eight average-pooling operations ((20×20)∗8 = 3200 FLOPS) plus (5×5×8)∗8 = 1600 FLOPS from the fully connected layer and 8 FLOPS from the last perceptron, totaling 120008 FLOPS per run. Given an input image of size 96×72 and the input size of 25×25, 72×48 = 3456 runs are necessary, requiring ≈ 415×10^6 FLOPS without accounting for extra operations (e.g., load/store). These can be performed in real-time even on the accompanying eye tracker system CPU, which yields a baseline of 48 GFLOPS [9].
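The cost estimate above can be verified with a short calculation, using only the layer sizes stated in the text (25×25 input, 6×6 kernels yielding 20×20 maps, 5×5 pooled maps, eight filters, eight perceptrons):

```python
# Sanity check of the SK8P8 cost estimate above, using the layer sizes
# stated in the text; this is a worked recomputation, not new analysis.
conv = (6 * 6) * (20 * 20) * 8        # 115200 FLOPS for the eight convolutions
pool = (20 * 20) * 8                  # 3200 FLOPS for the average pooling
fc = (5 * 5 * 8) * 8                  # 1600 FLOPS for the fully connected layer
out = 8                               # final perceptron
per_run = conv + pool + fc + out      # 120008 FLOPS per subregion
runs = (96 - 25 + 1) * (72 - 25 + 1)  # 72 * 48 = 3456 sliding-window positions
total = per_run * runs                # 414,747,648 ~ 415e6 FLOPS per frame
print(per_run, runs, total)
```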
5.3. CNN Operational Behavior Analysis

To analyze the patterns learned by the CNNs in detail, we further inspected the filters and weights of CK8P8. Notably, the CNNs had learned similar filters and weights. Representatively, we chose to report on CK8P8, since it can be more easily visualized due to its reduced size.

The first row of Figure 10 shows the filters learned in the convolution layer, whereas the second row shows the sign of these filters' weights, where white and black represent positive and negative weights, respectively. Filter (e) resembles a center surround difference, and the remaining filters contain round shapes, most probably performing edge detection. It is worth noticing that the filters appear to have developed in complementing pairs (i.e., a filter and its inverse) to some extent. This behavior can be seen in the pairs (a,c), (b,d), and (f,g), while filter (e) could be paired with (h) if the latter further developed its top and bottom right corners. Furthermore, the convolutional layer response based on these filters when given a valid subregion as input is demonstrated in Figure 11. The first row displays the filter responses, and the second row shows positive (white) and negative (black) responses.

Figure 10. CK8P8 filters in the convolutional layer. The first row displays the intensity of the filters' weights, and the second row indicates whether the weight was positive (white) or negative (black). For visualization, the filters were resized by a factor of twenty using bicubic interpolation and normalized to the range [0,1].

The weights of all perceptrons in the fully connected layer are also displayed in Figure 11. In the fully connected layer area, the first column identifies the perceptron in the fully connected layer (i.e., from p1 to p8), and the other columns display the respective weights for each of the filters from Figure 10 (i.e., from (a) to (h)). Since the output weights assigned to perceptrons p1, p2, p5, and p8 are positive, these patterns will respond to centered pupils, while the opposite is true for perceptrons p3, p4, p6, and p7 (see the single perceptron layer in Figure 11). This behavior is caused by the use of a sigmoid function with outputs ranging between zero and one; if a tanh function were used instead, negative weights could produce positive responses. Based on the negative and positive values (Figure 10, second row of the convolutional layer), the filters (b) and (f) display opposite responses. Based on the input coming from the average pooling layer, p2 displays the best fitting positive weight map for the filter response (a) in the first row of the convolutional layer. Similarly, p1 provides a best fit for (b) and (d), and both p1 and p8 provide good fits for (c) and (f); no perceptron presented an adequate fit for (g) and (h), indicating that these filters are possibly employed to respond to off-center pupils. Moreover, all negatively weighted perceptrons present high values at the center for the response of filter (e) (the possible center surround filter), which could be employed to ignore small blobs. In contrast, p5 (a positively weighted perceptron) weights center responses high.

Figure 11. CK8P8 response to a valid subregion sample. The convolutional layer includes the filter responses in the first row and positive (in white) and negative (in black) responses in the second row. The fully connected layer shows the weight maps for each perceptron/filter pair. For visualization, the filters were resized by a factor of twenty using bicubic interpolation and normalized to the range [0,1]. The filter order (i.e., (a) to (h)) matches that of Figure 10.
6. Conclusion

We presented a naturally motivated pipeline of specifically configured CNNs for robust pupil detection and showed that it outperforms state-of-the-art approaches by a large margin while avoiding high computational costs. For the evaluation, we used over 79,000 hand-labeled images – 41,000 of which were complementary to existing images from the literature – from real-world recordings with artifacts such as reflections, changing illumination conditions, occlusion, etc. Especially for this challenging data set, the CNNs reported considerably higher detection rates than state-of-the-art techniques. Looking forward, we are planning to investigate the applicability of the proposed pipeline to online scenarios, where continuous adaptation of the parameters is a further challenge.

References

[1] R. A. Boie and I. J. Cox. An analysis of camera noise. IEEE Transactions on Pattern Analysis & Machine Intelligence, 1992.
[2] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.
[3] L. Cowen, L. J. Ball, and J. Delin. An eye movement analysis of web page usability. In People and Computers XVI – Memorable Yet Invisible, pages 317–335. Springer, 2002.
[4] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, 2012.
[5] D. Dussault and P. Hoess. Noise performance comparison of ICCD with CCD and EMCCD cameras. In Optical Science and Technology, the SPIE 49th Annual Meeting, pages 195–204. International Society for Optics and Photonics, 2004.
[6] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[7] W. Fuhl, T. Kübler, K. Sippel, W. Rosenstiel, and E. Kasneci. ExCuSe: Robust pupil detection in real-world scenarios. In CAIP 16th Inter. Conf. Springer, 2015.
[8] T. M. Heskes and B. Kappen. On-line learning processes in artificial neural networks. North-Holland Mathematical Library, 51:199–233, 1993.
[9] Intel Corporation. Intel Core i7-3500 Mobile Processor Series. Accessed: 2015-11-02.
[10] A.-H. Javadi, Z. Hakimi, M. Barati, V. Walsh, and L. Tcheang. SET: a pupil detection method using sinusoidal approximation. Frontiers in Neuroengineering, 8, 2015.
[11] E. Kasneci. Towards the automated recognition of assistance need for drivers with impaired visual field. PhD thesis, Universität Tübingen, Germany, 2013.
[12] E. Kasneci, K. Sippel, K. Aehling, M. Heister, W. Rosenstiel, U. Schiefer, and E. Papageorgiou. Driving with binocular visual field loss? A study on a supervised on-road parcours with simultaneous eye and head tracking. PLoS One, 9(2):e87470, 2014.
[13] E. Kasneci, K. Sippel, M. Heister, K. Aehling, W. Rosenstiel, U. Schiefer, and E. Papageorgiou. Homonymous visual field loss and its impact on visual exploration: A supermarket study. TVST, 3, 2014.
[14] M. Kassner, W. Patera, and A. Bulling. Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, pages 1151–1160. ACM, 2014.
[15] A. Keil, G. Albuquerque, K. Berger, and M. A. Magnor. Real-time gaze tracking with a consumer-grade video camera. 2010.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
[19] D. Li, D. Winfield, and D. J. Parkhurst. Starburst: A hybrid algorithm for video-based eye tracking combining feature-based and model-based approaches. In CVPR Workshops 2005. IEEE Computer Society Conference on. IEEE, 2005.
[20] L. Lin, L. Pan, L. Wei, and L. Yu. A robust and accurate detection of pupil images. In BMEI 2010, volume 1. IEEE, 2010.
[21] X. Liu, F. Xu, and K. Fujimura. Real-time eye detection and tracking for driver observation under various light conditions. In Intelligent Vehicle Symposium, 2002. IEEE, volume 2, 2002.
[22] X. Long, O. K. Tonguz, and A. Kiderman. A high speed eye tracking system with robust pupil center estimation algorithm. In EMBS 2007. IEEE, 2007.
[23] G. B. Orr. Dynamics and algorithms for stochastic learning. PhD thesis, Department of Computer Science and Engineering, Oregon Graduate Institute, Beaverton, OR, 1995. ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch1-5.ps.Z, orrPhDch6-9.ps.Z.
[24] R. B. Palm. Prediction as a candidate for learning deep hierarchical models of data. Technical University of Denmark, 2012.
[25] A. Pérez, M. Cordoba, A. Garcia, R. Méndez, M. Munoz, J. L. Pedraza, and F. Sanchez. A precise eye-gaze detection and tracking system. 2003.
[26] Y. Reibel, M. Jung, M. Bouhifd, B. Cunin, and C. Draman. CCD or CMOS camera noise characterisation. The European Physical Journal Applied Physics, 21(01):75–80, 2003.
[27] J. San Agustin, H. Skovsgaard, E. Mollenbach, M. Barret, M. Tall, D. W. Hansen, and J. P. Hansen. Evaluation of a low-cost open-source gaze tracker. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pages 77–80. ACM, 2010.
[28] S. K. Schnipke and M. W. Todd. Trials and tribulations of using an eye-tracking system. In CHI '00 Extended Abstracts. ACM, 2000.
[29] L. Świrski, A. Bulling, and N. Dodgson. Robust real-time pupil tracking in highly off-axis images. In Proceedings of the Symposium on ETRA. ACM, 2012.
[30] S. Trösterer, A. Meschtscherjakov, D. Wilfinger, and M. Tscheligi. Eye tracking in the car: Challenges in a dual-task scenario on a test track. In Proceedings of the 6th Automotive UI. ACM, 2014.
[31] J. Turner, J. Alexander, A. Bulling, D. Schmidt, and H. Gellersen. Eye pull, eye push: Moving objects between large screens and personal devices with gaze and touch. In Human-Computer Interaction – INTERACT 2013, pages 170–186. Springer, 2013.
[32] N. Wade and B. W. Tatler. The moving tablet of the eye: The origins of modern eye movement research. Oxford University Press, 2005.
[33] A. Yarbus. The perception of an image fixed with respect to the retina. Biophysics, 2(1):683–690, 1957.