DOCUMENT RESUME

ED 482 122    IR 058 806

AUTHOR          Milekic, Slavko
TITLE           The More You Look the More You Get: Intention-Based Interface Using Gaze-Tracking.
PUB DATE        2003-00-00
NOTE            20p.; In: Museums and the Web 2003: Selected Papers from an International Conference (7th, Charlotte, NC, March 19-22, 2003); see IR 058 801. Includes 16 figures and 27 references.
AVAILABLE FROM  http://www.archimuse.com/mw2003/papers/milekic/milekic.html
PUB TYPE        Reports - Descriptive (141); Speeches/Meeting Papers (150)
EDRS PRICE      EDRS Price MF01/PC01 Plus Postage.
DESCRIPTORS     Computer Interfaces; Eye Contact; *Eye Movements; Eyes; *Intention; Visual Measures; *Visual Perception
IDENTIFIERS     Digital Technology; *Gaze Patterns; *Visual Tracking

Museums and the Web 2003: Papers

The More You Look The More You Get: Intention-Based Interface Using Gaze-Tracking

Slavko Milekic, The University of the Arts, USA
http://www.uarts.edu

Abstract

Only a decade ago eye- and gaze-tracking technologies using cumbersome and expensive equipment were confined to university research labs. However, rapid technological advancements (increased processor speed, advanced digital video processing) and mass production have both lowered the cost and dramatically increased the efficacy of eye- and gaze-tracking equipment. This opens up a whole new area of interaction mechanisms with museum content. In this paper I will describe a conceptual framework for an interface, designed for use in museums and galleries, which is based on non-invasive tracking of a viewer's gaze direction. Following the simple premise that prolonged visual fixation is an indication of a viewer's interest, I dubbed this approach intention-based interface.

Keywords: eye tracking, gaze tracking, intention-based interface

Introduction

In humans, gaze direction is probably the oldest and earliest means of communication at a distance.
Parents of young infants often try to 'decode' the needs and interests of their child from the infant's gaze direction. Thus, gaze direction can be viewed as a first instance of pointing. A number of developmental studies (Scaife and Bruner, 1975; Corkum and Moore, 1988; Moore, 1999) show that even very young infants actively follow and respond to the gaze direction of their caregivers. The biological significance of eye movements and gaze direction in humans is illustrated by the fact that humans, unlike other primates, have a visible white area (sclera) around the pigmented part of the eye (the iris, covered by the transparent cornea; see Figure 1). This makes even discrete shifts of gaze direction very noticeable (as is painfully obvious in cases of 'lazy eye').

Figure 1. Comparison of the human and non-human (chimpanzee) eye. Although many animals have pigmentation that accentuates the eyes, the visible white area of the human eye makes it easier to interpret gaze direction.

Eye contact is one of the first behaviors to develop in young infants. Within the first few days of life, infants are capable of focusing on their caregiver's eyes. (Infants are physiologically shortsighted, with an ideal focusing distance of 25-40 cm. This distance corresponds to the distance between the mother's and infant's eyes when the baby is held at breast level. Everything else is conveniently a blur.) Within the first few weeks, establishing eye contact with the caregiver produces a smiling reaction (Stewart & Logan, 1998). Eye contact and gaze direction continue to play a significant role in social communication throughout life. Examples include:

- regulating conversation flow;
- regulating intimacy levels;
- indicating interest or disinterest;
- seeking feedback;
- expressing emotions;
- influencing;
- signaling and regulating social hierarchy;
- indicating submissiveness or dominance.

Thus, it is safe to assume that humans have a large number of behaviors associated with eye movements and gaze direction. Some of these are innate (the orientation reflex, social regulation), and some are learned (extracting information from printed text, interpreting traffic signs). Our relationship with works of art is essentially a social and intimate one. In the context of designing a gaze-tracking-based interface with cultural heritage information, innate visual behaviors may play a significant role precisely because they are social and emotional in nature and have the potential to elicit a reaction external to the viewer. In this paper I will provide a conceptual framework for the design of gaze-based interactions with cultural heritage information using the digital medium. Before we proceed, it is necessary to clarify some of the basic physiological and technological terms related to eye- and gaze-tracking.

Eye Movements and Visual Perception

While we are observing the world, our subjective experience is that of a smooth, uninterrupted flow of information and a sense of the wholeness of the visual field. This, however, contrasts sharply with what actually happens during visual perception. Our eyes are stable only for brief periods of time (200-300 milliseconds) called fixations. Fixations are interspersed with rapid, jerky movements called saccades. During these movements no new visual information is acquired. Furthermore, the information gained during the periods of fixation is clear and detailed only in a small area of the visual field - about 2° of visual angle. Practically, this corresponds to the area covered by one's thumb at arm's length. The rest of the visual field is fuzzy but provides enough information for the brain to plan the location of the next fixation point.
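The interaction mechanisms discussed later in this paper operate on fixations rather than on a tracker's raw sample stream. The paper itself contains no code; as a loose illustration of how such segmentation is commonly done, here is a minimal sketch of a dispersion-based (I-DT) fixation detector. The function name, thresholds, and data shapes are my assumptions, not anything specified by the author.

```python
# Minimal sketch of dispersion-based (I-DT) fixation detection. Not from the
# paper; thresholds and data shapes are illustrative assumptions.

def detect_fixations(samples, max_dispersion_deg=1.0, min_duration_ms=200):
    """samples: list of (t_ms, x_deg, y_deg) gaze points, in time order.
    Returns a list of fixations as (start_ms, end_ms, x_deg, y_deg)."""
    fixations, window = [], []

    def flush(points):
        # Emit the window as a fixation if it lasted long enough.
        if points and points[-1][0] - points[0][0] >= min_duration_ms:
            cx = sum(x for _, x, _ in points) / len(points)
            cy = sum(y for _, _, y in points) / len(points)
            fixations.append((points[0][0], points[-1][0], cx, cy))

    for sample in samples:
        window.append(sample)
        xs = [x for _, x, _ in window]
        ys = [y for _, _, y in window]
        # Dispersion = horizontal extent + vertical extent of the window.
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion_deg:
            flush(window[:-1])   # the spread-out sample starts a saccade
            window = [sample]
    flush(window)                # trailing fixation at end of stream
    return fixations
```

With fixations in the 200-300 ms range described above, a detector like this reduces a tracker's sample stream to a few events per second, which is the granularity at which the gaze-based interactions below would operate.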
The problems that arise because of the discrepancy between our subjective experience and the data gained by using eye-tracking techniques can be illustrated by the following example:

[Figure: the sentence "The horse raced past the barn fell." shown with eye-tracking data superimposed: the sequence of seven fixations, numbered in order, and their durations in milliseconds.]

The sentence above is a classical example of a "garden path" sentence that (as you probably have experienced) initially leads the reader to a wrong interpretation (Bever, 1970). The eye-tracking data provide information about the sequence of fixations (numbered 1 to 7) and their duration in milliseconds. The data provide some clues about the relationship between visual analysis during reading and eye movements. For example, notice the presence of two retrograde saccades (numbered 6 and 7) that happened after the initial reading of the sentence. They more than double the total fixation time on the part of the sentence necessary for disambiguation of its meaning. Nowadays there is a general consensus in the eye-tracking community that the number and duration of fixations are related to the cognitive load imposed during visual analysis.

Figure 2. Illustration of differences in gaze paths while interpreting I. Repin's painting "They Did Not Expect Him." Path (1) corresponds to free exploration. Path (2) was obtained when subjects were asked to judge the material status of the family, and path (3) when they were asked to guess the ages of different individuals. Partially reproduced from Yarbus, A. L. (1967).

Eye-tracking studies of reading are very complex but have the advantage of allowing fine control of different aspects of the visual stimuli (complexity, length, exposure time, etc.). Interpretation of eye movement data during scene analysis is more complicated because visual exploration strategy is heavily dependent on the context of exploration. Data (Figure 2) from an often-cited study by Yarbus (1967) illustrate differences in visual exploration paths during interpretation of Ilya Repin's painting "They Did Not Expect Him" (or "The Unexpected Guest").

Brief History of Eye- and Gaze-Tracking

The history of documented eye- and gaze-tracking studies is over a hundred years old (Javal, 1878). It is a history of technological and theoretical advances in which progress in either area would influence the other, often producing a burst of research activity that would subsequently subside due to the uncovering of a host of new problems associated with the practical uses of eye-tracking. Not surprisingly, the first eye-tracking studies used other humans as tracking instruments by utilizing strategically positioned mirrors to infer gaze direction. Experienced psychotherapists (and socially adept individuals) still use this technique, which, however imperfect it may seem, can yield a surprising amount of useful information. Advancements in photography led to the development of a technique based on capturing the light reflected from the cornea on a photographic plate (Dodge & Cline, 1901).
Some of these techniques were fairly invasive, requiring the placement of a reflective white dot directly onto the eye of the viewer (Judd, McAllister & Steele, 1905) or of a tiny mirror attached to the eye with a small suction cup (Yarbus, 1967). In the field of medicine a technique was developed (electro-oculography, still in use for certain diagnostic procedures) that allows the registering of eyeball movements using a number of electrodes positioned around the eye. Most of the described techniques required the viewer's head to be motionless during eye tracking and used a variety of devices like chin rests, head straps and bite-bars to constrain head movements. The major innovation in eye tracking was the invention of a head-mounted eye tracker (Hartridge & Thompson, 1948). With technological advances that reduced the weight and size of an eye tracker to that of a laptop computer, this technique is still widely used. Most eye-tracking techniques developed before the 1970s were further constrained by the fact that data analysis was possible only after the act of viewing. It was the advent of mini- and microcomputers that made real-time eye tracking possible. Although widely used in studies of perceptual and cognitive processes, it was only with the proliferation of personal computers in the 1980s that eye tracking was applied as an instrument for the evaluation of human-computer interaction (Card, 1984). Around the same time, the first proposals for the use of eye tracking as a means of user-computer communication appeared, focusing mostly on users with special needs (Hutchinson, 1989; Levine, 1981). Promoted by rapid technological advancements, this trend continued, and in the past decade a substantial amount of effort and money has been devoted to the development of eye- and gaze-tracking mechanisms for human-computer interaction (Vertegaal, 1999; Jacob, 1991; Zhai, Morimoto & Ihde, 1999). Detailed analysis of these studies is beyond the scope of this paper, and I will refer to them only insofar as they provide reference points for my proposed design. Interested readers are encouraged to consult several excellent publications that deal with the topic in much greater detail (Duchowski, 2002; Jacob & Karn, 2003, in press).

Eye and Gaze Tracking in a Museum Context

The use of eye and gaze tracking in a museum context extends beyond interactions with the digital medium. Eye-tracking data can prove to be extremely useful in revealing how humans observe real artifacts in a museum setting. The sample data and the methodology from a recent experiment conducted in the National Gallery in London (in conjunction with the Institute for Behavioural Studies) can be seen on the Web. Although some of my proposed gaze-based interaction solutions can be applied to the viewing of real artifacts (for example, to get more information about a particular detail that a viewer is interested in), the main focus of my discussion will be on the development of affordable and intuitive gaze-based interaction mechanisms with(in) the digital medium. The main reason for this decision is the issue of accessibility to cultural heritage information. Although an impressive 4000 people participated in the National Gallery experiment, they all had to be there at a certain time.
I am not disputing the value of experiencing the real artifact, but the introduction of the digital medium has dramatically shifted the role of museums from collection & preservation to dissemination & exploration. Recent advancements in Web-based technologies make it possible for museums to develop tools (and social contexts) that allow them to serve as centers of knowledge transfer for both local and virtual communities. My proposal will focus on three issues:

1. problems associated with the use of gaze-tracking data as an interaction mechanism;
2. a conceptual framework for the development of a gaze-based interface;
3. currently existing (and affordable) technologies that could support non-intrusive eye and gaze tracking in a museum context.

I. Problems associated with gaze tracking input as an interaction mechanism

The main problem associated with the use of eye movements and gaze direction as an interaction mechanism is known in the literature as the "Midas touch" or "clutch" problem (Jacob, 1993). In simple terms, the problem is that if looking at something should trigger an action, one would be triggering this action even by just observing a particular element on the display (or projection). The problem has been addressed numerous times in the literature, and there are many proposed technical solutions. A detailed analysis and overview of these solutions is beyond the scope of this paper; I will present here only a few illustrative examples.

One of the solutions to the Midas touch problem, developed by the Risø National Research Laboratory, was to separate the gaze-responsive area from the observed object. The switch (aptly named Eye Con) is a square button placed next to the object that one wants to interact with. When the button is focused on (ordinarily for half a second), it 'acknowledges' the viewer's intent to interact with an animated sequence depicting a gradually closing eye. The completely closed eye is equivalent to the pressing of a button (see Figure 3).

Figure 3. An Eye Con activation sequence. Separating the control mechanism from interactive objects allows natural observation of the object (image reproduced from Glenstrup, A. J., & Engell-Nielsen, T., 1995).

One of the problems with this technique comes from the very solution itself - the separation of selection and action. The other problem is the interruption of the flow of interaction - in order to select (interact with) an object, the user has to focus on the action button for a period of time. This undermines the unique quality of gaze direction as the fastest and most natural means of pointing and selection (focus). Another solution to the same problem (with very promising results) was to provide the 'clutch' for interaction through another modality - voice (Glenn, Iavecchia, Ross, Stokes, Weiland, Weiss & Zaklad, 1986) or manual (Zhai, Morimoto & Ihde, 1999) input.

The second major problem with eye movement input is the sheer volume of data collected during eye tracking and its meaningful analysis. Since individual fixations carry very little meaning on their own, a wide range of eye-tracking metrics has been developed in the past 50 years. An excellent and very detailed overview of these metrics can be found in Jacob & Karn (2003, in press). Here, I will mention only a few that may be used to infer a viewer's interest or intent (a small sketch of how they might be computed follows the list):

- number of fixations: a concentration of a large number of fixations in a certain area may be related to a user's interest in the object or detail presented in that area when viewing a scene (or a painting). Repeated, retrograde fixations on a certain word while reading text are taken to be indicators of increased processing load (Just & Carpenter, 1976).
- gaze duration: a gaze is defined as a number of consecutive fixations in an area of interest. Gaze duration is the total of the fixation durations in a particular area.
- number of gazes: this is probably a more meaningful metric than the number of fixations. Combined with gaze duration, it may be indicative of a viewer's interest.
- scan path: the scan path is a line connecting consecutive fixations (see Figure 2, for example). It can be revealing of a viewer's visual exploration strategies and is often very different in experts and novices.
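As a concrete illustration (mine, not the author's), these metrics might be derived from a fixation list as follows. The fixation format, the rectangular area-of-interest (AOI) representation, and the function names are all assumptions for the sake of the sketch.

```python
# Illustrative sketch: deriving the metrics above from a list of fixations,
# each (start_ms, end_ms, x, y), and a rectangular area of interest (AOI).
# Names and data shapes are hypothetical, not from the paper.

def in_aoi(fixation, aoi):
    _, _, x, y = fixation
    left, top, right, bottom = aoi
    return left <= x <= right and top <= y <= bottom

def gaze_metrics(fixations, aoi):
    hits = [f for f in fixations if in_aoi(f, aoi)]
    # A 'gaze' is a run of consecutive fixations inside the AOI.
    gaze_count, previously_inside = 0, False
    for f in fixations:
        inside = in_aoi(f, aoi)
        if inside and not previously_inside:
            gaze_count += 1
        previously_inside = inside
    return {
        "fixation_count": len(hits),
        "gaze_duration_ms": sum(end - start for start, end, _, _ in hits),
        "gaze_count": gaze_count,
        # The scan path is simply the ordered list of fixation centers.
        "scan_path": [(x, y) for _, _, x, y in fixations],
    }
```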
The problem of finding the right metric for the interpretation of eye movements in a gallery/museum setting is more difficult than in a conventional research setting because of the complexity of the visual stimuli and the wide individual differences among users. However, the problem may be made easier to solve by dramatically constraining the number of interactions offered by a particular application and making them correspond to the user's expectations. For example, one of the applications of the interface I will propose is a simple gaze-based browsing mechanism that allows the viewer to quickly and effortlessly leaf through a museum collection (even if he/she is quadriplegic and has retained only the ability to move the eyes).

II. Gaze-based interface for museum content

Needless to say, even a gaze-based interface that is specifically designed for museum use has to provide a solution to the general problems associated with the use of eye movement-based interactions. I will approach this issue by analyzing three different strategies that may lead to a solution of the Midas touch problem. These strategies differ in terms of the key element of the interaction mechanism, as it relates to:

- time,
- location, and
- user action.

It is clear that any interaction involves time, space and actions, so the above classification should be taken to refer to the key component of the interface solution. Each of these solutions has to accommodate two modes of operation:

- the observation mode, and
- the action (command) mode.

The viewer should have a clear indication as to which mode is currently active, and the interaction mechanism should provide a way to switch between the modes quickly and effortlessly.

Time-based interfaces

At first glance, a time-based interface seems like a good choice (evidently it was for me when choosing the title of this paper). An ideal setup (for which I will provide more details in the following sections) for this type of interface would be a high-resolution projection of a painting on a screen, with an eye-tracking system concealed in a small barrier in front of the user. An illustration of a time-based interaction mechanism is provided in Figure 4. The gaze location is indicated by a traditional cursor as long as it remains in a non-active area (in this case, outside of the painting). When the user shifts the gaze to the gaze-sensitive object (the painting), the cursor changes its shape to a faint circle, indicating that the observed object is aware of the user's attention.
I have chosen the circle shape because it does not interfere with the viewer's observation, even though it clearly indicates potential interaction. As long as the viewer continues visual exploration of the painting, there is no change in status. However, if the viewer decides to focus on a certain area for a predetermined period of time (600 ms), the cursor/circle starts to shrink (zoom), indicating the beginning of the focusing procedure.

Figure 4. The cursor changes at position (A) into a focus area, indicating that the object is 'hot'. Position (B) marks the period of relative immobility of the cursor and the beginning of the focusing procedure. A relative change in the size of the focus area (C) indicates that focusing is taking place. The appearance of concentric circles at time (D) indicates imminent action. The viewer can exit the focusing sequence at any time by moving the point of observation outside of the current focus area.

If the viewer continues to fixate on the area of interest, the focusing procedure continues for the next 400 milliseconds, ending with a 200-millisecond-long signal of imminent action. At any time during the focusing sequence (including the imminent action signal), the viewer can return to observation mode by moving the gaze away from the current fixation point. In the scenario depicted above (and in general, for time-based interactions) it is desirable to have only one pre-specified action relevant to the context of viewing. For example, the action can be that of zooming in on the observed detail of the painting (see Figure 6), or proceeding to the next item in the museum collection.
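A rough sketch of this timing logic (600 ms dwell, 400 ms focusing, 200 ms imminent-action signal) follows. The state names, the per-sample tick interface, and the single action callback are my assumptions; the paper specifies only the timing and the abort-by-looking-away behavior.

```python
# Hedged sketch of the time-based focusing sequence described above.
# State names, the tick interface and the action callback are assumptions.

OBSERVING, FOCUSING, IMMINENT = "observing", "focusing", "imminent"

class DwellActivator:
    DWELL_MS, FOCUS_MS, SIGNAL_MS = 600, 400, 200

    def __init__(self, trigger_action):
        self.trigger_action = trigger_action  # the single pre-specified action
        self.state = OBSERVING
        self.dwell_start = None

    def tick(self, t_ms, gaze_moved_away):
        """Call once per gaze sample; gaze_moved_away means the fixation
        left the current focus area, which aborts the whole sequence."""
        if gaze_moved_away:
            self.state, self.dwell_start = OBSERVING, None
            return
        if self.dwell_start is None:
            self.dwell_start = t_ms
        held = t_ms - self.dwell_start
        if held >= self.DWELL_MS + self.FOCUS_MS + self.SIGNAL_MS:
            self.trigger_action()          # e.g. zoom in, or show next artifact
            self.state, self.dwell_start = OBSERVING, None
        elif held >= self.DWELL_MS + self.FOCUS_MS:
            self.state = IMMINENT          # concentric circles: action imminent
        elif held >= self.DWELL_MS:
            self.state = FOCUSING          # circle starts to shrink
```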
The drawbacks of time-based interaction solutions triggered by focusing on the object/area of interest are as follows:

- the problem of going back to observation mode. The action triggered by focusing on a certain area has to be either self-terminating (as is the case with the 'display the next artifact' action, where the application switches automatically back to observation mode), or one has to provide a simple mechanism that allows the viewer to return to observation mode (for example, by moving the gaze focus outside of the object boundary);
- the problem of choice between multiple actions. Using the time-based mechanism, it is possible to trigger different actions. By changing the cursor/focus shape, one can also indicate to the viewer which action is going to take place. However, since the actions are tied to the objects themselves, the viewer essentially has no choice but to accept the pre-specified action. This may not be a problem in a context where pre-specified actions are meaningful and correspond to the viewer's expectations. However, it does limit the number of actions one can 'pack' into an application, and it can create confusion in cases where two instances of focusing on the same object may trigger different actions;
- the problem of interrupted flow, or waiting. Inherent to time-based solutions is the problem that the viewer always has to wait for an action to be executed. In my experience, after getting acquainted with the interaction mechanism, the waiting time becomes subjectively longer (because users know what to expect) and often leads to frustration. The problem can be diminished to some extent by progressively shortening the duration of focusing necessary to trigger the action. However, at some point this can lead to another source of frustration, since the viewer may be forced to constantly shift the gaze around in order to stay in observation mode.

In spite of the above-mentioned problems, time-based gaze interactions can be an effective solution for museum use where longer observation of an area of interest provides the viewer with more information. Another useful approach is to use gaze direction as input for the delivery of additional information through another modality. In this case, the viewer does not need to get visual feedback related to his/her eye movements (which can be distracting on its own). Instead, focusing on an area of interest may trigger voice narration related to the viewer's interest. For an example of this technique in the creation of a gaze-guided interactive narrative, see Starker & Bolt (1990).

Location-based interfaces

Another traditional way of solving the "clutch" problem in gaze-based interfaces is to separate the modes of observation and action by using controls that are in the proximity of the area of interest but do not interfere with visual inspection. I have already described Eye Cons (Figure 3), designed by the Risø National Research Laboratory in Denmark (for a detailed description see Glenstrup and Engell-Nielsen, 1995). In the following section I will first expand on the Eye Cons design and then propose another location-based interaction mechanism. The first approach is illustrated in Figure 5.

Figure 5. Movement of the cursor (A) into the gaze-sensitive area (B) slides the action palette (C) into view. Fixating any of the buttons is equivalent to a button press and chooses the specified action, which is executed without delay when the gaze returns to the object of interest. The viewer can also return to observation mode by choosing the 'no action' button. The action palette slides out of view as soon as the gaze moves out of area (B).

The observation area (the drawing) and the controls (buttons) are separated. At first glance, the design seems very similar to that of the Eye Cons, but there are some enhancements that make the interactions more efficient. First, the controls (buttons) are located on a configurable 'sliding palette', a mechanism that was adopted by the most widely used operating system (Windows) in order to provide users with more 'screen real estate'. The reason for doing this in a museum context is also to minimize the level of distraction while observing the artifact. Shifting the gaze to the side of the projection space (B) slides the action palette into view. The button that is currently focused on becomes immediately active (D), signaling the change of mode by displaying the focus ring and changing color. This is a significant difference compared to the Eye Cons design, which combines both location- and time-based mechanisms to initiate action. Moving the gaze back to the object leads to the execution of the specified action (selection, moving, etc.). Figure 6 illustrates the outcome of choosing the 'zoom' action from the palette. The eye-guided cursor becomes a magnifying glass, allowing close inspection of the artifact.

Figure 6. After choosing the desired action (see Figure 5), returning the gaze to the object executes the action without delay. The detail shows the 'zoom-in' tool, which becomes 'tied' to the viewer's gaze and allows close inspection of the artifact.
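The palette logic just described might be sketched as follows. The region names, button labels, and the convention of returning the action to execute are my assumptions, not the author's implementation:

```python
# Sketch of the sliding action palette logic described above. The palette
# regions, button hit-testing and action names are illustrative assumptions.

class ActionPalette:
    def __init__(self, actions=("zoom", "rotate", "no action")):
        self.actions = actions
        self.visible = False
        self.armed = None  # action selected by fixating its button

    def on_gaze(self, region, button=None):
        """region: 'object', 'palette' or 'elsewhere'; button: the palette
        button currently under the gaze, if any. Returns an action name
        when one should be executed, otherwise None."""
        if region == "palette":
            self.visible = True                # palette slides into view
            if button == "no action":
                self.armed = None              # back to pure observation
            elif button is not None:
                self.armed = button            # button is active immediately
        elif region == "object":
            self.visible = False               # palette slides out of view
            if self.armed:
                return self.armed              # execute without delay
        else:
            self.visible = False
        return None
```

Note that, in line with the design above, selecting a button involves no dwell time at all: the fixated button arms immediately, and the action fires the moment the gaze returns to the object.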
One can conceptually expand location-based interactions by introducing the concept of an active surface. Buttons can be viewed as essentially single-action locations (switches). It really does not matter which part of the button one is focusing on (or physically pressing) - the outcome is always the same. In contrast, a surface affords assigning meaning to a series of locations (fixations) and makes possible the incremental manipulation of an object. Figure 7 provides an example of a surface-based interaction mechanism. Interactive surfaces are discretely marked on the area surrounding the object. For the purpose of illustration, a viewer's scan path (A) is shown superimposed over the object; it indicates gaze movement towards the interactive surface. Entering the active area is marked by the appearance of a cursor in a shape that is indicative of the possible action (D). The appearance of the cursor is followed by a brief latency period (200-300 ms) during which the viewer can return to observation mode by moving the gaze outside of the active area. If the focus remains in the active area (see Figure 8), any movement of the cursor along the longest axis of the area will be incrementally mapped onto an action sequence - in this case, rotation of the object.
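A minimal sketch of this incremental mapping follows, assuming a horizontal active surface and a 0-360° rotation range. Both are assumptions; the paper specifies only the 200-300 ms latency period and the idea of mapping position along the longest axis onto an action sequence.

```python
# Sketch of the 'active surface' mapping described above: gaze position
# along the surface's long axis is mapped incrementally onto object
# rotation. The latency value follows the text; everything else is assumed.

LATENCY_MS = 250  # grace period during which looking away cancels the mode

def rotation_for_gaze(x, surface_left, surface_width,
                      min_angle=0.0, max_angle=360.0):
    """Map a horizontal gaze position on the active surface to an angle."""
    fraction = (x - surface_left) / surface_width
    fraction = max(0.0, min(1.0, fraction))  # clamp to the surface extent
    return min_angle + fraction * (max_angle - min_angle)

# Usage: once the gaze has stayed in the active area past LATENCY_MS,
# feed each new fixation's x through rotation_for_gaze and apply the
# returned angle to the displayed artifact.
print(rotation_for_gaze(400, surface_left=100, surface_width=600))  # 180.0
```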