Vision-Based Image Retrieval (VBIR) - A New Eye-Tracking Based Approach to Efficient and Intuitive Image Retrieval

Dissertation submitted for the degree of Doctor of Natural Sciences by Kai Essig, Faculty of Technology (Technische Fakultät), Bielefeld University, September 2007

This research was conducted within the Neuroinformatics Group and the Collaborative Research Center 360 “Situated Artificial Communicators” (SFB 360, Unit B4) at Bielefeld University. I wish to acknowledge an immeasurable debt of gratitude to a few individuals who supported and encouraged me during the writing of this thesis: First, I want to thank Helge Ritter for the supervision and support of my work. His ideas and enthusiasm were a valuable source of inspiration for this thesis. Furthermore, his optimism has always been a powerful motivating force throughout my work. Second, I am grateful to Marc Pomplun from the University of Massachusetts at Boston for his valuable feedback and for his willingness to participate in the PhD committee. He was the one who introduced me to the fascinating research field of eye tracking. Furthermore, I wish to express my thanks to the “Trekkies”: Elena Carbone, Sonja Folker and Lorenz “Max” Sichelschmidt. You all gave me helpful comments, suggestions and advice over the last years. It is a great pleasure for me to work together with you in this interdisciplinary team. Prof. Arbee L.-P. Chen (National Hsing Hua University in HsinChu, Taiwan) and Prof. Chiou-Shann Fuh (National Taiwan University in Taipei, Taiwan) introduced me to the field of Content-Based Image Retrieval and provided me with a rich and pleasant research environment during my stay in Taiwan. Sven Pohl’s permanent willingness to help and his continuous improvement of the VDesigner were a great help in the realisation of a complex image retrieval experiment embedded in an eye-tracking environment.
More than that, Sven became a very good friend of mine during the years we worked together in the eye-tracking group. Andreas Hüwel had a hand in the SOM visualisations of the high-dimensional feature vectors. Gunther Heidemann, Tim Nattkemper, Jörg Ontrop and Thies Pfeiffer gave me valuable feedback. Tanja Kämpfe, Dirk Selle, Frank Schütte and Jörg Walter supported me by proofreading this thesis, making it readable as well as understandable. I am also much obliged to Petra Udelhoven for her assistance in handling the manifold demands of bureaucracy. Furthermore, I thank all participants for taking part in my experiments. Moreover, I would like to express my gratitude to the staff of the child care facilities “Kita EffHa” and “Bonhöffer Tageseinrichtung der Ev. Dietrich-Bonhöffer Kirchengemeinde” in Bielefeld. Finally, I am grateful to Li-Ying for her longstanding support, patience and confidence over the last years, and to my daughter Vivienne for enriching my life. This research was funded by grants of the “Evangelisches Studienwerk e.V. Villigst” and the Graduate and Postdoctoral Programme “Strategies & Optimisation of Behaviour”.

Bielefeld, 30th August 2007
Kai Essig

Abstract

The advances in digitalisation technology demand new techniques for the retrieval of relevant images from large image databases, which led to the foundation of a new research area called Content-Based Image Retrieval (CBIR) in 1992. CBIR describes a set of techniques to retrieve relevant images quickly and reliably from large image repositories on the basis of automatically derived image features. Research on CBIR focuses on the investigation of new image features and distance functions suitable for the retrieval task at hand, and on the optimal integration of the user into the retrieval process by providing intelligent input options (user-relevance feedback).
Recent findings indicate that a retrieval system, in order to be generally accepted by users and to be applicable to various image domains, requires not only features and distance functions that are consistent with the human perception of similarity, but also sophisticated and natural human-machine interfaces that optimally integrate the user into the retrieval process. This PhD thesis documents a new approach to image retrieval, called Vision-Based Image Retrieval (VBIR), which uses an eye tracker as a natural and elegant method for user-relevance feedback. Eye tracking denotes the process of monitoring and recording participants’ gaze positions during the observation of stimuli presented on a computer screen. When humans compare images, they focus on specific image regions and check them for similarities and differences. Thus, semantically important image regions receive much attention, manifested in a higher number of fixations and longer fixation durations. The central hypothesis is that the retrieval process can be improved substantially by increasing the weights for the features of semantically important image regions. This hypothesis was investigated by comparing the performance of the new eye-tracking based image-retrieval approach (VBIR) with that of a classic CBIR approach. The results revealed not only a higher retrieval performance for the VBIR system, but also a higher correlation of the system’s retrieval results with human measures of similarity. Before the experiment could be performed, not only suitable colour, shape and texture features had to be found, but also an optimal weighting scheme had to be determined. The suitability of the chosen image features for the retrieval experiments in this work was evaluated with the self-organizing map (SOM) and the result viewer.
The outcome shows that images with similar feature vectors are clustered together, whereby the number of outliers for the shape and texture features was higher than for the colour feature. To determine the optimal weighting scheme for the single image features, the Shannon entropy was calculated from the feature distance histograms. The optimal feature weight combination was the one with the highest Shannon entropy value; it was found to be 41%, 33% and 26% for colour, shape and texture, respectively. These findings are in accordance with the overall impression that colour plays a dominant role in the discrimination of flower images. In order to test the CBIR and VBIR approaches on a representative set of queries, the maximum number of retrieval steps for each query image was limited. A second experiment was designed for the evaluation of the retrieval results of both approaches, especially in cases where the corresponding query was not retrieved. In this experiment, participants ranked the similarity of the retrieval results of both approaches to the corresponding query images according to their overall subjective impression. The empirical findings show significantly higher similarity values for the retrieval results of the VBIR approach. Furthermore, participants’ subjective similarity estimations correspond to objectively calculated feature distances, i.e., high similarity values correlate with small feature distances. The empirical findings then led to the development of computational models for image retrieval. Altogether, five models were implemented in this thesis: Two of these models, CBIR and CBIR MLP, apply the pre-calculated global image features. The main purpose behind the CBIR models is to closely simulate the human process of selecting the most similar image from a set of six retrieved database images.
In the case of CBIR, the selection of the image most similar to the query is based on the corresponding feature distances, whereas in the CBIR MLP approach, this selection is modelled by a multi-layer perceptron (MLP), trained on participants’ similarity estimations from the second experiment. The three VBIR models are based on pre-calculated tile-based image features. The models differ in regard to the weighting schemes for the single image tiles. The main purpose behind the VBIR models is to closely simulate the retrieval of similar images from the database. The results revealed the best overall performance for the VBIR models in finding the queries of all start configurations, which is in accordance with the results of the empirical experiments. The CBIR and CBIR MLP models, on the other hand, provided results which do not conform to the outcome of the retrieval experiment: Both CBIR models do not adequately simulate humans’ similarity decisions. In another model, three (bottom-up) saliency maps (i.e., colour, intensity and orientation) were calculated from the flower images. The overall saliency map resulted as a weighted combination of the different conspicuity maps. The model computes the correlation between the overall saliency map and the human fixation map, calculated from participants’ eye movements recorded in the retrieval experiment, for a set of different weight combinations. The results revealed that weight combinations in the ranges of 70%-100% for colour, 10%-30% for intensity and 0%-20% for orientation most closely resembled the human attention distribution. All in all, the results of the models yield further support for the suitability of an attention-based approach to image retrieval and the adequateness of an eye tracker as a natural source for human relevance feedback.

Contents

Abstract . . . ii
Table of Contents . . .
iv

1 Introduction 1
1.1 The Need for New and Intuitive Image Retrieval Techniques . . . 1
1.2 Content-Based Image Retrieval (CBIR) . . . 3
1.3 RBIR and CBsIR . . . 36
1.4 Challenges for CBIR Systems . . . 40
1.5 A New Approach: Vision-Based Image Retrieval (VBIR) . . . 42
2 Visual Information Processing 43
2.1 The Human Eye . . . 43
2.2 The Visual System . . . 45
2.3 Selective Visual Attention . . . 46
2.4 The Indispensability of Eye Movements . . . 47
2.5 Colour Perception . . . 50
2.6 Shape Perception . . . 53
2.7 Texture Perception . . . 55
2.8 Object Perception . . . 57
3 Eye Movements, Eye Tracking and Applied Software 61
3.1 Eye Movements . . . 61
3.2 Eye Movements During the Perception of Natural Scenes . . . 64
3.3 Eye Tracking . . . 76
3.4 Software Development in the Eye-tracking Group . . . 81
3.5 VDesigner . . . 82
3.6 EyeDataAnalyser (EDA) . . . 86
4 Vision-Based Image Retrieval (VBIR) 89
4.1 Motivation . . . 89
4.2 Eye Tracker as an Input Medium . . . 91
4.3 Feedback through Eye Movements . . . 92
4.4 The Vision-Based Image Retrieval (VBIR) Approach . . . 95
4.5 Implementation of the Retrieval Systems (CBIR and VBIR) . . . 99
4.6 Optimal Feature Weight Estimation . . . 113
5 Feature Evaluation 119
5.1 Motivation . . . 119
5.2 Methods for the Analysis of High-Dimensional Feature Vectors . . . 119
5.3 Evaluation of the Global and Tile-Based Image Features . . . 121
5.4 Conclusions . . . 135
6 Image Retrieval Experiments 137
6.1 Experiment I: CBIR versus VBIR . . . 137
6.2 Experiment II: Evaluation of the Retrieval Results . . . 152
7 Computer Models of Image Retrieval 159
7.1 The Motivation of Computer Simulations . . . 159
7.2 Computer Simulation I: The Image Retrieval Models . . . 160
7.3 Computer Simulations II: Saliency Map Model . . . 181
8 Conclusions and Outlook 195
A 221
A.1 Query Images . . . 222
A.2 Attention Values for Query Images . . . 223
A.3 Set 1 . . . 224
A.4 Set 2 . . . 225
A.5 Set 3 . . . 226
A.6 Set 4 . . . 227
A.7 Set 5 . . . 228
A.8 Set 6 . . . 229
A.9 Set 7 . . . 230
A.10 Set 8 . . . 231

Chapter 1

Introduction

1.1 The Need for New and Intuitive Image Retrieval Techniques

With the spread of fast computers and high-speed network connections, the digital acquisition of information has become more and more widespread in recent years. Supported by the progress in digitalisation, the steady growth of computer power, declining costs for storage capacities and easier access to the Internet, the amount of available images increases every day. Compared to analog formats, digitalised information can be conveniently saved on portable storage devices or on server databases, where it can be easily downloaded, shared and distributed. Whereas the indexing and retrieval of text documents has been a research area for a long time and many sophisticated solutions have already been proposed (Baeza-Yates & Ribeiro-Neto, 1999; Salton & McGill, 1988), the automatic retrieval of images according to image content is still a challenge. The complexity of image retrieval stems mainly from two problems: I.) it is difficult to find features that suitably describe image content, and II.) Computer Vision still lacks techniques to understand image semantics. The application of text retrieval techniques to image retrieval is cumbersome, because the manual annotation and indexing of images by humans is subjective and very time consuming. Additionally, techniques successfully applied to text documents are not suitable for image retrieval. Hence, automatic image retrieval requires the design of new and more sophisticated techniques that differ from those applied to text retrieval. As a consequence, so-called Content-Based Image Retrieval (CBIR) systems were developed. In CBIR systems, each image is represented as a vector in a high-dimensional feature space.
The user typically provides a query image, and the system automatically returns those images from the database that are most similar to the query image in the feature space. CBIR techniques are useful whenever information has to be automatically retrieved from large image repositories, like medical image databases, news archives of news agencies or broadcasting stations, or digital archives in museums or education. Whereas CBIR systems use low-level features (like colour, shape and texture) for image retrieval, humans interpret image contents semantically (Rui, Huang, Mehrotra & Ortega, 1999). Because image semantics cannot be suitably described by primitive image features, modern CBIR systems generally integrate the user into the retrieval process to provide so-called user-relevance feedback (e.g., by ranking the retrieved images) in order to improve the system’s performance and to overcome the semantic gap and the subjectivity of human perception. The semantic gap is the lack of coincidence between the information that one can extract from the visual data and the meaning that the same data have for a user in a given situation; it is closely related to the sensory gap, the gap between an object in the world and the information in a (computational) description derived from a recording of that scene (Smeulders et al., 2000). Through relevance feedback, the system tries to narrow the search space and to retrieve images that are semantically better correlated with the users’ retrieval needs. Unfortunately, the automatic mapping of such semantic descriptions to low-level features is very challenging, if not impossible for complicated images. Furthermore, different persons (or the same person under different circumstances) may perceive the same visual content differently. And finally, providing relevance feedback is quite tedious for the user, since he/she has to rate all the result images for each retrieval step.
Hence, after a while, user-relevance feedback is often provided only occasionally or not at all. Automatic retrieval based on primitive image features and user-relevance feedback through keyboard or mouse input is therefore not a promising way to overcome the limitations of CBIR systems described above. One reason is that there are no suitable techniques to automatically relate the users’ similarity ratings to the semantic content of the image. Furthermore, feedback through mouse or keyboard does not provide a natural and intuitive interface between the system and the user. Thus, users are more discouraged than delighted to use CBIR software for long retrieval sessions. In order to provide a convenient and easy-to-use interface for relevance feedback, this thesis presents an alternative approach to image retrieval, called Vision-Based Image Retrieval (VBIR). This approach uses an eye tracker as a novel and natural source of user-relevance feedback. An eye tracker measures and records the eye movements of participants looking at images. By online analysis of eye movements during image retrieval, the retrieval system can be guided to focus on important image areas (i.e., regions with a high number of fixations), so that retrieval accuracy can be improved. In order to provide a better understanding of the link between Content-Based Image Retrieval (CBIR) and eye tracking, the two research areas are first described in more detail in this work. We start by addressing Content-Based Image Retrieval in the next section.

1.2 Content-Based Image Retrieval (CBIR)

We all know the popular saying “A picture is worth a thousand words”. But in practice, a picture is worth far less than a thousand words if it cannot be found. This is where Content-Based Image Retrieval (CBIR) comes into play.
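The gaze-guided weighting idea behind VBIR, sketched in the previous section, can be illustrated in a few lines: fixation counts per image tile are turned into weights, which then emphasise the attended tiles in a tile-wise feature distance. The tile features and fixation counts below are hypothetical scalar toy values; the actual weighting schemes used by the VBIR system are developed in Chapter 4.

```python
def fixation_weights(fixations_per_tile):
    # Normalise per-tile fixation counts into weights summing to 1;
    # tiles that attracted more fixations receive higher weights.
    total = sum(fixations_per_tile)
    if total == 0:
        n = len(fixations_per_tile)
        return [1.0 / n] * n          # no gaze data: weight tiles uniformly
    return [f / total for f in fixations_per_tile]

def weighted_distance(query_tiles, image_tiles, weights):
    # Tile-wise feature distance, emphasising the attended tiles.
    return sum(w * abs(q - t)
               for w, q, t in zip(weights, query_tiles, image_tiles))

# Four image tiles with scalar toy features and recorded fixation counts.
query_tiles = [0.2, 0.8, 0.5, 0.1]
image_tiles = [0.2, 0.3, 0.5, 0.9]
weights = fixation_weights([6, 10, 2, 2])   # tile 2 was inspected most
print(weighted_distance(query_tiles, image_tiles, weights))
```

Note how the large mismatch in the rarely fixated last tile contributes little to the distance, while the mismatch in the heavily fixated second tile dominates it.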
1.2.1 Motivation

There is a huge demand for sophisticated CBIR systems because more and more visual data are digitalised as a result of the progress in digitalisation techniques. For example, museum collections need reliable indexing and retrieval techniques, while the number of users going online to use those resources is steadily increasing. Broadcasting stations receive a high number of new images every day. At the British Broadcasting Corporation, for example, 750,000 hours of news material still have to be archived. Thirty employees are needed to catalogue new material so that the archive can answer the 2,000 to 3,000 requests every week (Sietmann, 2003). The advantage of the digitalised form over the traditional one is that the data can not only be stored locally but also be conveniently distributed (for example via the Internet or on CD or DVD). CBIR was established as a research area in 1992 with the USNSF (US National Science Foundation) workshop in Redwood, California. The aim of the workshop was to identify “major research areas that should be addressed by researchers for visual information management systems that would be useful in scientific, industrial, medical, environmental, educational, entertainment, and other applications” (Smeulders et al., 2000, p. 1). Kato, Kurita, Otsu and Hirata (1992) were probably the first to use the term Content-Based Image Retrieval to describe their experiments on the automatic retrieval of images from a database by colour and shape. Since then, the term CBIR has denoted the process of retrieving desired images from a large collection based on features (mostly colour, shape and texture) that can be automatically extracted from the images. These features can be primitive or semantic. Primitive features are low-level features like object perimeter, colour histogram and so on.
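A colour histogram, one of the primitive features just mentioned, can be computed in a few lines. This is a deliberately simplified single-channel (greyscale) sketch with hypothetical pixel values; real systems bin all three colour channels, typically with far more bins.

```python
def colour_histogram(pixels, bins=4, max_value=256):
    # Count how many pixel intensities fall into each of `bins` equally
    # wide intervals, then normalise by the pixel count so that histograms
    # of differently sized images remain comparable.
    counts = [0] * bins
    width = max_value / bins
    for p in pixels:
        counts[min(int(p / width), bins - 1)] += 1
    return [c / len(pixels) for c in counts]

# Toy 8-pixel greyscale "image" (hypothetical intensity values).
pixels = [10, 20, 70, 80, 130, 140, 200, 250]
print(colour_histogram(pixels))   # one normalised count per intensity bin
```

Because the histogram discards all spatial information, it is cheap to compute and compare, which is exactly what makes it a primitive rather than a semantic feature.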
Semantic features, on the other hand, refer to semantically meaningful information inside an image, for example the identity of a person in an image. Whereas many suitable low-level features already exist, the derivation of semantic meaning from an image is still a huge challenge for existing systems. Despite all the progress made, those systems mostly lack feature extraction and retrieval techniques that closely match human needs. Even though the first long-term projects have been started to analyse user behaviour in image retrieval, we still do not know enough to develop sophisticated CBIR programs that are suited to human retrieval needs (Eakins & Graham, 1999). There is also a lack of adequate visual query formulation and refinement interfaces for CBIR systems (Venters, Eakins & Hartley, 1997), which is a barrier to effective image retrieval. A few commercial and some research prototype systems using different features and retrieval techniques are already available, showing quite impressive retrieval results for