Interactive Multimodal Information Management

Edited by Hervé Bourlard and Andrei Popescu-Belis

With contributions by: Aude Billard, Hervé Bourlard, Barbara Caputo, Andrzej Drygajlo, Touradj Ebrahimi, Martina Fellay, Marc Ferràs, François Foglia, Gerald Friedland, Daniel Gatica-Perez, Luc Van Gool, Denis Lalanne, Agnes Lisowska Masson, Marcus Liwicki, Mathew Magimai.-Doss, Sébastien Marcel, Stéphane Marchand-Maillet, Kaspar Meuli, Fabian Nater, Basilio Noris, Jean-Marc Odobez, Andrei Popescu-Belis, Thierry Pun, Steve Renals, Maurizio Rigamonti, Jürgen Sauer, Francesca De Simone, Andreas Sonderegger, Matteo Sorci, Jean-Philippe Thiran, Tatiana Tommasi, Alessandro Vinciarelli, Chuck Wooters, Anil Yüce

EPFL Press
A Swiss academic publisher distributed by CRC Press

© 2014 by EPFL Press

EPFL Press
Presses polytechniques et universitaires romandes, EPFL
Post office box 119, CH-1015 Lausanne, Switzerland
E-Mail: [email protected], Phone: 021/693 21 30, Fax: 021/693 40 27

Taylor and Francis Group, LLC
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487
Distribution and Customer Service
[email protected]

EPFL Press is an imprint owned by Presses polytechniques et universitaires romandes, a Swiss academic publishing company whose main purpose is to publish the teaching and research works of the Ecole polytechnique fédérale de Lausanne.

Version Date: 20140225
International Standard Book Number-13: 978-1-4822-1213-6 (eBook - PDF)

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form (by photoprint, microfilm, or any other means) nor transmitted or translated into a machine language without written permission from the publisher.
The authors and publishers express their thanks to the Ecole polytechnique fédérale de Lausanne (EPFL) for its generous support towards the publication of this book.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

1 Interactive Multimodal Information Management: Shaping the Vision 1
  1.1 Meeting capture, analysis and access . . . 2
    1.1.1 Development of meeting support technology . . . 2
    1.1.2 Scenario and context . . . 3
    1.1.3 Smart meeting rooms . . . 4
    1.1.4 Data: multimodal signals and their annotations . . . 7
  1.2 The IM2 Swiss National Center of Competence in Research . . . 8
    1.2.1 History of IM2 . . . 9
    1.2.2 Size and management of IM2 . . . 10
    1.2.3 Structure of IM2 . . . 11
  1.3 Related international projects and consortia . . . 13

Human-Computer Interaction and Human Factors 19

2 Human Factors in Multimodal Information Management 21
  2.1 Role of human factors . . . 21
  2.2 Prominent research topics in human factors . . . 22
    2.2.1 Automation . . . 22
    2.2.2 Consumer product design . . . 23
  2.3 Methodological approach in human factors . . . 23
    2.3.1 General approaches . . . 23
    2.3.2 Four-factor framework . . . 24
    2.3.3 Specific approaches used . . . 25
    2.3.4 The cBoard and the EmotiBoard as task environments . . . 26
  2.4 Empirical studies . . . 27
    2.4.1 The utility of the cBoard for co-located work groups . . . 27
    2.4.2 Static mood feedback and distributed teamwork . . . 28
    2.4.3 Dynamic mood feedback and mood priming in teams . . . 29
  2.5 Discussion and implications . . . 30

3 User Attention During Mobile Video Consumption 33
  3.1 Modeling user behavior . . . 34
  3.2 Data acquisition experiment . . . 36
  3.3 Data processing and results . . . 37
  3.4 Conclusions . . . 39

4 Wizard of Oz Evaluations of the Archivus Meeting Browser 43
  4.1 The Archivus meeting browser . . . 43
    4.1.1 Design decisions and process . . . 44
    4.1.2 The Archivus user interface . . . 45
    4.1.3 Implementation . . . 47
  4.2 Multimodal Wizard of Oz evaluation . . . 48
    4.2.1 Adapting Wizard of Oz evaluation to multimodal contexts . . . 48
    4.2.2 Evaluating Archivus . . . 51
    4.2.3 Implications for the interactive systems prototyping methodology and dialogue strategies . . . 51
    4.2.4 Implications for natural language understanding . . . 52
    4.2.5 Implications for modality choice . . . 54
  4.3 Conclusions . . . 56

5 Document-Centric and Multimodal Meeting Assistants 59
  5.1 The Smart Meeting Minutes application . . . 60
  5.2 Document centric meeting browsing . . . 61
  5.3 Cross-meeting and ego-centric browsing . . . 62
  5.4 Multimodal user interfaces prototyping for online meeting assistants . . . 65
  5.5 The Communication Board application . . . 67
  5.6 Conclusion . . . 68

6 Semantic Meeting Browsers and Assistants 71
  6.1 The JFerret framework and browser . . . 72
  6.2 TQB: a transcript-based query and browsing interface . . . 72
  6.3 Evaluation of meeting browsers . . . 74
    6.3.1 Evaluation task, protocol and measures . . . 75
    6.3.2 BET results of several meeting browsers . . . 76
  6.4 Automatic meeting browsers and assistants . . . 78
    6.4.1 The AutoBET . . . 78
    6.4.2 The Automatic Content Linking Device . . . 79
    6.4.3 Evaluation of the ACLD . . . 80
  6.5 Conclusions and perspectives . . . 80

7 Multimedia Information Retrieval 85
  7.1 Introduction . . . 85
    7.1.1 Information retrieval as a complex process . . . 85
    7.1.2 Multimedia versus text IR . . . 86
    7.1.3 The advent of big players in IR . . . 87
  7.2 Multimedia information retrieval: from information to user satisfaction . . . 87
    7.2.1 Image and video retrieval . . . 87
    7.2.2 Cross-modal information processing and retrieval . . . 89
    7.2.3 Information representation . . . 90
    7.2.4 Related problems . . . 92
  7.3 Interaction log mining: from user satisfaction to improved information retrieval . . . 92
    7.3.1 Modeling and analyzing interaction . . . 93
    7.3.2 Semantic learning . . . 94
  7.4 Multimedia information retrieval in a wider context . . . 94

Visual and Multimodal Analysis of Human Appearance and Behavior 99

8 Face Recognition for Biometrics 101
  8.1 Introduction . . . 101
  8.2 Face processing in a nutshell . . . 102
  8.3 From face detection to face recognition . . . 103
    8.3.1 Local Binary Patterns for face detection and recognition . . . 103
    8.3.2 Face binary features for face detection . . . 105
    8.3.3 Multivariate boosting for face analysis . . . 106
  8.4 Statistical generative models for face recognition . . . 107
    8.4.1 Distribution modeling for part-based face recognition . . . 107
    8.4.2 Bayesian Networks for face recognition . . . 108
    8.4.3 Session variability modeling . . . 109
  8.5 Cross-pollination to other problems . . . 109
    8.5.1 Spoofing and anti-spoofing . . . 109
    8.5.2 Cross-pollination from face recognition to speaker recognition . . . 110
    8.5.3 Cross-pollination from face recognition to brainwaves (EEG) processing . . . 110
  8.6 Open data and software . . . 110
  8.7 Conclusion and future work . . . 111

9 Facial Expression Analysis 117
  9.1 Introduction and state-of-the-art . . . 117
  9.2 Recognizing action units . . . 120
  9.3 Modeling human perception of static facial expressions . . . 123
    9.3.1 Data description: the EPFL Facial Expression Perception survey . . . 123
    9.3.2 Features: action units and expression descriptive units . . . 124
    9.3.3 Modeling with discrete choice models . . . 125
    9.3.4 Model specifications . . . 126
    9.3.5 Model validation . . . 128
  9.4 Conclusion . . . 129

10 Software for Automatic Gaze and Face/Object Tracking 133
  10.1 Gaze tracking . . . 133
    10.1.1 Estimating the direction of gaze . . . 134
    10.1.2 Experimental setup . . . 137
    10.1.3 Eye-tracking results and discussion . . . 137
  10.2 Face tracking in real environments . . . 138
    10.2.1 Active-selection based SVM with particle-tracking face detector . . . 139
    10.2.2 Face tracking results . . . 140
  10.3 Application to autism spectrum disorder . . . 141
    10.3.1 Visual behavior of ASD children in semi-naturalistic environments . . . 142
    10.3.2 Results of ASD study . . . 143
    10.3.3 Discussion . . . 143
  10.4 Conclusion . . . 145

11 Learning to Learn New Models of Human Activities in Indoor Settings 149
  11.1 Introduction . . . 149
  11.2 Related work . . . 150
  11.3 Proposed approach . . . 151
  11.4 Activity tracking for unusual event detection . . . 152
  11.5 Knowledge transfer for unusual event learning . . . 154
    11.5.1 Adaptive knowledge transfer . . . 154
    11.5.2 One-versus-all multiclass extension . . . 155
  11.6 Experiments . . . 156
    11.6.1 Dataset and setting . . . 156
    11.6.2 Transfer learning . . . 158
    11.6.3 Activity tracking . . . 160
  11.7 Conclusion . . . 162

12 Nonverbal Behavior Analysis 165
  12.1 Introduction: a brief history of nonverbal behavior research in IM2 . . . 165
  12.2 VFOA recognition for communication analysis in meeting rooms and beyond . . . 168
    12.2.1 Head pose estimation . . . 168
    12.2.2 VFOA recognition in meetings . . . 169
    12.2.3 VFOA recognition for wandering people . . . 171
    12.2.4 Some perspectives on VFOA analysis . . . 173
  12.3 Social signal processing . . . 174
    12.3.1 Role recognition . . . 175
    12.3.2 Automatic personality perception . . . 176
    12.3.3 Conflict detection . . . 177
  12.4 Behavioral analysis of video blogging . . . 179
    12.4.1 Extracting nonverbal communicative cues from vlogs . . . 179
    12.4.2 Characterizing social perception in vlogging . . . 180
    12.4.3 Investigating connections between nonverbal behavior and social perception . . . 181
  12.5 Final remarks . . . 181

13 Multimodal Biometric Person Recognition 189
  13.1 Introduction . . . 189
    13.1.1 Multimodal biometric systems . . . 190
    13.1.2 Quality of biometric data . . . 190
    13.1.3 Reliability of biometric systems . . . 191
  13.2 Biometric classification with quality measures . . . 192
    13.2.1 Q-stack: a systematic framework of classification with quality measures . . . 193
    13.2.2 Performance prediction with quality measures: experimental evaluation . . . 194
  13.3 Modeling reliability with Bayesian networks . . . 197
    13.3.1 Observable evidence for reliability estimation . . . 198
  13.4 A-stack: biometric recognition in the score-age-quality classification space . . . 200
  13.5 Conclusions . . . 202

14 Medical Image Annotation 205
  14.1 Introduction . . . 205
  14.2 Multiple cues for image annotation . . . 207
    14.2.1 High-level integration . . . 207
    14.2.2 Mid-level integration . . . 208
    14.2.3 Low-level integration . . . 208
  14.3 Exploiting the hierarchical structure of data: confidence-based opinion fusion . . . 209
  14.4 Facing the class imbalance problem: virtual examples . . . 209
  14.5 Experiments . . . 210
    14.5.1 Features . . . 210
    14.5.2 Classifier . . . 212
    14.5.3 Experimental setup and results . . . 212
  14.6 Conclusions . . . 215

Speech, Language, and Document Processing 219

15 Speech Processing 221
  15.1 Methods for automatic speech recognition . . . 221
    15.1.1 Hidden Markov model-based approach . . . 222
    15.1.2 Instance-based approach . . . 223
  15.2 Front-end processing of speech . . . 224
    15.2.1 Microphone array based speech processing . . . 224
    15.2.2 Noise-robust feature extraction . . . 225
  15.3 Posterior-based automatic speech recognition . . . 227
    15.3.1 Enhancement of a posteriori probabilities using hierarchical architectures . . . 228
    15.3.2 Multistream combination . . . 230
    15.3.3 MLP feature based ASR . . . 232
    15.3.4 Categorical HMM based ASR . . . 235
    15.3.5 Template-based ASR using posterior features . . . 237
  15.4 The Juicer decoder . . . 240
  15.5 Conclusions . . . 241

16 Research Trends in Speaker Diarization 247
  16.1 Goals and applications of speaker diarization . . . 247
  16.2 A state-of-the-art speaker diarization system . . . 248
    16.2.1 Underlying model . . . 248
    16.2.2 Speaker diarization process . . . 248
  16.3 Research problems in speaker diarization . . . 251
    16.3.1 Impact of data domain on diarization . . . 252
    16.3.2 Diarization using multiple distant microphones . . . 253
    16.3.3 Purification . . . 254
    16.3.4 Automatic estimation of system parameters . . . 254
    16.3.5 Speech/non-speech detection . . . 255
    16.3.6 Error analysis . . . 256
    16.3.7 Speed and accuracy improvements . . . 256
    16.3.8 Combining speaker diarization with localization . . . 258
  16.4 Conclusions and perspectives . . . 258

17 Speaker Diarization of Large Corpora 263
  17.1 Two-stage cross-meeting diarization . . . 263
  17.2 Speaker linking . . . 265
    17.2.1 Speaker cluster modeling . . . 265
    17.2.2 Ward clustering . . . 266
    17.2.3 Cluster dissimilarity . . . 267
    17.2.4 Speaker labeling . . . 268
  17.3 Experimental results . . . 268
  17.4 Conclusions . . . 270

18 Language Processing in Dialogues 273
  18.1 Objectives of language analysis in meetings . . . 273
  18.2 Dialogue acts . . . 275
    18.2.1 Manual annotation of dialogue acts . . . 275
    18.2.2 Automatic recognition of dialogue acts . . . 277
  18.3 Discourse particles . . . 278
  18.4 Thematic episodes and hot spots . . . 279
  18.5 Semantic cross-modal alignment . . . 279
  18.6 Conclusion and perspectives . . . 280

19 Offline Handwriting Recognition 285
  19.1 Introduction . . . 285
  19.2 Offline word recognition . . . 287
  19.3 From word to text recognition . . . 289
    19.3.1 The data . . . 289
    19.3.2 Decoding techniques and language modeling . . . 290
    19.3.3 Experiments . . . 291