Simone Marinai and Hiromichi Fujisawa (Eds.)
Machine Learning in Document Analysis and Recognition

Studies in Computational Intelligence, Volume 90

Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 66. Lakhmi C. Jain, Vasile Palade and Dipti Srinivasan (Eds.)
Advances in Evolutionary Computing for System Design, 2007
ISBN 978-3-540-72376-9

Vol. 67. Vassilis G. Kaburlasos and Gerhard X. Ritter (Eds.)
Computational Intelligence Based on Lattice Theory, 2007
ISBN 978-3-540-72686-9

Vol. 68. Cipriano Galindo, Juan-Antonio Fernández-Madrigal and Javier Gonzalez
A Multi-Hierarchical Symbolic Model of the Environment for Improving Mobile Robot Operation, 2007
ISBN 978-3-540-72688-3

Vol. 69. Falko Dressler and Iacopo Carreras (Eds.)
Advances in Biologically Inspired Information Systems: Models, Methods, and Tools, 2007
ISBN 978-3-540-72692-0

Vol. 70. Javaan Singh Chahl, Lakhmi C. Jain, Akiko Mizutani and Mika Sato-Ilic (Eds.)
Innovations in Intelligent Machines-1, 2007
ISBN 978-3-540-72695-1

Vol. 71. Norio Baba, Lakhmi C. Jain and Hisashi Handa (Eds.)
Advanced Intelligent Paradigms in Computer Games, 2007
ISBN 978-3-540-72704-0

Vol. 72. Raymond S.T. Lee and Vincenzo Loia (Eds.)
Computation Intelligence for Agent-based Systems, 2007
ISBN 978-3-540-73175-7

Vol. 73. Petra Perner (Ed.)
Case-Based Reasoning on Images and Signals, 2008
ISBN 978-3-540-73178-8

Vol. 74. Robert Schaefer
Foundation of Global Genetic Optimization, 2007
ISBN 978-3-540-73191-7

Vol. 75. Crina Grosan, Ajith Abraham and Hisao Ishibuchi (Eds.)
Hybrid Evolutionary Algorithms, 2007
ISBN 978-3-540-73296-9

Vol. 76. Subhas Chandra Mukhopadhyay and Gourab Sen Gupta (Eds.)
Autonomous Robots and Agents, 2007
ISBN 978-3-540-73423-9

Vol. 77. Barbara Hammer and Pascal Hitzler (Eds.)
Perspectives of Neural-Symbolic Integration, 2007
ISBN 978-3-540-73953-1

Vol. 78. Costin Badica and Marcin Paprzycki (Eds.)
Intelligent and Distributed Computing, 2008
ISBN 978-3-540-74929-5

Vol. 79. Xing Cai and T.-C. Jim Yeh (Eds.)
Quantitative Information Fusion for Hydrological Sciences, 2008
ISBN 978-3-540-75383-4

Vol. 80. Joachim Diederich
Rule Extraction from Support Vector Machines, 2008
ISBN 978-3-540-75389-6

Vol. 81. K. Sridharan
Robotic Exploration and Landmark Determination, 2008
ISBN 978-3-540-75393-3

Vol. 82. Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.)
Engineering Evolutionary Intelligent Systems, 2008
ISBN 978-3-540-75395-7

Vol. 83. Bhanu Prasad and S.R.M. Prasanna (Eds.)
Speech, Audio, Image and Biomedical Signal Processing using Neural Networks, 2008
ISBN 978-3-540-75397-1

Vol. 84. Marek R. Ogiela and Ryszard Tadeusiewicz
Modern Computational Intelligence Methods for the Interpretation of Medical Images, 2008
ISBN 978-3-540-75399-5

Vol. 85. Arpad Kelemen, Ajith Abraham and Yulan Liang (Eds.)
Computational Intelligence in Medical Informatics, 2008
ISBN 978-3-540-75766-5

Vol. 86. Zbigniew Les and Mogdalena Les
Shape Understanding Systems, 2008
ISBN 978-3-540-75768-9

Vol. 87. Yuri Avramenko and Andrzej Kraslawski
Case Based Design, 2008
ISBN 978-3-540-75705-4

Vol. 88. Tina Yu, David Davis, Cem Baydar and Rajkumar Roy (Eds.)
Evolutionary Computation in Practice, 2008
ISBN 978-3-540-75770-2

Vol. 89. Ito Takayuki, Hattori Hiromitsu, Zhang Minjie and Matsuo Tokuro (Eds.)
Rational, Robust, Secure, 2008
ISBN 978-3-540-76281-2

Vol. 90. Simone Marinai and Hiromichi Fujisawa (Eds.)
Machine Learning in Document Analysis and Recognition, 2008
ISBN 978-3-540-76279-9

Simone Marinai, Hiromichi Fujisawa (Eds.)
Machine Learning in Document Analysis and Recognition

With 142 Figures and 41 Tables

Simone Marinai
Dipartimento di Sistemi e Informatica
University of Florence
Via S. Marta, 3
50139 Firenze
Italy
[email protected]

Hiromichi Fujisawa
Hitachi Central Research Laboratory
1-280, Higashi-Koigakubo
Kokubunji-shi, Tokyo 185-8601
Japan
[email protected]

ISBN 978-3-540-76279-9
e-ISBN 978-3-540-76280-5
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2007939058

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: Deblik, Berlin, Germany
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com

Preface

The objective of Document Analysis and Recognition (DAR) is to recognize the text and graphical components of a document and to extract information. With first papers dating back to the 1960's, DAR is a mature but still growing research field with consolidated and known techniques. Optical Character Recognition (OCR) engines are some of the most widely recognized products of the research in this field, while broader DAR techniques are nowadays studied and applied to other industrial and office automation systems.

In the machine learning community, one of the most widely known research problems addressed in DAR is the recognition of unconstrained handwritten characters, which has frequently been used in the past as a benchmark for evaluating machine learning algorithms, especially supervised classifiers.

However, developing a DAR system is a complex engineering task that involves the integration of multiple techniques into an organic framework. A reader may feel that machine learning algorithms are not appropriate for DAR tasks other than character recognition. On the contrary, such algorithms have been used extensively for nearly all the tasks in DAR. Although much of the emphasis has been on character recognition and word recognition, other tasks such as pre-processing, layout analysis, character segmentation, and signature verification have also benefited greatly from machine learning algorithms.

This book is a collection of research papers and state-of-the-art reviews by leading researchers from all over the world, including pointers to challenges and opportunities for future research directions. The main goals of the book are to identify good practices for the use of learning strategies in DAR, to identify the DAR tasks that are most appropriate for these techniques, and to highlight new learning algorithms that may be successfully applied to DAR.

Depending on the reader's interests, there are several paths that can be followed when reading the chapters of the book. We therefore avoided grouping the chapters into sections; instead, we provide an in-depth introduction to the field and to the book's contents in the first chapter.

It is our hope that this book will help readers identify the current status of the use of machine learning techniques in DAR. Moreover, we expect that it can help stimulate new ideas, new collaborations and new research activities in this research arena.

We wish to express our warmest thanks to the Authors, without whose interesting work this book would not have materialized.

Simone Marinai, Hiromichi Fujisawa
July 2007

Contents

Introduction to Document Analysis and Recognition
Simone Marinai

Structure Extraction in Printed Documents Using Neural Approaches
Abdel Belaïd and Yves Rangoni

Machine Learning for Reading Order Detection in Document Image Understanding
Donato Malerba, Michelangelo Ceci, and Margherita Berardi

Decision-Based Specification and Comparison of Table Recognition Algorithms
Richard Zanibbi, Dorothea Blostein and James R. Cordy

Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction
Floriana Esposito, Stefano Ferilli, Teresa M.A. Basile, and Nicola Di Mauro

Classification and Learning Methods for Character Recognition: Advances and Remaining Problems
Cheng-Lin Liu and Hiromichi Fujisawa

Combining Classifiers with Informational Confidence
Stefan Jaeger, Huanfeng Ma, and David Doermann

Self-Organizing Maps for Clustering in Document Image Analysis
Simone Marinai, Emanuele Marino, and Giovanni Soda

Adaptive and Interactive Approaches to Document Analysis
George Nagy and Sriharsha Veeramachaneni

Cursive Character Segmentation Using Neural Network Techniques
Michael Blumenstein

Multiple Hypotheses Document Analysis
Tatsuhiko Kagehiro and Hiromichi Fujisawa

Learning Matching Score Dependencies for Classifier Combination
Sergey Tulyakov and Venu Govindaraju

Perturbation Models for Generating Synthetic Training Data in Handwriting Recognition
Tamás Varga and Horst Bunke

Review of Classifier Combination Methods
Sergey Tulyakov, Stefan Jaeger, Venu Govindaraju, and David Doermann

Machine Learning for Signature Verification
Sargur N. Srihari, Harish Srinivasan, Siyuan Chen, and Matthew J. Beal

Off-line Writer Identification and Verification Using Gaussian Mixture Models
Andreas Schlapbach and Horst Bunke

Index

Contributors

Teresa M.A. Basile
Università di Bari
Dipartimento di Informatica
Via Orabona 4
70126 Bari - Italy
[email protected]

Matthew J. Beal
Department of Computer Science and Engineering,
University at Buffalo,
The State University of New York,
Buffalo NY, USA
[email protected]

Abdel Belaid
Université Nancy 2
LORIA UMR 7503
France
[email protected]

Margherita Berardi
Università di Bari
Dipartimento di Informatica
Via Orabona 4
70126 Bari - Italy
[email protected]

Dorothea Blostein
School of Computing
Queen's University
Kingston, Ontario,
Canada, K7L 3N6
[email protected]

Michael Blumenstein
School of Information and Communication Technology
Gold Coast campus
Griffith University QLD 4222
Australia
[email protected]

Horst Bunke
University of Bern
Institute of Computer Science and Applied Mathematics (IAM)
Neubrückstrasse 10
CH-3012 Bern, Switzerland
[email protected]

Michelangelo Ceci
Università di Bari
Dipartimento di Informatica
Via Orabona 4
70126 Bari - Italy
[email protected]

Siyuan Chen
Center of Excellence for Document Analysis and Recognition (CEDAR),
Buffalo NY, USA
[email protected]

James R. Cordy
School of Computing
Queen's University
Kingston, Ontario,
Canada, K7L 3N6
[email protected]

Nicola Di Mauro
Università di Bari
Dipartimento di Informatica
Via Orabona 4
70126 Bari - Italy
[email protected]

David Doermann
Laboratory for Language and Media Processing
Institute for Advanced Computer Studies
3451 AV Williams Building
University of Maryland
College Park, Maryland 20742
[email protected]

Floriana Esposito
Università di Bari
Dipartimento di Informatica
Via Orabona 4
70126 Bari - Italy
[email protected]

Stefano Ferilli
Università di Bari
Dipartimento di Informatica
Via Orabona 4
70126 Bari - Italy
[email protected]

Hiromichi Fujisawa
Central Research Laboratory,
Hitachi, Ltd.
1-280 Higashi-koigakubo
Kokubunji, Tokyo 185-8601
Japan
[email protected]

Venu Govindaraju
University at Buffalo
Dept. of Computer Science and Engineering
520 Lee Entrance, Suite 202, UB Commons
Amherst, NY 14228-2567
[email protected]

Tatsuhiko Kagehiro
Central Research Laboratory,
Hitachi, Ltd.
1-280 Higashi-koigakubo
Kokubunji, Tokyo 185-8601
Japan
tatsuhiko.kagehiro.[email protected]

Stefan Jaeger
Institute for Advanced Computer Studies,
University of Maryland, College Park,
MD 20742, USA
[email protected]

Huanfeng Ma
Institute for Advanced Computer Studies,
University of Maryland, College Park,
MD 20742, USA
[email protected]

Cheng-Lin Liu
Institute of Automation,
Chinese Academy of Sciences
Beijing 100080, P.R. China
[email protected]

Donato Malerba
Università di Bari
Dipartimento di Informatica
Via Orabona 4
70126 Bari - Italy
[email protected]