
Guide to OCR for Arabic Scripts PDF

592 Pages·2012·19.587 MB·English

Preview Guide to OCR for Arabic Scripts

Guide to OCR for Arabic Scripts
Volker Märgner · Haikal El Abed, Editors

Editors

Volker Märgner
Institute for Communications Technology
Braunschweig Technical University
Braunschweig, Niedersachsen, Germany

Haikal El Abed
Institute for Communications Technology
Braunschweig Technical University
Braunschweig, Niedersachsen, Germany

ISBN 978-1-4471-4071-9
ISBN 978-1-4471-4072-6 (eBook)
DOI 10.1007/978-1-4471-4072-6
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2012941223
© Springer-Verlag London 2012

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

It is with great pleasure that I accepted to write the foreword for this book.

In our society, we use computers increasingly for all kinds of communication. This means that for a language to work well for communication, we need to have computer tools (language technology) for that language. Language technology makes communication more efficient, and it enables many more users to have access to information in various forms.

Arabic is an important language, spoken in a large region of the world, and it is important to support the use of Arabic: in business, in public administration, in private homes, and even in the exchange of information with the rest of the world.

As not all Arabic information is already available in electronic form, one of the basic requirements in such a scenario is the ability of computers to “read” images of written Arabic, be it handwritten or printed, be it online or offline. This kind of technology (OCR, optical character recognition) is difficult to create even for languages in Latin script, and written Arabic adds to the complexity.

This book is a comprehensive guide to the field of Arabic OCR, offering thorough descriptions of the data sets for training and testing, several different OCR methodologies, and the evaluation and assessment thereof.

Before closing, I would like to congratulate the editors on having assembled such an important collection of contributions to a complex problem for which the Arabic world, and the rest of the world, needs solutions.
Copenhagen, Denmark
Bente Maegaard

Preface

Arabic Scripts

The Internet is a source of information which is used by approximately 2 billion users worldwide (www.internetworldstats.com). Two aspects affect the way in which the Internet is used, especially in searching for documents. One is the availability of more and more scanned documents with varying amounts of metadata for searching these documents on the Internet, and the other is the appearance of more and more languages with non-Latin characters. Both aspects show the importance of developing recognition technology for all types of characters and languages to make the content of scanned images of text available to Internet users. A widely used acronym for any type of text recognition is OCR, which means optical character recognition. OCR is used not only for recognizing printed characters; it is often also used for cursive handwriting, even when words instead of single characters are recognized. Some alternative acronyms are used for the case of handwritten words, like HWR (handwritten word recognition), but these are not in common use today.

Since about 200 million people in the world use Arabic as their first language, it is obvious that this large group of Arabic-speaking Internet users has a growing interest in searching for documents in their mother tongue. In parallel, a growing interest in Arabic word and text recognition has been observed in the past few years. During that time two events have been important landmarks in Arabic text recognition technology development.
In 2002 a database on Arabic handwritten words (IFN/ENIT-database) [1] was made available to the community and has served as a reference for competitions since 2005 (ICDAR 2005) [2]. In September 2006 a summit on Arabic and Chinese Handwriting Recognition was held at College Park, MD in the USA (SACH 2006) [3], where experts from both research fields presented their current work. From that time intensive research on Arabic script recognition started and has resulted in a big step forward today.

[1] http://www.ifnenit.com
[2] http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=10526
[3] http://www.umiacs.umd.edu/lamp/meetings/SACH06/

Table 1: Arabic characters (www.ethnologue.com)

Fig. 1: Example of an Arabic printed text

Arabic script is the second most widespread script in the world; it is used not only for Arabic but also for the Persian, Urdu, and Pashto languages, for example. Today 14 languages use Arabic script worldwide, which shows its importance. Characteristics of Arabic script are a writing direction from right to left, characters within a word being mostly connected, 28 characters with different shapes for different positions in a word, and dots and diacritical signs above and below characters. Table 1 shows all the shapes of the 28 Arabic characters.

For different languages some additional characters may be used. Typical for Arabic script is also the variation of a word in length by elongation of the connecting lines between the characters. Figure 1 shows an example.

Table 2: Examples of ligatures

A further important special Arabic script style is the possibility to write characters as vertical or horizontal ligatures. These ligatures modify the shape of the characters significantly. Some examples of ligatures are shown in Table 2. All these typical Arabic script characteristics influence the processing and recognition of Arabic script in different ways and make it clear that a simple adaptation of Latin character-based processing is not possible.
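The positional shapes and ligatures described above can be made concrete with a small sketch. It relies on an assumption that is not part of the preface: it uses the Unicode compatibility code points for Arabic presentation forms, as exposed by Python's standard `unicodedata` module, as a stand-in for the shape tables. The helper name `positional_forms` and the constant `POSITIONS` are hypothetical and chosen only for illustration. A real recognizer works on images, not on code points.

```python
import unicodedata

# Position names used in the Unicode names of Arabic presentation forms.
POSITIONS = ("ISOLATED", "FINAL", "INITIAL", "MEDIAL")


def positional_forms(letter_name):
    """Map each available position to the presentation-form character
    of the Arabic letter with the given Unicode name (e.g. "BEH")."""
    forms = {}
    for position in POSITIONS:
        try:
            forms[position] = unicodedata.lookup(
                f"ARABIC LETTER {letter_name} {position} FORM"
            )
        except KeyError:
            # Some letters do not connect to the following letter and
            # therefore have no initial or medial form in Unicode.
            pass
    return forms


# The abstract letter BEH and its four position-dependent shapes.
beh = unicodedata.lookup("ARABIC LETTER BEH")
beh_forms = positional_forms("BEH")

# ALEF only has isolated and final shapes.
alef_forms = positional_forms("ALEF")

# A ligature is a single glyph standing for two letters; compatibility
# normalization (NFKC) splits it back into its component letters.
lam_alef = unicodedata.lookup("ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM")
lam_alef_parts = unicodedata.normalize("NFKC", lam_alef)

if __name__ == "__main__":
    for position, char in beh_forms.items():
        print(position, f"U+{ord(char):04X}", unicodedata.name(char))
    print([unicodedata.name(c) for c in lam_alef_parts])
    # Bidirectional class "AL" marks a right-to-left Arabic letter.
    print(unicodedata.bidirectional(beh))
```

The sketch illustrates why one letter corresponds to several visual classes: a shape-based recognizer has to map all four forms of BEH to the same letter, and has to treat a ligature as one glyph that stands for two letters.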
This book presents the state of the art of OCR for Arabic scripts, as presented by the most active and successful groups. The parts of the book show that a lot of work still has to be done on Arabic script recognition. But the techniques and algorithms used are of general interest; many problems are typical not only for Arabic but for many other scripts. We believe that the collection of Arabic OCR related work is also an inspiration for other scripts and vice versa.

The book is divided into four parts. Part I, Pre-processing, presents different aspects of pre-processing and feature extraction for Arabic OCR systems. Part II, Recognition, includes chapters with details about different recognition approaches. Part III collects chapters describing the important aspects of how to assess the performance of a recognition system. The final Part IV, Applications, presents system solutions for selected application fields.

Part I: Pre-processing

Part I presents different approaches for the pre-processing of OCR systems for Arabic. It starts with an overview of Arabic handwriting recognition technology. Srihari and Ball present in their chapter the parts of a recognition system from pre-processing to classification. Finally they discuss application fields and challenges. Chapters 2–6 deal with pre-processing tasks of an OCR system. Bukhari, Shafait, and Breuel discuss layout analysis methods, Setlur and Govindaraju pre-processing issues, Belaid and Ouwayed segmentation of ancient Arabic documents, and Likforman-Sulem et al. features for word recognition systems.
Part II: Recognition

Chapters 7–15 present different approaches for the recognition of Arabic script. The first six chapters all use HMM-based approaches. Borovikov and Zavorin present a multi-stage approach to document analysis; Ahmed, Mahmoud, and Parvez a recognizer for printed Arabic text; Pechwitz, El Abed, and Märgner an offline handwritten Arabic word recognizer; Dreuw, Rybach, Heigold, and Ney a large vocabulary optical character recognition system; Alkhoury, Gimenez, and Juan a Bernoulli-based handwriting recognition system; Jifroodian and Suen a handwritten Farsi word recognition system; Kessentini, Paquet, and Ben Hamadou a multi-stream Markov model recognizer.

Two chapters discuss further approaches. Graves presents a recognition system based on multidimensional recurrent neural networks, and Mozaffari discusses the application of fractal theory for document analysis and recognition. Khemakhem and Belghith discuss an OCR system based on the combination of complementary systems in Chap. 15.

Part III: Evaluation

The subject of Part III is the evaluation of recognition systems. In Chap. 16 Zavorin and Borovikov discuss data collection and annotation, and Arabic handwriting recognition competitions are described in Chap. 17 by Märgner and El Abed. In Chap. 18, Slimane et al. describe benchmarking strategies for Arabic word recognition.

Part IV: Applications

The final Part IV presents different applications using Arabic script recognition technology. In Chap. 19 Cheriet and Moghaddam present a robust word spotting system for historical Arabic manuscripts. Natarajan discusses, in Chap. 20, script-independent methods for Arabic handwriting recognition, and Kundu and Hines present an Arabic handwriting recognition system using over-segmentation in Chap. 21. Boubaker et al. discuss online Arabic databases and applications using these data in Chap. 22, and Abdelazeem et al. present, in Chap. 23, techniques for using online and offline features for Arabic handwriting recognition.

Target Audience

This book provides an overview of the state-of-the-art research in the field of OCR for Arabic scripts.
Different aspects and solutions have been addressed by the authors, and we hope that this comprehensive collection of ideas, problems, and solutions motivates researchers to continue this work. In that sense this book shall serve as a reference for researchers and graduate students studying OCR technology and methodology in general and for Arabic script in particular.

Braunschweig, Germany
Volker Märgner
Haikal El Abed

Acknowledgements

We thank all the chapter authors for their contributions to this book, which make it an invaluable resource to researchers in the area of OCR for all kinds of script using Arabic characters. A total of 67 authors prepared their contributions for the 23 chapters of the book and supported us during all steps from concept to draft and finalized chapter. We would like to thank Bente Maegaard, Director of the Centre for Language Technology at Copenhagen University, Denmark, for writing the foreword and for her continuing support in developing Arabic language resources and a network of researchers in this field. We would also like to thank Springer for their encouragement and persistence in driving us to complete this work in a timely manner.

Braunschweig, Germany
Volker Märgner
Haikal El Abed

Contents

Part I Pre-processing

1 An Assessment of Arabic Handwriting Recognition Technology
  Sargur N. Srihari and Gregory Ball
2 Layout Analysis of Arabic Script Documents
  Syed Saqib Bukhari, Faisal Shafait, and Thomas M. Breuel
3 A Multi-stage Approach to Arabic Document Analysis
  Eugene Borovikov and Ilya Zavorin
4 Pre-processing Issues in Arabic OCR
  Zhixin Shi, Srirangaraj Setlur, and Venu Govindaraju
5 Segmentation of Ancient Arabic Documents
  Abdel Belaïd and Nazih Ouwayed
6 Features for HMM-Based Arabic Handwritten Word Recognition Systems
  Laurence Likforman-Sulem, Ramy Al Hajj Mohammad, Chafic Mokbel, Fares Menasri, Anne-Laure Bianne-Bernard, and Christopher Kermorvant

Part II Recognition

7 Printed Arabic Text Recognition
  Irfan Ahmed, Sabri A. Mahmoud, and Mohammed Tanvir Parvez
8 Handwritten Arabic Word Recognition Using the IFN/ENIT-database
  Mario Pechwitz, Haikal El Abed, and Volker Märgner
9 RWTH OCR: A Large Vocabulary Optical Character Recognition System for Arabic Scripts
  Philippe Dreuw, David Rybach, Georg Heigold, and Hermann Ney
