Advances in Pattern Recognition

For other titles published in this series, go to www.springer.com/series/4205

Stefano Ferilli

Automatic Digital Document Processing and Management
Problems, Algorithms and Techniques

Stefano Ferilli
Dipartimento di Informatica
Università di Bari
Via E. Orabona 4
70126 Bari
Italy
[email protected]

Series Editor
Professor Sameer Singh, PhD
Research School of Informatics
Loughborough University
Loughborough
UK

ISSN 1617-7916
ISBN 978-0-85729-197-4
e-ISBN 978-0-85729-198-1
DOI 10.1007/978-0-85729-198-1
Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

© Springer-Verlag London Limited 2011

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: VTEX, Vilnius
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Verba volant, scripta manent

Foreword

Imagine a world without documents! No books, magazines, email, laws, and recipes. For better or worse, in our world documents outnumber every other kind of artifact. The Library of Congress contains over 20 million books plus another 100 million items in the special collections. Google indexes about 20 billion web pages (July 2010).
Both of these are growing like Topsy. There are far more documents than houses, cars, electronic gadgets, socks, and even rubber bands.

Since the dawn of antiquity, documents have played an essential role in fostering civilizations. One could make a plausible argument that material and social progress are proportional to document density. It helps, of course, if the documents are widely available rather than kept under lock and key in a monastery or the royal library. The first chapter of this book highlights the rapid transition of juridical and commercial records from paper to electronic form. The rest of the book traces the contemporaneous evolution of digital document processing from esoteric research into a pervasive technology.

The variety of physical embodiments of paper documents (multivolume encyclopedias, pocket books, newspapers, magazines, passports, driver’s licenses) is easily matched by the number of file types, generally known by acronyms like DOC, PDF, HTM, XML, TIF, GIF. The suffix indicates the type of processing appropriate for each file type. Differences between file types include how they are created, compressed, decompressed, and rendered in a human-readable format. Equally important is the balance between ease of modification of textual content (linguistic or logical components) and appearance (physical or layout components). Professor Ferilli offers an up-to-date tour of the architecture of common document file types, lists their areas of applicability, and provides detailed explanations of the notions underlying classical compression algorithms.

Important paper documents like deeds and stock certificates are written or printed in indelible ink on watermarked stock, and then locked into a vault.
Digital files stored on personal computers or transmitted over the internet must be similarly protected against malware launched by curious or malicious hackers. Perhaps surprisingly, many of the elaborate and recondite measures used to ensure the authenticity, secrecy, and traceability of digital documents are rooted in Number Theory. This is one of the oldest branches of pure mathematics, with many counterintuitive theorems related to the factorization of integers into prime numbers. The difficulty of factorizing large numbers is of fundamental importance in modern cryptography. Nevertheless, some current systems also incorporate symbol substitutions and shuffles borrowed from ancient ciphers.

Because of its applications to computer security, cryptography has advanced more in the last 40 years than in the previous three thousand. Among widely used systems based on the secret (symmetric) key and public (asymmetric) key paradigms are the Data Encryption Standard (DES), Rivest Ciphers (RCn), the Rivest, Shamir, Adleman (RSA) algorithm, and the Digital Signature Algorithm (DSA). One-way encryption is often used to encrypt passwords and to generate a digital fingerprint that provides proof that a file has not been altered. Other methods can provide irrefutable evidence of the transmission of a document from one party to another. Professor Ferilli expertly guides the reader along the often tortuous paths from the basic mathematical theorems to the resulting security software.

There is a piquant contrast between the role of national governments in promoting the use of secure software to facilitate commerce and to grease the wheels of democracy, and their duty to restrict the propagation of secure software in order to protect its military value and to maintain the ability of law enforcement agents to access potentially criminal communications. This book reviews recent legislation with emphasis on relevant Italian, British and European Community laws.
While we are not yet ready to declare all written material that exists only on paper as legacy documents, that moment cannot be very far. Professor Ferilli is a key member of a research team that has made steady progress for upwards of 30 years on the conversion of scanned paper documents to digital formats. He presents a useful overview of the necessary techniques, ranging from the low-level preprocessing functions of binarization and skew correction to complex methods based on first-order logic for document classification and layout analysis. Many of the algorithms that he considers best-in-class have been incorporated, after further tuning, into his group’s prototype document analysis systems.

For scanned documents, the transition from document image analysis to content analysis requires optical character recognition (OCR). Even born-digital documents may require OCR in the absence of software for reading intermediate file formats. OCR is discussed mainly from the perspective of open source developments because little public information is available on commercial products. Handwritten and hand-printed documents are outside the scope of this work.

Some documents such as musical scores, maps and engineering drawings are based primarily on long-established and specialized graphic conventions. They make use of application-specific file types and compression methods. Others, like postal envelopes, bank checks and invoice forms, are based on letters and digits but don’t contain a succession of sentences. Except for some sections on general image processing techniques and color spaces, the focus of this book is on documents comprising mainly natural language.

The value of digital documents transcends ease of storage, transmission and reproduction. Digital representation also offers the potential of the use of computer programs to find documents relevant to a query from a corpus and to answer questions based on facts contained therein. The corpus may contain all the documents accessible on the World Wide Web, in a digital library, or in a domain-specific collection (e.g., of journals and conference reports related to digital document processing).
For text-based documents, both information retrieval (IR) and query-answer (QA) systems require natural language processing (NLP). Procedures range from establishing the relative frequency, morphology and syntactic role of words to determining the sense, in a particular context, of words, phrases and sentences. Often simple relationships with arcane names, like synonymy, antonymy, hyperonymy, hyponymy, meronymy and holonymy, are sought between terms and concepts. Fortunately, the necessary linguistic resources like lexicons, dictionaries, thesauri, and grammars are readily available in digital form.

To avoid having to search through the entire collection for each query, documents in large collections, even the World Wide Web, are indexed according to their putative content. Metadata (data about the data, like catalog information) is extracted and stored separately. The relevant NLP, IR and IE techniques (based on both statistical methods and formal logic) and the management of large collections of documents are reviewed in the last two chapters.

The study of documents from the perspective of computer science is so enjoyable partly because it provides, as is evident from this volume, many opportunities to bridge culture and technology. The material presented in the following pages will be most valuable to the many researchers and students who already have a deep understanding of some aspect of document processing. It is such scholars who are likely to feel the need, and to harvest the benefits, of learning more about the growing gamut of techniques necessary to cope with the entire subject of digital document processing. Extensive references facilitate further exploration of this fascinating topic.
Rensselaer Polytechnic Institute   Prof. George Nagy

Preface

Automatic document processing plays a crucial role in present-day society, due to the progressive spread of computer-readable documents in everyday life, from informal uses to more official ones. This holds not only for new documents, typically born digital, but also for legacy ones that undergo a digitization process in order to be exploited in computer-based environments. In turn, the increased availability of digital documents has caused a corresponding increase in users’ needs and expectations. It has been a very hot topic in recent years, for both academia and industry, as witnessed by the several flourishing research areas related to it and by the ever-increasing number and variety of applications available on the market. Indeed, the broad range of document kinds and formats existing today makes this subject a many-faceted and intrinsically multi-disciplinary one that joins the most diverse branches of knowledge, covering the whole spectrum of humanities, science and technology. It turns out to be a fairly complex domain even when focusing on the Computer Science perspective alone, since almost all of its branches come into play in document processing, management, storage and retrieval, in order to support the several concerns involved in, and to solve the many problems raised by, application to real-world tasks. The resulting landscape calls for a reference text where all the aspects involved are collected, described and related to each other.

This book concerns Automatic Digital Document Processing and Management, where the adjective ‘digital’ is interpreted as being associated with ‘processing and management’ rather than with ‘document’, thus also including digitized documents in the focus of interest, in addition to born-digital ones.
It is conceived as a survey of the different issues involved in the principal stages of a digital document’s life, aimed at providing a sufficiently complete and technically valid idea of the whole range of steps occurring in digital document handling and processing, instead of focusing particularly on any specific one of them. For many of these steps, fundamentals and established technology (or current proposals for questions still under investigation) are presented. Since the subject is too wide and scattered, a complete coverage of the significant literature is infeasible. More important is acquainting the reader with the main problems involved, with the Computer Science branches suitable for tackling them, and with some research milestones and interesting approaches available. Thus, after introducing each area of concern, a more detailed description is given of selected algorithms and techniques proposed in this field over the past decades. The choice was not made with the aim of indicating the best solutions available in the state of the art (indeed, no experimental validation result is reported), but rather for the purpose of comparing different perspectives on how the various problems can be faced, perspectives that are possibly complementary enough to offer a good chance of fruitful integration.
The organization of the book reflects the natural flow of phases in digital document processing: acquisition, representation, security, pre-processing, layout analysis, understanding, analysis of single components, information extraction, filing, indexing and retrieval. Specifically, three main parts are distinguished:

Part I deals with digital documents, their role and exploitation. Chapter 1 provides an introduction to documents, their history and their features, and to the specific digital perspective on them. Chapter 2 then overviews the current widespread formats for digital document representation, divided by category according to the degree of structure they express. Chapter 3 discusses technological solutions to ensure that digital documents can fulfill suitable security requirements allowing their exploitation in formal environments in which legal issues come into play.

Part II introduces important notions and tools concerning the geometrical and pictorial perspective on documents. Chapter 4 proposes a selection of the wide literature on image processing, with specific reference to techniques useful for handling images that represent a whole digitized document or just specific components thereof. Chapter 5 is devoted to the core of processing and representation issues related to the various steps a document goes through from its submission up to the identification of its class and relevant components.

Part III analyzes the ways in which useful information can be extracted from the documents in order to improve their subsequent exploitation, and is particularly focused on textual information (although a quick glance at the emerging field of image retrieval is also given). Chapter 6 surveys the landscape of Natural Language Processing resources and techniques developed to carry out linguistic analysis steps that are preliminary to further processing aimed at content handling. Chapter 7 closes the book dealing with the ultimate objective of document processing: being able to extract, retrieve and represent, possibly at a semantic level, the subject with which a document is concerned and the information it conveys.
Appendices A and B briefly recall fundamental Machine Learning notions, and describe as a case study a prototypical framework for building an intelligent system aimed at merging, in tight cooperation and interaction, most of the presented solutions, to provide a global approach to the management of digital documents and libraries.

The book aims at being as self-contained as possible. Only a basic computer science and high-school mathematics background is needed to read and understand its content. The general presentation of the various topics and the more specific aspects thereof are neatly separated, in order to facilitate exploitation by readers interested in either of the two. The technical level is, when needed, sufficiently detailed to give a precise account of the matter presented, but not so deep and pervasive as to discourage non-professionals from usefully exploiting it. In particular, most very technical parts are limited to sub-subsections, so that they can be skipped without losing the general view and unity of the contents. To better support readers, particular care was put into the aids to consultation: the index reports both acronyms and their version in full, and in the case of phrases includes entries for all component terms; the glossary collects notions that are referred to in different places of the book, so that a single reference is provided, avoiding redundancy; the acronym list is very detailed, including even items that are used only once in the text but can be needed in everyday practice on document processing; the final Reference section collects all the bibliography cited in the text.

The main novelty of this book lies in its bridging the gaps left by the current literature, where all works focus on specific sub-fields of digital document processing but do not frame them in a panoramic perspective of the whole subject nor provide links to related areas of interest.
It is conceived as a monograph for practitioners who need a single and wide-spectrum vade mecum to the many different aspects involved in digital document processing, along with the problems they pose, noteworthy solutions and practices proposed in the last decades, possible applications and open questions. It aims at acquainting the reader with the general field and at being complemented by other publications reported in the References for further in-depth and specific treatment of the various aspects it introduces. The possible uses, and connected benefits, are manifold. In an academic environment, it can be exploited as a textbook for undergraduate/graduate courses interested in a broad coverage of the topic.¹ Researchers may consider it as a bridge between their specific area of interest and the other disciplines, steps and issues involved in Digital Document Processing. Document-based organizations and final users can find it useful as well, as a repertoire of possible technological solutions to their needs.

Although care has been taken in thoroughly proofreading the drafts, the size of this work makes it likely that some typos or other kinds of imprecision are present in the final version. I apologize in advance for this, and will be grateful to anyone who notifies me about them.

Bari, Italy   Stefano Ferilli

¹ The included material is too much for a semester, but the teacher can select which parts to stress more and which ones to just introduce.