EXPLOITING THE GAP IN HUMAN AND MACHINE ABILITIES IN HANDWRITING RECOGNITION FOR WEB SECURITY APPLICATIONS

By
Amalia Rusu
August, 2007

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF STATE UNIVERSITY OF NEW YORK AT BUFFALO IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

© Copyright 2007 by Amalia Rusu

To my children and husband, Andreea, Alex, and Adrian, for their love and support.

Acknowledgments

I am deeply grateful to several people for making this thesis possible. I would like to begin by expressing my deep appreciation to my major advisor, Dr. Venu Govindaraju. I would highly recommend him as advisor to any prospective student looking around for a mentor. With his encouragement, guidance, and support he has helped me realize what is important in my research and career. His help and enthusiasm have been invaluable and have motivated me to become a researcher. I will always be grateful for his positive influence in my life.

Many thanks to my dissertation committee, Dr. Peter Scott for his time and support and Dr. William Rapaport for his comments on my research work. I would also like to offer my thanks to many other people at the University at Buffalo, the faculty and the office staff for being so friendly and supportive. I would like to also express my appreciation to the colleagues, students, researchers and staff at the Center of Excellence for Document Analysis and Recognition (CEDAR) and Center for Unified Biometrics and Sensors (CUBS).

I am deeply thankful to my family for their patience and love, my parents and my sister for guiding me, my children Andreea and Alex for being so sweet, and finally for love and support to my husband Adrian who never gives up on his highest expectations for me.

Abstract

Automated recognition of unconstrained handwriting continues to be a challenging research task.
In contrast to the traditional role of handwriting recognition in applications such as postal automation, bank check reading, etc., in this dissertation we explore the use of handwriting recognition for cyber security. HIPs (Human Interactive Proofs) are automatic reverse Turing tests designed so that virtually all humans can pass the test but state-of-the-art computer programs will fail. Machine-printed, text-based HIPs are now commonly used to defend against bot attacks. We have designed a new methodology that will exploit the gap between the abilities of humans and computers in reading handwritten text images to design efficient HIPs.

We have: (i) developed an algorithm to automatically generate random and infinitely many distinct handwritten HIPs, (ii) identified the weaknesses of state-of-the-art handwriting recognizers, and (iii) developed a method which exploits the strengths of human reading abilities that can be controlled, so that the HIPs are human readable but not machine readable.

We have used a large repository of handwritten word images that current handwriting recognizers cannot read (even when provided with a lexicon) and also generated synthetic handwritten samples using a character tracing model. We have designed word images (HIPs) to take advantage of both our knowledge of the common source of errors in automated handwriting recognition systems as well as the salient aspects of human reading. For example, humans can tolerate intermittent breaks in strokes (using the Gestalt laws) but current computer programs fail when the breaks vary in size or exceed certain thresholds. The simultaneous interplay of several Gestalt laws of perception and the geon theory of pattern recognition (that implies object recognition by components) adds to the challenge of finding the parameters that truly separate human and machine abilities.

We have conducted several experiments which have all reconfirmed the superiority of humans in reading handwritten text especially under conditions of low image quality, clutter, and occlusion, and empirically demonstrated that handwritten HIPs are a viable option for cyber security applications. Our goal is to use handwritten HIPs for protection of applications, data, and systems in networks that are connected to the Internet (Cyberspace).

List of Figures

1.1 Handwritten CAPTCHA challenges.
1.2 Main components that build up this dissertation.
2.1 Various CAPTCHA tests.
2.2 Handwritten CAPTCHA challenges easy to interpret by humans but not by machines.
2.3 Speed (in seconds) and accuracy (the percentage of correctly recognized words) of a lexicon-driven handwritten word recognizer when the lexicon contains 10, 100, 1,000, and 20,000 entries (words).
2.4 Lexicon-driven model for word recognizer. (Figure taken from [25])
2.5 Lexicon-driven model for character recognizer. (Figure taken from [19])
2.6 Grapheme model. (Figure taken from [65])
2.7 Handwritten word images recognized by the state-of-the-art handwriting recognizers.
2.8 Automatic authentication session for Web services.
3.1 Original handwritten image (a). Synthetic images (b, c, d, e, f).
3.2 A synthetic handwriting sample.
3.3 A traced template for character x.
3.4 Examples of a function, made up of cosine function segments (top) and our proposed wave-like function (bottom).
3.5 Illustration of various nonlinear transformations performed individually: a) only ascent line variation, b) only x-line variation, c) only descent line variation, d) only text width variation, e) only shearing variation, and f) only baseline variation.
3.6 Perturbations of curve-defining points with various degrees of perturbation.
3.7 The finalized synthetic handwriting sample with varying width and thickness.
3.8 The GUI of the tracing program.
3.9 The GUI of the generator program.
3.10 Synthetic handwritten CAPTCHA challenges based on real and non-sense words.
3.11 Synthetic handwritten samples using various parameters for the same templates.
4.1 Example of context: is the letter the same?
4.2 Several examples for Gestalt laws of perception: a) similarity, b) proximity, c) continuity, d) symmetry, e) closure, f) familiarity, g) figure-ground, and h) memory.
4.3 Evidence of Geon Theory when objects are lacking some of their components. a) Recoverable objects, b) Non-recoverable objects.
4.4 a) Basic geons. b) Objects constructed from geons.
4.5 Object recognition is size invariant.
4.6 Object recognition is rotational invariant.
4.7 The truth words are: Lockport, Silver Creek, Young America, W. Seneca, New York
4.8 The truth words are: Los Angeles, Buffalo, Kenmore
4.9 The truth words are: Young America, Clinton, Blasdell
4.10 The truth words are: Albany, Buffalo, Rockport
4.11 The truth word is: Buffalo
4.12 The truth words are: Syracuse, Tampa, Amherst, Kenmore
4.13 The truth words are: Buffalo, Hamburg, Waterville, Lewiston
4.14 The truth words are: Binghamton, Lockport, Rochester, Bradenton
4.15 The truth word is: W. Seneca
5.1 Handwritten CAPTCHA images that exploit the gap in abilities between humans and computers. Humans can read them, but OCR and handwriting recognition systems fail.
5.2 Handwritten US city name images collected or available from postal applications.
5.3 Isolated upper and lower case handwritten characters used to generate word images, real or nonsense.
5.4 Handwriting CAPTCHA puzzle generation.
5.5 Several transformations that affect image quality.
5.6 Segmentation errors are caused by over-segmentation, merging, fragmentation, ligatures, scrawls, etc. To make segmentation fail we can delete ligatures, use touching letters/digits, merge characters for over-segmentation or to be unable to segment.
5.7 Increasing lexicon challenges such as size, density, and availability cause problems to handwriting recognizers.
5.8 Transformations that affect the image features.
5.9 Multiple choice handwritten CAPTCHA.
5.10 Confusing results: a) if the overlaps are too large both humans and machines could recognize a wrong word (e.g., Wiilllliiamsvillllee where in reality the truth word is Williamsville), b) machines can read the image if the overlaps are too small (the truth word is Lockport).
5.11 Word images that have been recognized by machine due to size uncorrelations. The truth words are: Cheektowaga, Young America.
5.12 The area where the occlusions are applied has to be carefully chosen. We show examples here that do not pose enough difficulty to computers and therefore they have recognized the words Albany and Silver Creek.
5.13 Example of handwritten image that was recognized by one of our testing recognizers. The truth word is Lewiston.
Description: