COMPUTATIONALLYACCELERATEDPAPYROLOGY By AlexC.Williams Athesissubmittedinpartialfulfillment oftherequirementsforthedegreeof MASTEROFSCIENCE in ComputerScience MiddleTennesseeStateUniversity May2015 ThesisCommittee: Dr. HyrumD.Carroll Dr. JohnF.Wallin Dr. CenLi ACKNOWLEDGEMENTS Iwouldliketothankmyadvisor,Dr. HyrumCarroll,forhiswonderfulmentorshipand his contributions to my growth as both a professional researcher and computer scientist. I would like to also thank Dr. John Wallin for encouraging me to be think creatively and introducing me to theexciting area of citizen science. Iwould also like to acknowledge the facultywithintheDepartmentofComputerScience,particularlyDr. ChrisilaPettey,who havegraciouslyofferedtheirsupportandencouragementincompletingthisthesis. Iwouldliketothankmyfellowgraduatestudents,AnthonyMills,AnthonyDavis,Matt Houglum, RobertReaves, MattWang, andNathan Reale, fortheir advice andpatience in listeningtomerambleaboutmythesis. I would like to extend the largest of thanks to the University of Oxford’s papyrology staff, which includes James Brusuelas, Chiara Meccariello, Gertjan Verhasselt, Maroula Salemenou,andAnnapaolaSantarsiero,fortheirassistanceindevelopingbettersoftware forpapyrologyandcontributingtothestrengthofthisthesis. Iwould liketoacknowledgeandthankthe EgyptianExplorationSocietyfor accessto thedatasetusedinthisthesis. Lastly,IwouldliketothankRoyaforherencouragement,love,andsupportthroughout thecompletionofthisthesis. Mostimportantly,Iwouldliketothankbothmyparents,Mike and Tammy, to whom this thesis is dedicated, for inspiring me to pursue my dreams and bringpositivechangetotheworldthroughmywork. ii ABSTRACT Papyrologists transcribe and identify papyrus fragments in order to enrich modern lives by better understanding the linguistics, culture, and literature of the ancient world. In practice, these tasks are extremely challenging and slow due the limited amount of information preserved in each papyrus fragment (i.e., due to deterioration). For example, sincetheirdiscoveryinthelate19thcentury,only10%ofthemorethan500,000fragments intheOxyrhynchuspapyricollectionhasbeengivenpreliminaryidentifications. Thisthesispresentstwocomputationalapproachesforacceleratingpapyrustranscription andidentification. Thefirstapproachisacomputationalpipelinethataggregatesmillions ofcrowdsourcedletterclassificationsintotranscriptionsofpapyrusfragments. Thesecond approach leverages genetic sequence alignment algorithms to rapidly identify damaged papyrus fragments to known papyrus manuscripts. These approaches greatly improve upon thecurrentstate-of-the-arttechniquesand setanewstandardforleveragingcomputationto thetranscriptionandidentificationofancienttexts. iii PREFACE Chapter 3 is a version of the conference paper, A Computational Pipeline for Crowd- sourcedTranscriptionsofAncientGreekPapyrusFragments1. TheUniversityofOxford’s ZooniverseteamandAncient Livesproject, whichincludesChrisLintott, JohnWallin, and JamesBrusuelas,createdtheAncientLivesinterfaceandprovidedthecrowdsourceddata usedinthisstudy. CollaboratorsfromtheUniversityofMinnesota,includingLucyFortson, Anne-FrancoiseLamblin,andHaoyuYu,contributedgreatlytotheinitialstagesofproject development. Idesignedanddevelopedthepre-processingstageforseparatingindividual click-dataforeachfragmentintodistinctfilesanddetermininganimage’smostprobableori- entation. JohnWallinandHaoyuYudesignedanddevelopedthekernel-basedapproachfor aggregatingindividualuserletteridentifications. JohnWallinalsodesignedanddeveloped thestepwiseapproach,whichIhavemadeveryfewmodifications. Idevelopedtheapproach used in Stage 2 of the pipeline which calculates line sequences. John Wallin designed and developed an additional version of Stage 2 that was referential in the design of my approach. Idevelopedahigh-levelwrapperforbothexecutingandevaluatingthepipeline. I also developed visualization software used to overlay consensus letters and letter-groups on an image (see Figure 8). Marco Perale manually performed a preliminary evaluation of the kernel-based approach. Marco Perale, JamesBrusuelas, andDirk Obbink provided letteridentificationsfromthepublishedOxyrhynchusvolumesthatwereusedinevaluating both stages of the pipeline. I conceived the study’s two-part evaluation. Additionally, I performed the analysis of the results for each stage and wrote the software for creating Figure9. Additionally,Iwrote themanuscriptfor thepublishedpaper. JohnWallin was the supervisoryauthorof thisprojectandhas beeninvolvedheavily throughouttheprojectin conceptformationandcontinuousdevelopment. 1Williams,A.C.,Wallin,J.F.,Yu,H.,Perale,M.,Carroll,H.D.,Lamblin,A.F.,Fortson,L.,Obbink. D, Lintott,C.J.,&Brusuelas,J.H.(2014,October). AComputationalPipelineforCrowdsourcedTranscriptions ofAncientGreekPapyrusFragments. InBigData(BigData),2014IEEEInternationalConferenceon(pp. 100-105). IEEE. iv Chapter 4 is a version of the conference paper, Identification of Ancient Greek Papyrus FragmentsUsingGeneticSequenceAlignmentAlgorithms2. HyrumCarrollandJohnWallin initiated the study by implementing an identity substitution matrix in BLAST, converting a proprietary database of Greek manuscripts into a FASTA format, and creating the en- codingschemebetweenaminoacidsandGreekletters. CollaboratorsfromtheUniversity of Minnesota, including Anne-Francoise Lamblin and Haoyu Yu, contributed additional preliminary work with investigating the applicability of genetic sequence alignment. I investigatedapplicablemethodologiesforcalculatinganewsubstitutionandselectedthe BLOSUM methodology. Both Hyrum Carroll and I conceived how to leverage the BLO- SUMmethodology. Idevelopedsoftwarefor(a)calculatingbackgroundfrequenciesofeach letter in the Greek alphabet, (b) calculating misidentification ratios for each letter pair in the AncientLivescrowdsourced data,and (c) calculatinga newsubstitution matrix withthe calculated letter background frequencies and misidentification ratios. I implemented the calculated matrix into BLAST. Hyrum Carroll provided software that I heavily modified to createthe 14,100 simulated fragmentsused to evaluate thenewmatrix. I conceived and performed the study’s evaluation. I independently analyzed the results of the evaluation andbothauthoredandpresentedthepublishedpaper. HyrumCarrollandJohnWallinare the supervisory authors for this study and were involved to a great degree during project conception. TheworkinbothChapters3and4wouldnothavebeenpossiblewithoutthemillionsof contributions madebythe thousandsofgenerous volunteersfrom theZooniverseproject. Toeachvolunteer,Iwouldliketoextendaspecialthanksforallottingtheirvaluabletimeto makingeachoftheseprojectspossible. 2Williams,A.C.,Carroll,H.D.,Wallin,J.F.,Brusuelas,J.,Fortson,L.,Lamblin,A.F.,&Yu,H.(2014, October). IdentificationofAncientGreekPapyrusFragmentsUsingGeneticSequenceAlignmentAlgorithms. Ine-Science(e-Science),2014IEEE10thInternationalConferenceon(Vol. 2,pp. 5-10). IEEE. v TABLEOFCONTENTS LISTOFTABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LISTOFFIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix CHAPTERI. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Papyrology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 ThesisStatement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 ThesisOutline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 CHAPTERII. RELATEDWORK . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 MethodsforAcceleratingTranscription . . . . . . . . . . . . . . . . . . . 8 OpticalCharacterRecognitionforAncientManuscripts . . . . . . . 8 CrowdsourcingTranscriptionsforAncientManuscripts . . . . . . . 9 MethodsforAcceleratingIdentification . . . . . . . . . . . . . . . . . . . 12 PairwiseSequenceAlignment . . . . . . . . . . . . . . . . . . . . 12 GeneticSequenceAlignment . . . . . . . . . . . . . . . . . . . . . 13 CHAPTERIII.ACOMPUTATIONALPIPELINEFORCROWDSOURCEDTRAN- SCRIPTIONSOFANCIENTGREEKPAPYRUSFRAGMENTS . . . . . . . . 15 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 PreprocessingStage . . . . . . . . . . . . . . . . . . . . . . . . . . 17 AggregationofUserClicksintoConsensusClicks . . . . . . . . . 19 CreationofLineSequencesfromConsensusClicks . . . . . . . . . 20 PipelineEvaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 vi EvaluationofConsensusLetterIdentificationsfromUsers . . . . . 22 EvaluationofLineSequenceCreationComponent . . . . . . . . . 23 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 ConsensusLetterIdentificationEvaluationResults . . . . . . . . . 24 LineSequenceCreationComponentEvaluationResults . . . . . . . 25 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 CHAPTERIV. IDENTIFICATIONOFANCIENTGREEKPAPYRUSFRAGMENTS USINGGENETICSEQUENCEALIGNMENTALGORITHMS . . . . . . . . . 28 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 AppliedSequenceAlignment . . . . . . . . . . . . . . . . . . . . . 30 AcceleratedTranscriptionandIdentification . . . . . . . . . . . . . 31 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 CalculatingaNewSubstitutionMatrix . . . . . . . . . . . . . . . . 33 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 SimulatedFragments . . . . . . . . . . . . . . . . . . . . . . . . . 35 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 CHAPTERV. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 AcceleratedTranscriptionsforAncientGreekPapyrusFragments . . . . . . 41 AcceleratedIdentificationforAncientGreekPapyrusFragments . . . . . . 42 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 vii LISTOFTABLES Table 1– Average F scores of consensus letter identifications in each hand- 1 writingstylefromboththekernel-basedapproachandstepwiseaggregation approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Table2– Averagefragmentlengths, average editdistances, anderrorratiosfor the41fragmentsusedinthelinesequencecreationcomponentevaluation. Error ratios are calculated by dividing the average edit distance by the averagefragmentlength. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Table 3– TheGreekLetterOrientedSubstitutionMatrix(GLOSUM)isa sub- stitution matrixthat contains ascore for every possibleletter-pair alignment intheGreekalphabet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Table 4– Example of a simulated fragment with key variable modifications highlightedinred. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Table5– Percentagesofidentificationforeachsubsetoffragmentsequences usedintheevaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 viii LISTOFFIGURES Figure1– AdeterioratedpapyrusfragmentfromtheOxyrhynchuscollection. . 3 Figure2– Anexampleofhowtheimageextractionalgorithmmisidentifieda portionofP.Oxyfragment3076asmultiplefragments. . . . . . . . . . . . 9 Figure3– AnexampleofaverticallineinthemarginofP.Oxyfragment4611. 10 Figure4– TheAncientLivestranscriptioninterface. . . . . . . . . . . . . . . 11 Figure5– Anexamplealignmentbetween“elegant”and“elephant”. . . . . . . 12 Figure 6– The architectureof the computationalpipeline used to create consen- sustranscriptionsfromuser-clicksmadethroughtheAncientLivesinterface. 17 Figure 7– An Ancient Lives fragment image (left). A digital consensus tran- scriptionforthesameAncientLivesfragment(right). . . . . . . . . . . . . 18 Figure8– Avisualizationoftheprocessingperformedbythepipelineusingthe sameAncientLivesfragmentfromFigure7. . . . . . . . . . . . . . . . . . 20 Figure9– APrecision-Recallgraphthatvisualizeshowtheprecisionandrecall of each consensus letter identification from both approaches align with respecttocalculatedF scoresinTable1. . . . . . . . . . . . . . . . . . . 25 1 Figure10– ApapyrusfragmentfromOxyrhynchus. . . . . . . . . . . . . . . . 29 Figure11– ExamplealignmentoftwogeneticsequencesbyBLAST[3]. . . . . 30 Figure12– Asectionoftheoriginalpapyrusandthecrowdsourcedtranscription. 32 Figure 13– Example match between a fragment (top line) and a portion of a knownfull-textmanuscript(bottomline)inGreek-BLAST. . . . . . . . . . 33 Figure14– Aboveisaplotrepresentingtheaverageidentificationperformanceof allsimulatedfragmentsintheverticalgapratesubsetbasedonaverticalgap raterangingfrom0to10andfragmentlengthsrangingfrom10characters to150characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 ix 1 CHAPTERI. INTRODUCTION Classicalhistoriansagreethatthevastmajorityofwrittenknowledgefromtheancient Mediterranean (i.e., Egypt, Greece, and Rome) vanished with the burning of the Ancient LibraryofAlexandriain48BC[25]. Withthedestructionofthelibrary,muchofthemodern world’s knowledge of ancient life has been established by studying ancient texts. Some writtenworksofantiquityhavesurvivedtodaydaythroughaprocessknownasmedieval transmission. In this process, works of antiquity were manually selected and copied by scribesuntiltheMiddleAges,wherecopiesoftheseworksfinallymadetheirwaytoprint, thus being extant today. However, the works that did not undergo medieval transmission havesincebeenlost. Today, professionals continuetoresearchtheancientworld, specificallyGraeco-Roman Egypt, by studying collections of papyrus fragments that were discovered and excavated from archaeological dig sites in the Mediterranean region. Over the last century, only a handfulofancientpapyricollectionshavebeenfoundthroughouttheworld. Oneofthemost influentialand richestcollectionsof discoveredpapyrus todateis theOxyrhynchuspapyri1, avasttroveofover500,000GreekpapyrusfragmentsdiscoveredintheancientEgyptian city of Oxyrhynchus by British classicists Bernard Grenfell and Arthur Hunt [8]. After transportingthecollectiontotheUniversityofOxford,thesescholarsandtheirsuccessors began the development of formal methods and techniques for studying and extracting information from the papyrus fragments. Their efforts and work with the Oxyrhynchus papyri,alongsidethoseofotherspecialistsofthe19thand20thcenturies,wouldultimately formthebasisforthemoderndisciplineofpapyrology,thepracticeofstudyingancienttexts writtenon papyrus inorder tobetter understandthe culture,literature, andlifestyle ofthe ancientworld,especiallyGraeco-RomanEgypt. 1http://www.papyrology.ox.ac.uk/POxy/
Description: