ebook img

Structures of String Matching and Data Compression PDF

130 Pages·2012·0.66 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Structures of String Matching and Data Compression

Structures of String Matching and Data Compression N. Jesper Larsson Departmentof ComputerScience LundUniversity DepartmentofComputerScience LundUniversity Box118 S-22100Lund Sweden Copyright'1999byJesperLarsson CODEN:LUNFD6/(NFCS-1015)/1(cid:150)130/(1999) ISBN 91-628-3685-4 Abstract This doctoral dissertation presents a range of results concerning ef(cid:2)- cient algorithms and data structures for string processing, including several schemescontributing to sequential data compression.It com- prises boththeoretic resultsand practical implementations. We study the suf(cid:2)x tree data structure, presenting an ef(cid:2)cient rep- resentation and several generalizations. This includes augmenting the suf(cid:2)xtreetofullysupportslidingwindowindexing(includingapracti- calimplementation)inlineartime.Furthermore,weconsideravariant thatindexesnaturallyword-partitioneddata, andpresentalinear-time constructionalgorithmforatreethatrepresentsonlysuf(cid:2)xesstarting atwordboundaries,requiringspacelinearinthenumberofwords. By applying our sliding window indexing techniques, we achieve an ef(cid:2)cient implementation for dictionary-based compression based on the LZ-77 algorithm. Furthermore, considering predictive source modelling, we show that a PPM* style model can be maintained in linear time using arbitrarily bounded storage space. We also consider the related problem of suf(cid:2)x sorting, applicable to suf(cid:2)x array construction and block sorting compression. We present an algorithm that eliminates super(cid:3)uousprocessing of previoussolu- tions while maintaining robust worst-case behaviour.We experimen- tally show favourable performance for a wide range of natural and degenerate inputs,and present a complete implementation. Block sorting compression using BWT, the Burrows-Wheeler trans- form,hasimplicitstructurecloselyrelatedtocontexttreesusedinpre- dictive modelling. We show how an explicit BWT context tree can be ef(cid:2)ciently generated as a subset of the corresponding suf(cid:2)x tree and explore the central problems in using this structure. We experi- mentallyevaluatepredictioncapabilitiesofthetreeandconsiderrep- resenting it explicitly as part of the compressed data, arguing that a conscious treatment of the context tree can combine the compres- sion performance of predictivemodelling with the computational ef- (cid:2)ciency of BWT. Finally,weexploreof(cid:3)inedictionary-basedcompression,andpresent asemi-staticsourcemodellingschemethatobtainsexcellentcompres- sion,yetisalsocapableofhighdecodingrates.Theamountofmemory used by the decoder is (cid:3)exible, and the compressed data has the po- tential of supporting direct search operations. Between theoryand practice, sometalk as iftheyweretwo (cid:150)making a separation and difference between them. Yet wise men know that bothcan begained inapplyingoneselfwhole-heartedlytoone. Bhagavad-Gfl(cid:17)tafl 5:4 Short-sighted programming can fail to improve the quality of life. It can reduce it, causing economic loss or even physical harm. In a few extreme cases, bad programmingpractice can lead todeath. P.J. Plauger, Computer Language, Dec. 1990 Contents Foreword 7 ChapterOne Fundamentals 9 1.1 Basic De(cid:2)nitions 10 1.2 TrieStorageConsiderations 12 1.3 Suf(cid:2)xTrees 13 1.4 SequentialDataCompression 19 ChapterTwo SlidingWindowIndexing 21 2.1 Suf(cid:2)xTreeConstruction 22 2.2 SlidingtheWindow 24 2.3 StorageIssuesandFinalResult 32 ChapterThree IndexingWord-PartitionedData 33 3.1 De(cid:2)nitions 34 3.2 WastingSpace:AlgorithmA 36 3.3 SavingSpace:AlgorithmB 36 3.4 ExtensionsandVariations 40 3.5 SublinearConstruction:AlgorithmC 41 3.6 AdditionalNotesonPractice 45 ChapterFour Suf(cid:2)xSorting 48 4.1 Background 50 4.2 AFasterSuf(cid:2)xSort 52 4.3 TimeComplexity 56 4.4 AlgorithmRe(cid:2)nements 59 4.5 ImplementationandExperiments 63 ChapterFive Suf(cid:2)xTreeSourceModels 71 5.1 Ziv-LempelModel 71 5.2 PredictiveModelling 73 5.3 Suf(cid:2)xTreePPM*Model 74 5.4 FinitePPM*Model 76 5.5 Non-StructuralOperations 76 5.6 Conclusions 78 ChapterSix Burrows-WheelerContextTrees 79 6.1 Background 80 6.2 ContextTrees 82 6.3 TheRelationshipbetweenMove-to-frontCoding andContextTrees 86 6.4 ContextTreeBWTCompressionSchemes 87 6.5 FinalComments 89 ChapterSeven Semi-Static DictionaryModel 91 7.1 PreviousApproaches 93 7.2 RecursivePairing 94 7.3 Implementation 95 7.4 CompressionEffectiveness 101 7.5 EncodingtheDictionary 102 7.6 Tradeoffs 105 7.7 ExperimentalResults 106 7.8 FutureWork 110 AppendixA SlidingWindowSuf(cid:2)xTreeImplementation 111 AppendixB Suf(cid:2)xSortingImplementation 119 AppendixC Notation 125 Bibliography 127 Foreword O riginally,mymotivationforstudyingcomputersciencewasmost likely spawned by a calculator I bought fourteen years ago. This gad- getcouldstoreashortsequenceofoperations,includingaconditional jumptothestart,whichmadeitpossibletoprogramsurprisinglyintri- catecomputations.Isoonrealizedthatthissimplemechanismhadthe powerto replace the tedious repeated calculations I sodetested with an intellectual exercise: to (cid:2)nd a general method to solve a speci(cid:2)c problem (something I would later learn to refer to as an algorithm) thatcouldbeexpressedbypressingasequenceofcalculator keys.My fascination for this process still remains. With more powerful computers, programming is easier, and more challenging problems are needed to keep the process interesting. Ul- timately,inalgorithmtheory,the bothersofproducingan actual pro- gram are completely skipped over. Instead, the (cid:2)nal product is an explanation of how an idealized machine could be programmed to solve a problem ef(cid:2)ciently. In this abstract world, program elements are represented as mathematical objects that interact as if they were physical.Theycanbechainedtogether,piledontopofeachother,or linked together to any level of complexity. Without these data struc- tures, which can be combined into specialized tools for solving the problemat hand, producinglargeorcomplicated programswould be infeasible. However, they do not exist any further than in the pro- grammer’smind;whentheprogramistobewritten,everythingmust again betranslated into morebasic operations.Inmyresearch,Ihave 7 Foreword triedtomaintain thisconnection, seeingalgorithmtheorynotmerely as mathematics, but ultimately as a programming tool. At a low level, computers represent everything as sequences of numbers, albeit with different interpretations depending on the con- text. The main topic in this thesis is algorithms and data structures (cid:150) most often tree shaped structures (cid:150) for(cid:2)nding patterns and repeti- tions in long sequences, strings, of similar items. Examples of typical strings are texts (strings of letters and punctuation marks), programs (stringsofoperations),andgeneticdata(stringsofaminoacids).Even two-dimensional data, such as pictures, are represented as strings at a lower level. One area particularly explored in the thesis is storing strings compactly, compressing them, by recording repetition and sys- tematically introducing abbreviations forrepeating patterns. Theresultisacollectionofmethodsfororganizing,searching,and compressing data. Its creation has deepened my insights in computer science enormously, and I hope some of it can make a lasting contri- bution to the computing world as well. Numerouspeoplehavein(cid:3)uencedthis work.Obviously,mycoau- thorsfordifferentpartsofthethesis,ArneAndersson,AlistairMoffat, Kunihiko Sadakane, and Kurt Swanson, have had a direct part in its creation,butmanyothershavecontributedinavarietyofways.With- outattemptingtonamethemall,Iwouldliketoexpressmygratitude toallthecentral andperipheralmembersoftheglobalresearchcom- munity who have supported and assisted me. Thein(cid:3)uenceofmyadvisorArneAnderssongoesbeyondthework wherehe stands as an author.He broughtme into the research com- munity from his special angle, and imprinted me with his views and visions.Hisnotionsofwhatisrelevantresearch,andhowitshouldbe presented, have guided me throughthese last (cid:2)ve years. Finally,Iwishtospeci(cid:2)callythankAlistairMoffatforinvitingmeto Melbourneandcollaboratingwithmeforthreemonths,duringwhich time I was accepted as a full member of his dynamic research group. This gave me a newperspective,and a signi(cid:2)cant pushtowards com- pleting the thesis. Malm(cid:246),August1999 JesperLarsson 8 Chapter One Fundamentals T he main theme of this work is the organization of sequential data to (cid:2)nd and exploit patterns and regularities. This chapter de(cid:2)nes ba- sicconcepts,formulatesfundamental observationsandtheorems,and presentsanef(cid:2)cientsuf(cid:2)xtreerepresentation.Followingchaptersfre- quently referand relate tothe material givenin this chapter. The material and much of the text in this current work is taken primarilyfromthe following(cid:2)ve previouslypresentedwritings: (cid:149) ExtendedApplicationofSuf(cid:2)xTreesto DataCompression,presented at the IEEE Data Compression Conference 1996 [42]. A revised and updated version of this material is laid out in chapters two and (cid:2)ve, andtosomeextentin§1.3. (cid:149) Suf(cid:2)x Trees on Words, written in collaboration with Arne Andersson andKurtSwanson,publishedinAlgorithmica,March1998[4].Apre- liminary versionwaspresentedat theseventhAnnual Symposiumon Combinatorial Pattern Matching in June 1996. This is presented in chapterthree,withsomeofthepreliminariesgivenin§1.2. (cid:149) TheContextTreesofBlock SortingCompression,presented at the IEEE Data Compression Conference 1998 [43]. This is the basis of chap- tersix. (cid:149) Of(cid:3)ine Dictionary-Based Compression,written with Alistair Moffat of theUniversityofMelbourne,presentedattheIEEEDataCompression Conference1999[44].Anextendedversionofthisworkispresented inchapterseven. 9 ChapterOne (cid:149) FasterSuf(cid:2)xSorting,writtenwithKunihikoSadakaneoftheUniversity of Tokyo; technical report, submitted [45]. This work is reported in chapterfour.Someofitsmaterialhasbeenpresentedinapreliminary version as A Fast Algorithm for Making Suf(cid:2)xArrays andfor Burrows- WheelerTransformationbyKunihikoSadakane[59]. 1.1 Basic De(cid:2)nitions We assume that the reader is familiar with basic conventional de(cid:2)ni- tions regarding strings and graphs, and do not attempt to completely de(cid:2)ne all the concepts used. However, to resolve differences in the literature concerning the use of some concepts, we state the de(cid:2)ni- tions of not only our specialized concepts, but also of some more general ones. For quick reference to our specialized notations, appendix C on pages 125(cid:150)126 summarizes terms and symbols used in each of the chapters of the thesis. Notational Convention For notation regarding asymptotic growth of functionsandsimilarconcepts,weadoptthegeneraltraditionincom- puterscience;see,forinstance,Cormen,Leiserson,andRivest[20]. All logarithms in the thesis are assumed to be base two, except where otherwise stated. 1.1.1 SymbolsandStrings Theinputofeach ofthe algorithmsdescribed in this thesis is a sequence of items which we refer to as symbols. The interpretationofthesesymbolsasletters,programinstructions,amino acids, etc., is generally beyond our scope. We treat a symbol as an abstract elementthat canrepresentany kindofunitinthe actual im- plementation (cid:150)althoughwe doprovideseveral examplesof practical uses,andoftenaim oureffortsat aparticular areaofapplication. Two basic sets of operations for symbols are common. Either the symbolsareconsideredatomic(cid:150)indivisibleunitssubjecttoonlyafew prede(cid:2)nedoperations,ofwhichpairwisecomparisonisacommonex- ample(cid:150)ortheyareassumedtoberepresentedbyintegers,andthereby possibletomanipulatewithallthecommonarithmeticoperations.We adoptpredominantlythelatter approach,sinceourprimarygoalisto develop practically useful tools, and in present computers everything isalways,atthelowestlevel,representedasintegers.Thus,restricting allowedoperationsbeyondthesetofarithmeticonesoftenintroduces an unrealistic impediment. Wedenote thesize ofthe inputalphabet,thesetof possiblevalues 10

Description:
grammer's mind; when the program is to be written, everything must. again be thors for different parts of the thesis, Arne Andersson, Alistair Moffat,. Kunihiko munity from his special angle, and imprinted me with his views and . it needs to be regarded as numeric, we normally assign it the value
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.