String algorithms and data structures⋆

Paolo Ferragina
Dipartimento di Informatica, Università di Pisa, Italy
arXiv:0801.2378v1 [cs.DS] 15 Jan 2008

Abstract. The string-matching field has grown to such a complicated stage that various issues come into play when studying it: data structure and algorithmic design, database principles, compression techniques, architectural features, cache and prefetching policies. The expertise nowadays required to design good string data structures and algorithms is therefore transversal to many computer science fields, and much more study on the orchestration of known, or novel, techniques is needed to make progress in this fascinating topic. This survey aims at illustrating the key ideas which should constitute, in our opinion, the current background of every index designer. We also discuss the positive features and drawbacks of known indexing schemes and algorithms, and devote much attention to detailing research issues and open problems on both the theoretical and the experimental side.

1 Introduction

String data is ubiquitous: common-place applications are digital libraries and product catalogs (for books, music, software, etc.), electronic white and yellow page directories, specialized information sources (e.g. patent or genomic databases), customer relationship management data, etc. The amount of textual information managed by these applications is increasing at a staggering rate. The two best illustrative examples of this growth are the World-Wide Web, which is estimated to provide access to at least three terabytes of textual data, and the genomic databases, which are estimated to store more than fifteen billion base pairs. Even in private hands, collection sizes which were unimaginable a few years ago are now common.

This scenario is destined to become more pervasive due to the migration of current databases toward XML storage [2].
XML is emerging as the de facto standard for the publication and interchange of heterogeneous, incomplete and irregular data over the Internet and amongst applications. It provides ground rules to mark up data so that it is self-describing and easily readable by humans and computers. Large portions of XML data are textual and include descriptive fields and tags. Evaluating an XML query involves navigating paths through a tree (or, in general, a graph) structure. In order to speed up query processing, current approaches consist of encoding document paths into strings of arbitrary length (e.g. book/author/firstname/) and replacing tree navigational operations with string prefix queries (see e.g. [52,129,4]).

⋆ Address: Dipartimento di Informatica, Corso Italia 40, 56125 Pisa, Italy, [email protected], http://www.di.unipi.it/∼ferragin. Partially supported by Italian MIUR projects: "Technologies and services for enhanced content delivery" and "A high-performance distributed platform".

In all these situations brute-force scanning of such large collections is not a viable approach to performing string searches. Some kind of index must necessarily be built over these massive textual data to process string queries (of arbitrary lengths) effectively, possibly taking into account the presence in our computers of various memory levels, each with its own technological and performance characteristics [8]. The index design problem therefore turns out to be more challenging than ever before.

The American Heritage Dictionary (2000, fourth edition) defines index as follows: pl. (in·dex·es) or (in·di·ces) "1. Something that serves to guide, point out, or otherwise facilitate reference, especially: a. An alphabetized list of names, places, and subjects treated in a printed work, giving the page or pages on which each item is mentioned. b. A thumb index. c. Any table, file, or catalog.
[...]"

Some definitions proposed by experts are "The most important of the tools for information retrieval is the index—a collection of terms with pointers to places where information about documents can be found" [119]; "indexing is building a data structure that will allow quick searching of the text" [22]; or "the act of assigning index terms to documents which are the objects to be retrieved" [111].

From our point of view, an index is a persistent data structure that allows, at query time, focusing the search for a user-provided string (or a set of them) on a very small portion of the indexed data collection, namely the locations at which the queried string(s) occur. Of course the index is just one of the tools needed to fully solve a user query, as the retrieval of the queried string locations is just the first step of what is called the "query answering process". Information retrieval (IR) models, ranking algorithms, query languages and operations, user-feedback models and interfaces, and so on, all constitute the rest of this complicated process and are beyond the scope of this survey. Hereafter we will concentrate our attention on the challenging problems concerned with the design of efficient and effective indexing data structures, the basic block upon which every IR system is built. We refer the reader interested in those other topics to the vast literature, browsing from e.g. [79,114,163,22,188].

The right step into the text-indexing field. Publications regarding indexing techniques and methodologies are a common outcome of database and algorithmic research. Their number is ever growing, so that citing all of them is a task doomed to fail. This makes the evaluation of the novelty, impact and usefulness of the plethora of recent index proposals more and more difficult.
Hence, to approach the huge field of text indexing from the correct angle, we first need a clear framework for the development, presentation and comparison of indexing schemes [193]. The lack of this framework has led some researchers to underestimate the features of known indexes, disregard important criteria, or make simplifying assumptions which have led them to unrealistic and/or distorted results.

The design of a new index passes through the evaluation of many criteria, not just its description and some toy experiments. We need at a minimum to consider overall speed, disk and memory space requirements, CPU time and measures of disk traffic (such as number of seeks and volume of data transferred), and ease of index construction. In a dynamic setting we should also consider index maintenance in the presence of addition, modification and deletion of documents/records, and implications for concurrency, transactions and recoverability. Also of interest for both static and dynamic data collections are applicability, extensibility and scalability. Indeed, no indexing scheme is all-powerful: different indexes support different classes of queries and manage different kinds of data, so they may turn out to be useful in different application contexts. As a consequence there is no single winner among the indexing data structures available nowadays; each one has its own positive features and drawbacks, and we must know all of their fine details in order to make the right choice when implementing an effective and efficient search engine or IR system.

In what follows we therefore go into the main aspects which influence the design of an indexing data structure, thus providing an overall view of the text indexing field; we introduce the arguments which will be detailed in the next sections, and we briefly comment on some recent topics of research that will be fully addressed at the end of each of these subsequent sections.

The first key issue: The I/O subsystem.
The large amount of textual information currently available in electronic form requires storing it on external storage devices, like (multiple) disks and CD-ROMs. Although these mechanical devices provide a large amount of space at low cost, their access time is more than 10^5 times slower than the time to access the internal memory of computers [158]. This gap is currently widening with the impressive progress in circuit design technology. Ongoing research on the engineering side is therefore trying to improve the input/output subsystem by introducing hardware mechanisms such as disk arrays, disk caches, etc. Nevertheless, the improvement achievable by means of a proper arrangement of data and a properly structured algorithmic computation on disk devices abundantly surpasses the best expected technology advancements [186].

Larger datasets can stress the need for locality of reference in that they may reduce the chance of sequential (cheap) disk accesses to the same block or cylinder; they may increase the data fetch costs (which are typically linear in the dataset size); and they may even affect the proportion of documents/records that answer a user query. In this situation a naïve index might incur the so-called I/O bottleneck, that is, its update and query operations might spend most of their time transferring data to/from the disk, with a consequent sensible slowdown of performance. As a result, index scalability and the asymptotic analysis of index performance, orchestrated with the disk consciousness of index design, are nowadays hot and challenging research topics which have been shown to induce a positive effect not limited just to mechanical storage devices, but extending to all other memory levels (L1 and L2 caches, internal memory, etc.).

To design and carefully analyze the scalability and query performance of an index we need a computational model that abstracts the I/O subsystem in a reasonable way.
Accurate disk models are complex [164], and it is virtually impossible to exploit all the fine points of disk characteristics systematically, either in practice or for algorithmic design. In order to capture in an easy, yet significant, way the differences between the internal (electronic) memory and the external (mechanical) disk, we adopt the external memory model proposed in [186]. Here a computer is abstracted to consist of a two-level memory: a fast and small internal memory, of size M, and a slow and arbitrarily large external memory, called disk. Data between the internal memory and the disk are transferred in blocks of size B (called disk pages). Since disk accesses are the dominating factor in the running time of many algorithms, the asymptotic performance of the algorithms is evaluated by counting the total number of disk accesses performed during the computation. This is a workable approximation for algorithm design, and we will use it to evaluate the performance of query and update algorithms. However, there are situations, like the construction of indexing data structures (Sections 2.1 and 3.5), in which this accounting scheme does not accurately predict the running time of algorithms on real machines, because it does not take into account some important specialties of disk systems [162]. Namely, disk access costs have mainly two components: the time to fetch the first bit of requested data (seek time) and the time required to transmit the requested data (transfer rate). Transfer rates are more or less stable, but seek times are highly variable. It is thus well known that accessing one page from the disk in most cases decreases the cost of accessing the page succeeding it, so that "bulk" I/Os are less expensive per page than "random" I/Os.
This difference becomes much more prominent if we also consider the read-ahead/buffering/caching optimizations which are common in current disks and operating systems. To deal with these specialties and avoid the introduction of many new parameters, we will sometimes refer to the simple accounting scheme introduced in [64]: a bulk I/O is the reading/writing of a contiguous sequence of cM/B disk pages, where c is a proper constant; a random I/O is any single disk-page access which is not part of a bulk I/O. In summary, the performance of the algorithms designed to build, process or query an indexing data structure is evaluated by measuring: (a) the number of random I/Os, and possibly the bulk I/Os; (b) the internal running time (CPU time); (c) the number of disk pages occupied by the indexing data structure and the working space of the query, update and construction algorithms.

The second key issue: types of queries and indexed data. Up to now we have talked about indexing data structures without specifying the type of queries that an index should be able to support, and no attention has been devoted to the type of data an index is called to manage. These issues have a surprising impact on the design complexity and space occupancy of the index, and will be strictly interrelated in the discussion below.

There are two main approaches to index design: word-based indexes and full-text indexes. Word-based indexes are designed to work on linguistic texts, or on documents where a tokenization into words may be devised. Their main idea is to store the occurrences of each word (token) in a table that is indexed via a hashing function or a tree structure (they are usually called inverted files or inverted indexes). To reduce the size of the table, common words are either not indexed (e.g. the, at, a) or the index is later compressed.
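As a rough illustration of this accounting scheme, the following sketch compares the page-transfer cost of a sequential scan with worst-case random accesses, and counts bulk I/Os under the cM/B rule of [64]. The parameter values and function names are illustrative assumptions, not taken from the cited papers.

```python
import math

# Illustrative parameters (assumptions for this sketch): page capacity B,
# internal memory capacity M (in items), and the constant c of the bulk-I/O scheme.
B = 4096          # items per disk page
M = 1 << 20       # items fitting in internal memory
c = 1

def page_transfers(n):
    """A sequential scan of n items moves ceil(n/B) disk pages."""
    return math.ceil(n / B)

def bulk_ios(n):
    """One bulk I/O reads c*M/B contiguous pages, i.e. c*M items,
    so a full scan costs ceil(n / (c*M)) bulk I/Os."""
    return math.ceil(n / (c * M))

n = 100_000_000
# Sequential: ~24 thousand page transfers grouped into ~100 bulk I/Os;
# worst-case random access would pay one I/O per item instead.
print(page_transfers(n), bulk_ios(n), n)
```

The point of the sketch is the asymmetry it exposes: the same data volume costs a handful of cheap bulk I/Os when laid out contiguously, but millions of expensive seeks when accessed at random.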
The advantage of this approach is that it supports very fast word (or prefix-word) queries and allows, at reasonable speed, some complex searches like regular-expression or approximate matches; two weaknesses are the impossibility of dealing with non-tokenizable texts, like genomic sequences, and the slowness in supporting arbitrary substring queries. Section 2 will be devoted to the discussion of word-based indexes and some recent advancements in their implementation, compression and supported operations. Particular attention will be devoted to the techniques used to compress the inverted index or the input data collection, and to the algorithms adopted for implementing more complex queries.

Full-text indexes have been designed to overcome the limitations above by dealing with arbitrary texts and general queries, at the cost of an increase in the additional space occupied by the underlying data structure. Examples of such indexes are: suffix trees [128], suffix arrays [121] and String B-trees [71]. They have been successfully applied to fundamental string-matching problems as well as to text compression [42], analysis of genetic sequences [88], optimization of XPath queries on XML documents [52,129,4] and the indexing of special linguistic texts [67]. General full-text indexes are therefore the natural choice for performing fast complex searches without any restrictions on the query sequences and on the format of the indexed data; however, the reader should always keep in mind that these indexes are usually more space demanding than their word-based counterparts [112,49] (cf. opportunistic indexes [75] below). Section 3 will be devoted to a deep discussion of full-text indexes, paying particular attention to the String B-tree data structure and its engineering. In particular we will introduce some novel algorithmic and data structural solutions which are not confined to this specific data structure.
Attention will be devoted to the challenging, yet difficult, problem of the construction of a full-text index, both from a theoretical and a practical perspective. We will show that this problem is related to the more general problem of string sorting, and then discuss the known results and a novel randomized algorithm which may have practical utility and whose technical details may have an independent interest.

The third key issue: the space vs. time trade-off. The discussion on the two indexing approaches above has pointed out an interesting trade-off: space occupancy vs. flexibility and efficiency of the supported queries. It indeed seems that in order to support substring queries, and deal with arbitrary data collections, we do need to incur the additional space overhead required by the more complicated structure of full-text indexes. Some authors argue that this extra space occupancy is a false problem because of the continued decline in the cost of external storage devices. However, the impact of space reduction goes far beyond the intuitive memory saving, because it may induce a better utilization of (the fast) cache and (the electronic) internal memory levels, may virtually expand the disk bandwidth and may significantly reduce the (mechanical) seek time of disk systems.
Hence data compression is an attractive choice, if not mandatory, not only for storage saving but also for its favorable impact on algorithmic performance. This is very well known in algorithmics [109] and engineering [94]: IBM has recently delivered the MXT Technology (Memory eXpansion Technology) for its x330 eServers, which consists in a memory chip that compresses/decompresses data on cache writebacks/misses, thus yielding a factor-of-two expansion of memory size at only slightly larger cost. It is not surprising, therefore, that we are witnessing in the algorithmic field an upsurging interest in designing succinct (or implicit) data structures (see e.g. [38,143,144,142,87,168,169]) that try to reduce as much as possible the auxiliary information kept for indexing purposes, without introducing any significant slowdown in the supported operations.

Such a research trend has led to some surprising results on the design of compressed full-text indexes [75] whose impact goes beyond the text-indexing field. These results lie at the crossing of three distinct research fields (compression, algorithmics, databases) and orchestrate together their latest achievements, thus showing once more that the design of an indexing data structure is nowadays an interdisciplinary task. In Section 4 we will briefly overview this issue by introducing the concept of opportunistic index: a data structure that tries to take advantage of the compressibility of the input data to reduce its overall space occupancy. This index encapsulates both the compressed data and the indexing information in space proportional to the entropy of the indexed collection, thus being optimal in an information-content sense. Yet these results are mainly theoretical in flavor and open to significant improvements with respect to their I/O performance.
Some of them have been implemented and tested in [76,77], showing that these data structures use roughly the same space required by traditional compressors, such as gzip and bzip2 [176], but with added functionality: they allow retrieving the occurrences of an arbitrary substring within texts of several megabytes in a few milliseconds. These experiments show a promising line of research and suggest the design of a new family of text retrieval tools, which will be discussed at the end of Section 4.

The fourth key issue: string transactions and index caching. Not only is string data proliferating, but data stores increasingly handle large numbers of string transactions that add, delete, modify or search strings. As a result, the problem of managing massive string data under a large number of transactions is emerging as a fundamental challenge. Traditionally, string algorithms focus on supporting each of these operations individually in the most efficient manner in the worst case. There is, however, an ever-increasing need for indexes that are efficient over an entire sequence of string transactions, possibly adapting themselves to the time-varying distribution of the queries and to the repetitiveness present in the query sequence, both at the string and prefix level. Indeed, it is well known that some user queries are frequently issued in certain time intervals [173], and some search engines improve their precision by expanding the query terms with some of their morphological variations (e.g. synonyms, plurals, etc.) [22]. Consequently, in the spirit of amortized analysis [180], we would like to design indexing data structures that are competitive (optimal) over the entire sequence of string operations. This challenging issue has been addressed at the heuristic level in the context of word-based indexes [173,39,125,131,101]; but it has unfortunately been disregarded when designing and analyzing full-text indexes.
Here the problem is particularly difficult because: (1) a string may be so long that it does not fit in one single disk page, or even cannot be contained in internal memory; (2) each string comparison may need many disk accesses if executed in a brute-force manner; and (3) the distribution of the string queries may be unknown or vary over time. A first, preliminary contribution in this setting has been achieved in [48], where a self-adjusting and external-memory variant of the skip-list data structure [161] has been presented. By properly orchestrating the caching of this data structure, the caching of some query-string prefixes and the effective management of string items, the authors prove an external-memory version for strings of the famous Static Optimality Theorem [180]. This introduces a new framework for designing and analyzing full-text indexing data structures and string-matching algorithms, where a stream of user queries is issued by an unknown source and caching effects must then be exploited and accounted for when analyzing the query operations. In the next sections we will address the caching issue both for word-based and full-text indexing schemes, pointing out some interesting research topics which deserve a deeper investigation.

The moral that we would like to convey to the reader is that the text indexing field has grown to such a complicated stage that various issues come into play when studying it: data structure design, database principles, compression techniques, architectural considerations, cache and prefetching policies. The expertise nowadays required to design a good index is therefore transversal to many algorithmic fields, and much more study on the orchestration of known, or novel, techniques is needed to make progress in this fascinating topic. The rest of the survey is therefore devoted to illustrating the key ideas which should constitute, in our opinion, the current background of every index designer.
The guiding principles of our discussion will be the four key issues above; they will guide the description of the positive features and drawbacks of known indexing schemes, as well as the investigation of research issues and open problems. A vast, but obviously not complete, literature will accompany our discussion and should be the reference where an eager reader may find further technical details and research hints.

2 On the word-based indexes

There are three main approaches to designing a word-based index: inverted indexes, signature files and bitmaps [188,22,19,63]. The inverted index, also known as inverted file, posting file or, in normal English usage, concordance, is doubtless the simplest and most popular technique for indexing large text databases storing natural-language documents. The other two mechanisms are usually adopted in certain applications even if, recently, they have been mostly abandoned in favor of inverted indexes, because some extensive experimental results [194] have shown that inverted indexes offer better performance than signature files and bitmaps, in terms of both size of index and speed of query handling [188]. As a consequence, the emphasis of this section is on inverted indexing; a reader interested in signature files and/or bitmaps may start browsing from [188,22] and have a look at some more recent, correlated and stimulating results in [33,134].

An inverted index is typically composed of two parts: the lexicon, also called the vocabulary, containing all the distinct words of the text collection; and the inverted lists, also called the posting lists, storing for each vocabulary term a list of all text positions in which that term occurs. The vocabulary therefore supports a mapping from words to their corresponding inverted lists, and in its simplest form is a list of strings and disk addresses. The search for a single word in an inverted index consists of two main phases: it first locates the word in the vocabulary and then retrieves its list of text positions.
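A minimal in-memory sketch of this two-part organization, and of the two-phase single-word search, might look as follows; the document set and function names are illustrative, and real systems would keep the posting lists compressed on disk.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Vocabulary (dict keys) mapped to posting lists of
    (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return index

def search_word(index, word):
    """Phase 1: locate the word in the vocabulary (here a dict lookup);
    phase 2: retrieve its posting list."""
    return index.get(word, [])

docs = ["the quick brown fox", "the lazy dog", "quick brown dogs"]
idx = build_inverted_index(docs)
print(search_word(idx, "quick"))  # [(0, 1), (2, 0)]
```

Here the dict plays the role of the hash-based vocabulary; a trie or a sorted array with disk addresses, as mentioned above, would be drop-in replacements for the same mapping.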
The search for a phrase or a proximity pattern (where the words must appear consecutively or close to each other, respectively) consists of three main phases: each word is searched separately, their posting lists are then retrieved and finally intersected, taking care of consecutiveness or closeness of word positions in the text.

It is apparent that the inverted index is a simple and natural indexing scheme, and this has obviously contributed to its spread among IR systems. Starting from this simple theme, researchers have indulged their whims by proposing numerous variations and improvements. The main aspect which has been investigated is the compression of the vocabulary and of the inverted lists. In both cases we are faced with some challenging problems.

Since the vocabulary is a textual file, any classical compression technique might be used, provided that subsequent pattern searches can be executed efficiently. Since the inverted lists consist of numbers, any variable-length encoding of integers might be used, provided that subsequent sequential decodings can be executed efficiently. Of course, any choice in the implementation of the vocabulary or the inverted lists influences both the processing speed of queries and the overall space occupied by the inverted index. We proceed then to comment on each of these points below, referring the reader interested in their fine details to the cited literature.

The vocabulary is the basic block of the inverted index and its "content" constrains the type of queries that a user can issue. Actually, the index designer is free to decide what a word is, and which are the representative words to be included in the vocabulary. One simple possibility is to take each of the words that appear in the documents and declare them verbatim to be vocabulary terms. This tends both to enlarge the vocabulary, i.e. the number of distinct terms that appear in it, and to increase the number of document/position identifiers that must be stored in the posting lists.
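The three-phase phrase search described above can be sketched with an in-memory positional index; documents and names are illustrative. A match for word k survives only if the same document contains word k+1 at the next position.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Word -> set of (document id, word position) pairs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for pos, word in enumerate(text.split()):
            index[word].add((doc_id, pos))
    return index

def phrase_query(index, phrase):
    """Documents where the words of `phrase` occur consecutively:
    retrieve each posting set, then intersect with a position shift."""
    words = phrase.split()
    candidates = set(index.get(words[0], set()))
    for offset, word in enumerate(words[1:], start=1):
        nxt = index.get(word, set())
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in nxt}
    return sorted({d for d, _ in candidates})

docs = ["new york city", "the city of new york", "york new"]
idx = build_positional_index(docs)
print(phrase_query(idx, "new york"))  # [0, 1]
```

A proximity query would relax the `p + offset` test to a window, e.g. `abs(q - p) <= k` over candidate position pairs.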
Having a large vocabulary not only affects the storage space requirements of the index but can also make it harder to use, since there are more potential query terms that must be considered when formulating a query. For this reason it is common to transform each word into some normal form before including it in the vocabulary. The two classical approaches are case folding, the conversion of all uppercase letters to their lowercase equivalents (or vice versa), and stemming, the reduction of each word to its morphological root by removing suffixes or other modifiers. It is evident that both approaches present advantages (vocabulary compression) and disadvantages (extraneous material can be retrieved at query time) which should be taken into account when designing an IR system. Another common transformation consists of omitting the so-called stop words from the indexing process (e.g., a, the, in): these are words which occur too often, or carry such small information content, that their use in a query would be unlikely to eliminate any documents. In the literature there has been a big debate on the usefulness of removing or keeping the stop words. Recent progress on the compaction of the inverted lists has shown that the space overhead induced by those words is not significant, and is abundantly paid for by the simplification of the indexing process and by the increased flexibility of the resulting index.

The size of the vocabulary deserves particular attention. It is intuitive that it should be small, but more insight into its cardinality and structure must be acquired in order to go into more complex considerations regarding its compression and querying. An empirical law widely accepted in IR is Heaps' law [91], which states that the vocabulary of a text of n words has size V = O(n^β), where β is a small positive constant depending on the text.
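A toy illustration of these normalizations follows. The suffix-stripping "stemmer" below is a deliberately naive stand-in for a real algorithm such as Porter's, and the stop-word list is an illustrative fragment, not one drawn from the cited literature.

```python
STOP_WORDS = {"a", "the", "in", "of", "at"}  # tiny illustrative list

def normalize(word):
    """Case folding plus naive suffix stripping (a stand-in for real stemming)."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the word remains to be a plausible root.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def index_terms(text):
    """Terms that would enter the vocabulary after stop-word removal
    and normalization."""
    return [normalize(w) for w in text.split() if w.lower() not in STOP_WORDS]

print(index_terms("The Dogs were Running in the Park"))
```

Even this crude version shows both effects noted above: "Dogs" and "dog" now share one vocabulary entry (compression), but over-stripping can conflate unrelated words (extraneous matches at query time).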
As shown in [16], β is practically between 0.4 and 0.6, so the vocabulary needs space roughly proportional to the square root of the indexed data. Hence for large data collections the overhead of storing the vocabulary, even in its extended form, is minimal. Classical implementations of a set of words via hash tables and trie structures seem appropriate for exact-word or prefix-word queries. As soon as the user aims for more complicated queries, like approximate or regular-expression searches, it is preferable to keep the vocabulary in its plain form as a vector of words and then answer a user query via one of the powerful scan-based string-matching algorithms currently known [148]. The increase in query time is paid for by the more complicated queries the index is able to support.

As we observed in the Introduction, space saving is intimately related to time optimization in a hierarchical memory system, so it is natural to ask ourselves if, and how, compression can help in vocabulary storage and searching. On the one hand, vocabulary compression might seem useless because of the vocabulary's small size; on the other hand, any improvement in the vocabulary search phase is appealing because the vocabulary is examined at each query on all of its constituent terms. Numerous scientific results [9,118,82,81,184,65,139,108,154,178,57,140,149,106] have recently shown how to compress a textual file and perform exact or approximate searches directly on the compressed text, without fully decompressing it. This approach may obviously be applied to vocabularies, thus introducing two immediate improvements: it squeezes them to a size that can easily be kept in internal memory even for large data collections; and it reduces the amount of data examined during the query phase, fully exploiting the processing speed of current processors with respect to the bandwidth and access time of internal memories, thus impacting fruitfully on the overall query performance.
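Plugging illustrative constants into Heaps' law shows how slowly the vocabulary grows relative to the collection; the values k = 40 and β = 0.5 below are assumptions chosen for the sketch, with β inside the empirical 0.4–0.6 range quoted above.

```python
def heaps_vocabulary_size(n_words, k=40, beta=0.5):
    """Heaps' law estimate V = k * n^beta.
    k and beta are illustrative; beta depends on the actual text."""
    return int(k * n_words ** beta)

# A 10,000x growth in collection size yields only a 100x larger vocabulary.
for n in (10**4, 10**6, 10**8):
    print(n, heaps_vocabulary_size(n))
```

This sublinear growth is precisely why the vocabulary can usually be held in internal memory even for very large collections.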
Experiments have shown a speedup of a factor of about two in query processing and a reduction of more than a factor of three in space occupancy. Nonetheless, the compressed dictionary is still scanned in full, so some room for query-time improvement remains. We will come back to this issue in Section 4.

Most of the space usage of inverted indexes is devoted to the storage of the inverted lists; a proper implementation of them thus becomes urgent in order to make this approach competitive against the other word-based indexing methods: signature files and bitmaps [188,194]. A large research effort has therefore been devoted to effectively compressing the inverted lists while still guaranteeing fast sequential access to their contents. Three different types of compaction approaches have been proposed in the literature, distinguished according to the accuracy with which the inverted lists identify the location of a vocabulary term, usually called the granularity of the index. A coarse-grained index identifies only the documents where a term occurs; an index of moderate grain partitions the texts into blocks and stores the block numbers where a term occurs; a fine-grained index returns instead a sentence, a term number, or even the character position of every term in the text. Coarse indexes require less storage (less than 25% of the collection size), but during the query phase parts of the text must be scanned in order to find the exact locations of the query terms; also, with a coarse index, multi-term queries are likely to give rise to insignificant matches, because the query terms might appear in the same document but far from each other. At the other extreme, word-level indexing enables queries involving adjacency and proximity to be answered quickly, because the desired relationship can be checked without accessing the text.
However, adding precise locational information expands the index by at least a factor of two or three compared with document-level indexing, since there are more pointers in the index and each one requires more bits of storage. In this case the inverted lists take nearly 60% of the collection size. Unless a significant fraction of the queries is expected to be proximity-based, or "snippets" containing text portions where the query terms occur must be efficiently visualized, it is preferable to choose a document-level granularity; proximity and phrase-based queries, as well as snippet extraction, can then be handled by a post-retrieval scan.

In all those cases the size of the resulting index can be further squeezed down by adopting a compression approach which is orthogonal to the previous ones. The key idea is that each inverted list can be sorted in increasing order, and therefore the gaps between consecutive positions can be stored instead of their absolute values. Compression techniques for small integers can then be used. As the gaps for longer lists are smaller, longer lists can be compressed better, and thus stop words can be kept without introducing a significant overhead in the overall index space. A number of suitable codes are described in detail in [188], and more experiments are reported in [187]. Golomb codes are suggested as the best ones in many situations, e.g. on the TREC collection, especially when the
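The gap transformation, and a Golomb code for the resulting small integers, can be sketched as follows; the posting list and the parameter b = 4 are illustrative (with b a power of two this specializes to a Rice code, and choosing b well is exactly what the cited literature studies).

```python
import math

def gaps(postings):
    """Store the first position, then differences between consecutive ones."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def golomb_encode(x, b):
    """Golomb code of an integer x >= 1 with parameter b:
    quotient q = (x-1)//b in unary, remainder in truncated binary."""
    q, r = divmod(x - 1, b)
    code = "1" * q + "0"                 # unary part, terminated by a 0
    c = math.ceil(math.log2(b))
    cutoff = (1 << c) - b                # first `cutoff` remainders use c-1 bits
    if r < cutoff:
        code += format(r, "b").zfill(c - 1) if c > 1 else ""
    else:
        code += format(r + cutoff, "b").zfill(c)
    return code

postings = [3, 7, 11, 22, 23]
print(gaps(postings))                             # [3, 4, 4, 11, 1]
print([golomb_encode(g, 4) for g in gaps(postings)])
```

The effect claimed above is visible here: dense (long) lists produce small gaps, which the unary quotient encodes in very few bits, so frequent terms such as stop words compress especially well.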
