Introduction to Genomic Information Retrieval
Genomic information retrieval (GIR) is an emerging area of intersection between information retrieval (IR) and bioinformatics. Bioinformaticians have defined their discipline as the “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”1 This definition clearly overlaps with that of information retrieval. Indeed, IR practitioners can bring substantial expertise to bioinformatics by applying and innovating with techniques used in text and multimedia retrieval for storing, organizing, analyzing, and visualizing data.
The application domain of GIR is broad. Expertise can be applied to the retrieval and management of collections of genomic data (such as DNA and protein sequences), medical and biological literature, microarray data, or other bioinformatics domains. Problems that can be solved include understanding the structure, function, and role of proteins, discovering evidence for the causes of disease, and annotating medical and biological artifacts. However, the skills needed to work on such problems are broad: sufficient understanding of biology and genetics, information retrieval expertise, and skills in fundamental computer science techniques such as compression, string matching, data mining, and data structures and algorithms.
This special section serves to bring together articles that show how diverse, important bioinformatics problems can be solved with GIR. As editor of this section, I was pleased by the response to the call for papers, and particularly impressed by the breadth of work submitted for review.
Articles In This Section
This special section contains two articles that describe solutions to very different problems in GIR.
Korodi and Tabus describe, to my knowledge, the most effective DNA compression scheme to date. Compression of DNA is interesting for both practical reasons (such as reduced storage and transmission costs) and functional reasons (such as inferring structure and function from compression models). DNA consists of four symbols, and so compaction of the ASCII representation to two bits per symbol is trivial. However, compression to less than two bits per symbol is much harder: for example, compression with the well-known bzip2 algorithm is ineffective, typically leading to expansion. Prior to the work described in Korodi and Tabus's article, the most effective schemes (described in detail in the article) achieved compression to between 1.41 and 1.86 bits per symbol. This article describes a scheme that is always more effective than other approaches, achieving between 1.36 and 1.82 bits per symbol. Their approach is based on a normalized maximum likelihood model, with heuristics that make the scheme fast on desktop hardware.

1 This is a working definition proposed by the US National Institutes of Health. The working document can be found at http://grants1.nih.gov/grants/bistic/CompuBioDef.pdf.

ACM Transactions on Information Systems, Vol. 23, No. 1, January 2005, Pages 1–2.
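As a concrete illustration of the trivial two-bits-per-symbol baseline mentioned above (my own sketch, not code from either article), the following packs and unpacks a DNA string at exactly two bits per base; dedicated DNA compressors must beat this rate to be worthwhile:

```python
# Illustrative sketch: the trivial 2-bit-per-base encoding of DNA.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for j, ch in enumerate(seq[i:i + 4]):
            b |= CODE[ch] << (2 * j)
        out.append(b)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from a packed buffer."""
    bases = "ACGT"
    return "".join(bases[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))
```

An 80-base string packs into 20 bytes; compressing below this bound requires exploiting real statistical structure in the sequence.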
The article of Sander et al. proposes and examines techniques for analyzing Serial Analysis of Gene Expression (SAGE) data. SAGE data is a new but important resource that allows the comparison of normal human tissues, cells, and cancers. The work in this article shows that SAGE data is robust, that is, that data produced from different laboratories is comparable and can be used together for large studies. Importantly, the work also suggests interesting similarities between brain and ovarian cancer, and shows that brain and breast cancer have strong gene expression signatures. This may be an interesting direction for further biomedical research. The GIR techniques described and used include methods for preprocessing the SAGE data, clustering and selecting data subsets of interest, and methods for presenting and understanding SAGE experiments.
ACKNOWLEDGMENTS
I thank the referees for their hard work in reviewing and providing feedback on the papers, in some cases to very tight deadlines. I also thank the authors of all submitted papers. Last, I am grateful to the Editor-in-Chief, Gary Marchionini, for supporting the GIR special section, and to Associate Editor Javed Mostafa for his hands-on support and mentoring throughout the reviewing, discussion, and finalizing of the special section. I hope that TOIS readers enjoy the special section on Genomic Information Retrieval.
HUGH E. WILLIAMS
RMIT University
Melbourne, Australia
An Efficient Normalized Maximum Likelihood
Algorithm for DNA Sequence Compression
GERGELY KORODI and IOAN TABUS
Tampere University of Technology
This article presents an efficient algorithm for DNA sequence compression, which achieves the best compression ratios reported over a test set commonly used for evaluating DNA compression programs. The algorithm introduces many refinements to a compression method that combines: (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, through reference to preceding approximate matching blocks, (2) encoding by a first-order context coding, and (3) representing strings in clear, to make efficient use of the redundancy sources in DNA data, under fast execution times. One of the main algorithmic features is the constraint on the matching blocks to include reasonably long contiguous matches, which not only reduces the search time significantly, but also can be used to modify the NML model to exploit the constraint for getting smaller code lengths. The algorithm handles the changing statistics of DNA data in an adaptive way, and by predictively encoding the matching pointers it is successful in compressing long approximate matches. Apart from comparison with previous DNA encoding methods, we present compression results for the recently published human genome data.
Categories and Subject Descriptors: E.4 [Coding and Information Theory]—Data compaction and compression; G.3 [Probability and Statistics]—Correlation and regression analysis; Markov processes; J.3 [Life and Medical Sciences]—Biology and genetics

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Approximate sequence matching, DNA compression, normalized maximum likelihood model
1. INTRODUCTION
Being the definitive codebook for the traits and biological functional behavior of every living organism, DNA sequences are the most challenging information sources that humans try to decipher and use. An enormous quantity of different DNA sequences exists, the size of each sequence varying in the range of millions to billions of nucleotides. In both scientific and commercial communities there is intense activity targeted at sequencing the DNA of many species and studying the variability of DNA between individuals of the same species, which produces
Authors' address: Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland; email: {gergely.korodi,ioan.tabus}@tut.fi.
© 2005 ACM 1046-8188/05/0100-0003 $5.00
ACM Transactions on Information Systems, Vol. 23, No. 1, January 2005, Pages 3–34.
huge amounts of information that needs to be stored and communicated to a large number of people. Therefore there is a great need for fast and efficient compression of DNA sequences.
The traditional way to tackle the lossless compression of data is to make use of statistically significant repetitions of the symbols or groups of symbols in the data. Many of the commonly used general-purpose methods eliminate the redundancies of exact replicas of a certain subsequence by using efficient referencing to its closest occurrence in the past.
Unfortunately, these general-purpose compression algorithms typically fail to attain any size reduction when applied to DNA sequences, which implies that these sequences rarely possess the kind of regularities (like high predictability of the next symbol based on its context) that are typical for, say, human-created text. However, it is well known that redundancy is commonly present in DNA, but it involves more subtle forms, such as palindrome matches and approximate repetitions. The effective handling of these features is the key to successful DNA compression.
The direct practical advantages of compressing DNA sequences stem from the already large, and rapidly increasing, amount of available data. Although in recent years mass storage media and devices became easily affordable and ubiquitous, data compression now has an even more important role in reducing the costs of data transmission, since the DNA files are typically shared and distributed over the Internet. Space-efficient representation of the data reduces the load on FTP service providers, as the transmissions are done faster, and it also saves costs for clients who access and download the sequences. Since the price of transmission is proportional to the sizes of accessed files, even savings of 10–20% are useful. In this article we demonstrate that this size reduction can be easily accomplished for typical eukaryotic data.
The compression of DNA has an intrinsic value in itself, but additionally it provides many useful clues to the nature of regularities that are statistically significant in the DNA data, indicating how different parts of the sequence are in relation with each other, how sensitive the genome is to random changes such as crossover and mutation, what the average composition of a sequence is, and where the important composition changes occur. Due to its many potential uses as a statistical analysis tool for DNA sequences, any DNA compression method is faced with all of the challenges that modeling tools are facing: maximizing the statistical relevance of the model used, and choosing between various models based on the overall performance-complexity merit.
One example of a statistical question easily solved by an efficient compression scheme is whether or not a certain approximate match of two subsequences is statistically relevant. To solve this, one simply has to check whether encoding one subsequence based on a reference to the other subsequence, and also including the description of the mismatches, is able to compress the overall DNA sequence or not. To maximize the likelihood of a correct answer, the compression program that is used has to be the best known compressor, otherwise statistically relevant matches will remain undetected. This shows that designing the most efficient compressor is a goal of paramount importance for statistical and biological inference [Li et al. 2001; Li et al. 2003], even if the gains over other compressors are just marginal when judging only from the storage savings benefits.
The complexity of the models that are implicitly or explicitly used by each specific lossless compression method for capturing the regularities in the data should be weighed carefully when comparing various compression methods, due to its impact on various aspects such as: statistical relevance of model parameters, simplicity of parameter estimation, simplicity of using the model for encoding, and attainable compression rate.
The present article discusses a compression method aimed at achieving very efficient compression with a fast execution time. The foundations for this work have been laid down in Tabus et al. [2003a], which is in turn based on the Normalized Maximum Likelihood model presented in Tabus et al. [2003b]. Whereas in Tabus et al. [2003a] the simplicity of the model was one of the stipulated objectives, here we relax this restriction in favor of increased compression efficiency and practical feasibility. As a result, in Section 2 we introduce the framework for our algorithm, more complex than the one considered in Tabus et al. [2003a], followed by a detailed discussion and analysis in Section 3 regarding implementation issues, solutions and performance evaluations.

In Section 4 we present the empirical evaluation of our method, concentrating on constraints such as compression efficiency, processing speed and memory requirements. We show that at the time of this writing our program surpasses in terms of compressibility all previously published results on a de facto set of test sequences. The practical feasibility of our program is demonstrated by encoding the entire human genome, achieving a compression rate of 1.66 bits/base.
1.1 Connections with Previous Works
The first compression algorithm dedicated exclusively to nucleic acid sequences, and hence significantly surpassing general-purpose methods, was published by Grumbach and Tahi [1993]. Their program, called Biocompress, used a Lempel-Ziv style substitutional algorithm [Ziv and Lempel 1977] that, apart from the exact direct matches, was also able to detect complementary palindromes, which are exact matches of an invertible transform of the subsequence, obtained by taking the sequence in reversed order and replacing each base with its complement (A↔T and C↔G). Such palindrome matches are in general as frequent as direct matches, being a redundancy source worth exploiting for compression. In a later method, called Biocompress-2, Grumbach and Tahi [1994] improved Biocompress by adding an order-2 context coder followed by arithmetic coding [Rissanen 1976].
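The complementary-palindrome transform exploited by Biocompress is easy to state in code. A minimal sketch (my own illustration, not code from the article); a palindrome match is then an exact match against the reverse complement of an earlier subsequence:

```python
# Illustrative sketch: the reverse-complement transform (A<->T, C<->G,
# then reverse) used to detect complementary palindrome matches.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]
```

The transform is an involution: applying it twice recovers the original sequence, which is why such matches can be referenced as cheaply as direct ones.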
A new generation of DNA compression algorithms appeared when practical use was made of the data redundancy involved in approximate repetitions [Chen et al. 2001, 2002]. Especially efficient are two algorithms introduced in Chen et al. [2001] by Chen, Kwong and Li: the first version, GenCompress-1, again used the Lempel-Ziv algorithm and an order-2 context coder, but here the substitutional method only used replacement operations when matching two subsequences. In the second version, GenCompress-2, insertion and deletion operations were allowed in addition to the replacements, when looking for an approximate match. However, this did not prove to offer consistent improvements over GenCompress-1, according to the reported results [Chen et al. 2001], leaving open the usefulness of considering insertions and deletions in addition to substitutions in their scheme. In both GenCompress-1 and GenCompress-2, along with the pointer to the matched subsequence (including offset and length), the encoder has to provide codes for the edit operations involved in the approximate match, in order that compression would be lossless. Establishing codes for the edit operations was an empirical process, and as such, subjected to overtraining with respect to the test data that was used.
An earlier substitutional method, Cfact [Rivals et al. 1995], used a two-pass coding strategy: in the first pass the exact matches worth providing a gain in compression were detected in the overall file, and in the second pass the encoding was performed. As such, it can be thought of as a precursor to one of the best DNA compression algorithms, DNACompress, introduced by Chen, Li, Ma, and Tromp [Chen et al. 2002], which also employs a two-pass strategy, and is again based on a substitutional (Lempel-Ziv style) compression scheme. In the first pass a specialized program called PatternHunter is used as a preprocessor for finding significant approximate repetitions. The second pass then encodes these by a pointer to their previous occurrence, and it also encodes the edit operations to correct the approximations of the match, while the rest of the sequences are coded using an order-2 context arithmetic coder.

The success of DNACompress shows the importance of the method used for finding significant approximate matches, which in one way or another should be part of a substitutional compression algorithm. We briefly mention a few major methods for aligning sequences, because they are well related to the approximate matching problem, and furthermore because their principle of search is worth exploiting in a compression algorithm for the approximate match search task.
Locating in a database the sequences that are significantly well aligned with a given sequence was a task first solved in the seventies by exact computing methods based on dynamic programming, but the methods were slow for searching in large databases in a practical time. Wilbur and Lipman developed several algorithms for aligning two nucleotide sequences, such as in Wilbur and Lipman [1983], which was based on matching fixed-length seeds, that is, consecutive strings of nucleotides. This approach became popular and it was later refined in further algorithms with the same objective, such as the FASTA algorithm by Pearson and Lipman [1988], and the family of BLAST programs [Altschul et al. 1990], which became very popular ways of comparing one sequence against all the sequences of a database. A more recent search algorithm attaining significantly faster speed than BLAST programs at the same level of sensitivity is PatternHunter by Ma, Tromp and Li [Ma et al. 2002]. In contrast to earlier algorithms, it uses strings of nonconsecutive symbols as seeds for search.
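To make the seed idea concrete, the sketch below (a hypothetical illustration, not PatternHunter's implementation) indexes a target sequence under a seed pattern: a pattern of all 1s gives a contiguous fixed-length seed as in Wilbur and Lipman [1983], while a pattern containing 0s gives a spaced seed of nonconsecutive symbols in the style of PatternHunter:

```python
# Illustrative sketch of seed-based candidate search. A seed pattern is a
# 0/1 string; only positions marked 1 contribute to the lookup key.
from collections import defaultdict

def seed_index(seq: str, pattern: str) -> dict:
    """Map each sampled seed key to the positions where it occurs in seq."""
    idx = defaultdict(list)
    span = len(pattern)
    for i in range(len(seq) - span + 1):
        key = "".join(seq[i + j] for j in range(span) if pattern[j] == "1")
        idx[key].append(i)
    return idx

def candidate_hits(query: str, target_index: dict, pattern: str):
    """(query_pos, target_pos) pairs sharing a seed key; candidates only,
    to be verified and extended by a subsequent matching step."""
    hits = set()
    span = len(pattern)
    for i in range(len(query) - span + 1):
        key = "".join(query[i + j] for j in range(span) if pattern[j] == "1")
        for p in target_index.get(key, []):
            hits.add((i, p))
    return hits
```

Seed hits only localize promising regions; the surrounding match still has to be verified, which is the expensive extension step discussed next.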
All of the mentioned methods are well optimized for the task of finding alignment scores for sequences of arbitrary (not necessarily the same) lengths, and they operate in two steps: first locating a potential good match using a seed, then subsequently trying to extend the seed sequence at both sides and evaluating after each extension the matching (or alignment, if gaps are included) score, to get an approximate match as long as possible. This second part makes the process even more computationally expensive. In our compression program the approximate search task is more particular: it needs only to compare blocks of the same size, and although we use the same principle of “seed”-based acceleration search, the subsequent operations are less expensive than those used by the BLAST family of programs, and we optimize them to get a very fast encoding algorithm. At the decoder side, the search process is not needed, and hence the decoding time is much faster than the encoding time, which is true for all substitutional compression methods.
Several other methods that do not use the substitution principle have also been proposed for DNA compression. Some of them only offer estimates of the achievable compression, or estimates of the entropy [Loewenstern and Yianilos 1997; Lanctot et al. 2000]; others are full compression programs. From the latter category we mention a combination of the context tree weighting (CTW) method with Lempel-Ziv coding called CTW+LZ [Matsumoto et al. 2000], which achieves very good compression results but is very slow, and a Burrows-Wheeler Transform-based compressor [Adjeroh et al. 2002], which attains only modest compression results. Though several fundamentally different algorithms have also been proposed in the 10 years following the introduction of Biocompress, Lempel-Ziv style algorithms have always represented the state-of-the-art in DNA compression.
A departure from Lempel-Ziv style algorithms is the method NMLComp presented in Tabus et al. [2003a], which replaces the Lempel-Ziv method by encoding with the Normalized Maximum Likelihood model for discrete regression [Tabus et al. 2003b]. In Tabus et al. [2003a] it was demonstrated that the NML model is well suited for encoding approximate matches based on replacement operations, and also that it can be implemented with very low algorithmic and computational requirements.
In our present article we introduce refinements and algorithmic improvements for the method introduced in Tabus et al. [2003a], directed towards speeding up the execution process and obtaining better compression results. The kernel of the method was presented in Tabus et al. [2003a] and it consists of block parsing the DNA sequence and using one of three options for encoding: the first uses a reference to a previously encoded DNA segment for conditionally encoding the current block, by means of a simple NML model, the second uses an order-1 context coder, and the third uses “in clear” representation of the current block. The present article refines the kernel scheme, proposing improvements in the way the reference to the previous match is encoded, how the very expensive search for the previous match can be accelerated by constraining the matches, with just a small price in the coding performance, and how the new match constraints can be used to derive a more efficient NML model, which is as easy to handle algorithmically as was the one in Tabus et al. [2003a]. Additionally, more adaptive policies are introduced, such as the selection of the block size used in parsing, and scalable forgetting factors for the memory model. As a result of these and other algorithmic improvements, our program achieves the best compression ratios published to date for a de facto standard test set, with very fast execution.
1.2 Organization of the Article
In Section 2 we start by reviewing the three encoders that compete for encoding a block of data, and present all components of Tabus et al. [2003a] which are necessary to make this paper self-contained. The NML encoder principle is discussed and some important issues are elaborated in more detail than in Tabus et al. [2003a], for example, the analysis of the threshold for deciding which strings to include in the normalization sum, in Sections 2.1.3 and 2.1.4. The selection of the block size becomes a more refined process, being intertwined with decisions over other important parameters, for example, forgetting memory, in an overall algorithm, which is presented in Section 2.3. Section 3 contains the other major algorithmic novelties in the article, first addressing the acceleration of the approximate match search, and then considering improvements in the performance, which can be achieved by exploiting the particular search algorithm used. A new version of NML encoding is derived that uses an additional constraint in the block model, namely that in each approximate match found, a contiguous run of matched characters that has at least a (given) fixed length is guaranteed to exist. The algorithmic improvements are assessed over artificial random sequences (where expected values can be computed) and over a benchmark of DNA sequences. In Section 4 we compare our results with previous DNA compression programs over some DNA sequences, then evaluate various options in our algorithm by tuning it to different constraints, and finally we present the results of our program when compressing all of the 3 billion bases of human chromosomes, revealing an average compressibility of 1.66 bits/base for the well-specified bases.
2. ALGORITHM DESCRIPTION
In this section we present an algorithm with the objective of attaining good compression of DNA sequences while being practically low-cost, especially on the side of the decoder. In order to efficiently process various types of DNA sequences, we build up our encoder from three modules, each with several parameters and targeted to address different types of redundancy in the input sequence.

In Sections 2.1 and 2.2 we present these three algorithms, while the final encoder is described in Section 2.3.
2.1 The NML Model for Discrete Regression
The NML model for discrete regression is a powerful tool to detect hidden regularities in the input data [Tabus et al. 2003b]. Such regularities relevant to DNA compression are the approximate repetitions, by which one sequence can be efficiently represented using knowledge of a sequence that occurred earlier, by replacing, deleting or inserting characters in the earlier one to match the later [Chen et al. 2001]. Our strategy to handle approximate repeats is to focus on blocks of relatively small length and study their encoding based on substitution operations alone (not considering insertions and deletions). Thus our first objective is to match two sequences only by possibly substituting a certain number of bases in the first sequence, and to derive an efficient representation for the operations needed to recover the second sequence from the first. This demand is efficiently accomplished by a simplified version of the NML model, first described in Tabus et al. [2003a]. The idea behind this approach is to split the input sequence into fixed-length blocks, and to find for each block an appropriate regressor sequence in the already encoded—thus known—data to code that block. The regressor sequence should have minimal Hamming distance to the current block, among all possible sequences in the past.

By our independent addressing of the blocks, we can model long approximate repeats well, since the block concatenation can model substitution operations (inside blocks), and insertions and deletions (by shifting the regressor blocks).
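The regressor selection just described can be sketched as a brute-force search (my own illustration; the article accelerates this step with seed-based search in Section 3): among all length-n windows in the already encoded past, pick the one at minimal Hamming distance from the current block.

```python
# Illustrative sketch: exhaustive minimal-Hamming-distance regressor search.
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_regressor(past: str, block: str):
    """Return (start, distance) of the closest length-n window in `past`
    to `block`, or None if `past` is shorter than the block."""
    n = len(block)
    best = None
    for start in range(len(past) - n + 1):
        d = hamming(past[start:start + n], block)
        if best is None or d < best[1]:
            best = (start, d)
    return best
```

Note that candidate windows may start at any position in the past, not only at multiples of the block length, which is what lets shifted regressor blocks mimic insertions and deletions.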
A more complex NML model would enable us to capture and eliminate more sophisticated redundancies from the DNA sequence. However, such a model would require a different choice of the regressor data, which naturally imposes larger costs for finding the best matching sequence, both in terms of speed and memory requirements. Since in practical applications the data to be processed can have enormous sizes (a typical chromosome of a eukaryote organism varies in the range of several tens up to several hundreds of millions of base pairs), the advantage of using complex NML models, which may offer only slightly better compression rates but at the same time increase computational requirements considerably, is not yet clear. For this reason, in the following we base our algorithm on the simpler NML variant first described in Tabus et al. [2003a]. In the present section we give an overview of this algorithm, as well as improvements that have not been discussed in Tabus et al. [2003a], and a suitable way of integrating them into an efficient implementation.
2.1.1 The Probability Model. From the statistical point of view, DNA sequences are messages $s = s_0, s_1, s_2, \ldots, s_{N_s-1}$ emitted by a source with alphabet size of $M = 4$ symbols [Grumbach and Tahi 1993]. In biological literature these symbols, called bases, are noted by the letters $A$, $C$, $G$ and $T$, but for convenience we associate to them the numbers 0, 1, 2 and 3 in respective order; thus $s_k \in \{0,1,2,3\}$. In our method we split the sequence $s$ into successive, nonoverlapping blocks of length $n$, where one generic block $s_{\ell n}, \ldots, s_{\ell n+n-1}$ will be referred to as $y^n = y_0, \ldots, y_{n-1}$. For convenience, we drop the reference to the running index $\ell$ from the block notation, although we always consider $y^n$ to be the “current” block, with $y_0 = s_{\ell n}, \ldots, y_{n-1} = s_{\ell n+n-1}$ for some integer $\ell$. Each block $y^n$ is encoded with the help of some regressor block $s_k, \ldots, s_{k+n-1}$, written as $x^n = x_0, \ldots, x_{n-1}$, in the already encoded stream, which does not necessarily start at a position that is a multiple of $n$. We denote by $\theta$ the probability of the event that the symbol at any position in the current block matches the corresponding symbol in the regressor block. If the symbols do not match, there is an additional choice of 1 in $M-1$ to specify the correct base $y_i$, and thus the probability model is

$$P(y_i \mid x_i; \theta) = \begin{cases} \theta & \text{if } y_i = x_i \\ \psi & \text{if } y_i \neq x_i \end{cases}, \qquad (1)$$
where $\psi = \frac{1-\theta}{M-1}$. From now on we assume that the symbols within the sequence are independent—though this is not true for real DNA sequences, it does provide us a simple model, which turns out to be very efficient for our purposes. The probability model (1) extended to blocks of $n$ symbols is then

$$P(y^n \mid x^n; \theta) = \theta^{\sum_{i=0}^{n-1}\chi(y_i = x_i)}\, \psi^{\sum_{i=0}^{n-1}\chi(y_i \neq x_i)} = \theta^{n_m}\psi^{n-n_m}, \qquad (2)$$

where $\chi(\text{condition})$ is 1 if condition is true; 0 otherwise. The number of matching bases in blocks $x^n$ and $y^n$ is indicated by $n_m$. Consequently, the maximum likelihood estimate of the parameter $\theta$ is $\hat{\theta}(y^n, x^n) = \frac{n_m}{n}$, and the maximized likelihood is

$$P(y^n \mid x^n; \hat{\theta}(y^n, x^n)) = \left(\frac{n_m}{n}\right)^{n_m}\left(\frac{n-n_m}{n(M-1)}\right)^{n-n_m}. \qquad (3)$$
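As a numerical check of Eqs. (2)-(3), the following sketch (my own illustration, with a hypothetical function name) evaluates the maximized likelihood for a block of length n with n_m matching bases:

```python
# Illustrative sketch of Eq. (3): maximized likelihood of a block given its
# regressor, as a function of n (block length) and n_m (matching bases).
def maximized_likelihood(n: int, n_m: int, M: int = 4) -> float:
    """(n_m/n)^n_m * ((n - n_m) / (n(M-1)))^(n - n_m), with 0^0 taken as 1."""
    p = 1.0
    if n_m:
        p *= (n_m / n) ** n_m
    if n - n_m:
        p *= ((n - n_m) / (n * (M - 1))) ** (n - n_m)
    return p
```

A perfect match (n_m = n) yields probability 1, and the likelihood decays rapidly as mismatches accumulate, which is what makes dissimilar regressor blocks unprofitable.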
It turns out that there are many possible blocks that are dissimilar to the extent that there is no gain in using $x^n$ for them as regressor. Collecting only the “good” blocks for which NML is profitable into the set $Y_{x^n}$, we normalize the maximized likelihood to obtain the universal model

$$\hat{P}(y^n \mid x^n) = \frac{P(y^n \mid x^n; \hat{\theta}(y^n, x^n))}{\sum_{z^n \in Y_{x^n}} P(z^n \mid x^n; \hat{\theta}(z^n, x^n))} = \frac{\left(\frac{n_m}{n}\right)^{n_m}\left(\frac{n-n_m}{n(M-1)}\right)^{n-n_m}}{\sum_{m \in \Phi_n} \binom{n}{m}(M-1)^{n-m}\left(\frac{m}{n}\right)^{m}\left(\frac{n-m}{n(M-1)}\right)^{n-m}}, \qquad (4)$$

where the second expression makes use of the fact that the “good” blocks are clearly those with high probability in (3), and since this probability depends only on $n_m$, constraining $Y_{x^n}$ is straightforwardly done by constraining $n_m$ to a set $\Phi_n \subseteq \{0, \ldots, n\}$.
2.1.2 The NML Code Length. From (4) it follows that the code length that is spent on encoding the block $y^n$ knowing the regressor block $x^n$ is

$$L_{\mathrm{NML}}(n_m, \Phi_n) = -\log_2 \hat{P}(y^n \mid x^n). \qquad (5)$$

(The arguments of the code length, $n_m$ and $\Phi_n$, indicate that this length depends only on the number of matching bases and the set to which $n_m$ is constrained.) The block $x^n$ is most conveniently specified by a pointer to its start, and one bit indicating if direct or palindrome matching is used. Since the $\ell$th block $y^n$ starts at position $\ell n$, and we exclude regressor sequences overlapping this block, the pointer to $x^n$ can be encoded with $\log_2(\ell n - n + 1)$ bits. However, to prevent exploding run-time requirements, in practice it is often important to introduce a window $W = \{\ell n - n - w + 1, \ldots, \ell n - n\}$ of size $w$ in which the search for the starting position of the regressor is carried out. This way the pointer can also be sent in $\log_2 w$ bits, so the number of bits needed to completely specify the block $x^n$ is either $1 + \log_2(\ell n - n + 1)$ or $1 + \log_2 w$, whichever is smaller. This means that the overall number of bits the NML algorithm spends on encoding $y^n$ is

$$L_1(n_m, \Phi_n) = L_{\mathrm{NML}}(n_m, \Phi_n) + \log_2 \min\{(\ell-1)n + 1, w\} + 1. \qquad (6)$$