Safe and complete contig assembly via omnitigs ∗ Alexandru I. Tomescu1 and Paul Medvedev2,3,4 1Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki P.O. Box 68, FI-00014, Helsinki, Finland E-mail: [email protected], Fax: +358 9 876 4314 2Department of Computer Science and Engineering, The Pennsylvania State University, USA 3Department of Biochemistry and Molecular Biology, The Pennsylvania State University, USA 6 1 4Genome Sciences Institute of the Huck, The Pennsylvania State University, USA 0 2 g Abstract. Contigassemblyisthefirststagethatmostassemblerssolvewhenreconstructingagenome u from a set of reads. Its output consists of contigs – a set of strings that are promised to appear in any A genomethatcouldhavegeneratedthereads.Fromtheintroductionofcontigs20yearsago,assemblers 6 have tried to obtain longer and longer contigs, but the following question remains: given a genome 1 graph G (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported fromGascontigs?Inthispaperweanswerthisquestionusingamodelwherethegenomeisacircular ] covering walk. We also give a polynomial time algorithm to find such strings, which we call omnitigs. M Our experiments show that omnitigs are 66% to 82% longer on average than the popular unitigs, and Q 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Keywords:genomeassembly,contigassembly,safeandcompletealgorithm,graphalgorithm,omnitig . o i b - q [ 2 v 2 3 9 2 0 . 1 0 6 1 : v i X r a ∗A preliminary version of this paper appeared in RECOMB 2016 1 Introduction The genome assembly problem is to reconstruct the sequence of a genome using reads from a sequencing experiment. It is one of the oldest bioinformatics problems; nevertheless, recent projects such as the Genome 10K have underscored the need to further improve assemblers [9]. Current algorithms face numerous practical challenges, including scalability, integration of new data types (e.g. PacBio), and representation of multiple alleles. While handling these challenges is extremely important, assemblers do not produce optimal results even in very simple and idealized scenarios. To address this, several papers have developed better theoretical underpinnings [10,26,22,37,23,40], often resulting in improved assemblers in practice [32,42,38,1]. In most theoretical studies, the assembly problem is formulated as finding one genomic recon- struction, i.e. a single string that represents the sequence of the genome. However, the presence of repeats means that a unique genomic reconstruction usually does not exist. In practice, assem- blers instead output several strings, called contigs, that are “promised” to occur in the genome. We refer to this restatement of the genome assembly problem as contig assembly. Contigs can then be used to answer biological questions (e.g. about gene content) or perform comparative ge- nomic analysis. When mate pairs are available, contigs can be fed to later assembly stages, such as scaffolding [34,2,20] and then gap filling [35,3]. Assemblers implement different strategies for finding contigs. The common strategy is to find unitigs,anideathatcanbetracedbackto1995[15].Unitigshavethedesiredpropertythattheycan bemathematicallyproventooccurinall possiblegenomicreconstructions,underclearassumptions on what “genomic reconstruction” means. We will refer to strings that satisfy such a property as beingsafe (Definition3),andwillsaythatacontigassemblyalgorithmissafeifitoutputsonlysafe strings. Though most assemblers have a safe strategy at their core, they also incorporate heuristics to handle erroneous data and extend contig length (e.g. bubble popping, tip removal, and path disambiguation). Properties of such heuristics, however, are difficult to prove, and this paper will focus on core algorithms that are safe. Whiletheunitigalgorithmissafe,itdoesnotidentifyallpossiblesafestrings(seeFigure2).An improved safe algorithm was used in the EULER assembler [32], and further improvements were suggested based on iteratively simplifying the graph used for assembly [32,21,12,17]. However, we will show that these algorithms still do not always output all the safe strings. In fact, since the initial consideration of contig assembly 20 years ago, the fundamental question of finding all the safe strings of a graph remains poorly studied. In this paper, we answer this question by giving a polynomial-time algorithm for outputting all the safe strings in the common genome graph models (de Bruijn and string graphs) when the genomeisacircularcoveringwalk(Section6).Thekeyingredientforthisresultisagraph-theoretic characterizationofthewalksthatcorrespondtosafestrings(Section5).Wecallsuchwalksomnitigs and our algorithm the omnitig algorithm. In our experiments on de Bruijn graphs built from data simulated according to our assumptions, maximal omnitigs are on average 66% to 82% longer than maximal unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Our results are naturally limited to the context of our model and its assumptions. Intuitively, we assume that (i) the sequenced genome is circular, (ii) there are no gaps in coverage, and (iii) there are no errors in the reads. A mathematically precise definition of our model will be presented in Section 4. We argue that such a model is necessary if we want to prove even the simplest results aboutunitigs(Section4).Similartopreviousstudies,wealsodonotdealwithmultiplechromosomes or the double-strandedness of DNA and assume the genome is represented by a covering walk. As 2 with previous papers that developed better theoretical underpinnings [31,10,26,23], it is necessary to prove results in a somewhat idealized setting. While this paper falls short of analyzing real data, we believe that omnitigs can be incorporated into practical genome analysis and assembly tools – similar to the way that error-free studies of de Bruijn [31] and paired de Bruijn graphs [23] became the basis of practical assemblers [32,40,1]. 2 Related work The number of related assembly papers is vast, and we refer the reader to some surveys [24,28]. For an empirical evaluation of the correctness of several state-of-the-art assemblers, see [36]. Here, we discuss work on the theoretical underpinnings of assembly. There are many formulations of the genome assembly problem. One of the first asks to recon- struct the genome as a shortest superstring of the reads [30,16,15]. Later formulations referred to a graph built from the reads, such as a de Bruijn graph [10,32] or a string graph [26,38]. In an (edge- centric) de Bruijn graph, the reconstructed genome can be modeled as a circular walk covering every edge exactly once—Eulerian [32]—or at least once—a Chinese Postman tour [21,22,27,13]. In a string graph, the reconstructed genome can be modeled as a circular walk covering every node exactly once—Hamiltonian—[11,29], or at least once [27]. These models have also been considered in their weighted versions [21,27,29], or augmented to include other information, such as mate- pairs [33,23,14]. Each such notion of genomic reconstruction brought along questions concerning its validity. For example, under which conditions on the sequencing data (e.g., coverage, read length, error rate) is there at least one reconstruction [19,25], or exactly one reconstruction [5,18,32]. If there are many possible reconstructions, then what is their number [17,8] and in which aspects one isdifferentfromallothers[8].Incontrasttotheframeworkofthispaper,mostoftheseformulations deal with finding a single genomic reconstruction as opposed to a set of safe strings (i.e. contigs). There are a few notable exceptions. In [4], Boisvert and colleagues also define the assembly problem in terms of finding contigs, rather than a single reconstruction. Nagarajan and Pop [27] observe that Waterman’s characterization [41] of the graphs with a unique Eulerian tour leads to a simplealgorithmforfindingallsafestringswhenagenomicreconstructionisanEuleriantour.They also suggest an approach for finding all the safe strings when a genomic reconstruction is a Chinese Postman tour [27]. We note, however, that in the Eulerian model, the exact copy count of each edge should be known in advance, while in the Chinese Postman model (minimizing the length of the genomic reconstruction), the solution will over-collapse all tandem repeats. Furthermore, these approaches have not been implemented and hence their effectiveness is unknown. In practice, the most commonly employed safe strings are the ones spelled by maximal unitigs, where unitigs are paths whose internal nodes have in- and out-degree one. Figure 2 shows an example of the output of the unitig algorithm, and also illustrates that it does not identify all safe strings. The EULER assembler [32] takes unitigs a step further and identifies strings spelled by paths whose internal nodes have out-degree equal to one (with no constraint on their in-degree). It can be shown that such strings are also safe. However, the most complete characterization of safe strings that we found is given by the Y-to-V algorithm [22,12,17]. Consider a node v with exactly one in-neighbor u and more than one out-neighbors w ,...,w . The Y-to-V reduction applied to v 1 d removes v and its incident edges from the graph and adds nodes v ,...,v with edges from u to 1 d v and from v to w , for all 1 (cid:54) i (cid:54) d. The Y-to-V reduction is defined symmetrically for nodes i i i with out-degree exactly one and in-degree greater than one. Figure 1 illustrates the definition. The 3 w1 v1 w1 w1 w1 v1 u v w2 u v2 w2 w2 v u w2 v2 u = = . ⇒ . . . ⇒ . . . . . . . . . . . . . . wd vd wd wd wd vd (a) (b) Fig.1.TheY-to-Vreductionappliedtonodev.InFig.1(b)vhasin-degreeexactlyone;inFig.1(b)vhasout-degree exactly one. Y-to-V algorithm proceeds by repeatedly applying Y-to-V reductions, in arbitrary order, for as long as possible. The algorithm then outputs the strings spelled by the maximal unitigs in the final graph (see Figure 2d for an example). The Y-to-V algorithm can also be shown to be safe, but, as we will show in Figure 2, it does not always output all the safe strings. We are not aware of any study that compares the merits of Y-to-V contigs to unitigs, and we therefore perform this analysis in Section 8. 3 Basic definitions Given a string x and an index 1 (cid:54) i (cid:54) x , we define pre(x,i) and suf(x,i) as its length i prefix | | and suffix, respectively. If x and y are two strings, and suf(x,k) = pre(y,k) for some k (cid:54) x 1, | |− then we define x ky as x[1.. x k] concatenated with y. This captures the notion of merging two ⊕ | |− overlapping strings. A k-mer of x is a substring of length k. Let R be a set of strings, which we equivalently refer to as reads. The node-centric de Bruijn graph built on R, denoted DBk (R), is nc the graph whose set of nodes is the set of all k-mers of R, in which there is an edge from a node x to a node y iff suf(x,k 1) = pre(y,k 1) [7]. The edge-centric de Bruijn graph built on R, denoted − − DBk (R) is defined similarly to DBk (R), with the difference that there is an edge from x to y iff ec nc suf(x,k 1) = pre(y,k 1) and x k 1y is a substring of some string in R [10]. The weight of the − − − ⊕ edges of DBk (R) and DBk (R) is k 1. nc ec − Let G be a graph, possibly with parallel edges and self-loops. The number of nodes and edges in a graph are denoted by n and m, respectively. We use N (v) to denote the set of in- − neighbors and N+(v) to denote the set of out-neighbors of a node v. A walk w is a sequence (v ,e ,v ,e ,...,v ,e ,v ) where v ,...,v are nodes, and each e is an edge from v to v , 0 0 1 1 t t t+1 0 t+1 i i i+1 and t (cid:62) 1. Its length is its number of edges, namely t+1. A path is a walk where the nodes are − all distinct, except possibly the first and last nodes may be the same, in which case it will also be called a cycle. Walks and paths of length at least one are called proper. A walk whose first and last nodes coincide is called circular walk. A path (walk) with first node u and last node v will be called a path (walk) from u to v, and denoted as u-v path (walk). A walk is called node-covering if it passes through each node of G, and edge-covering if it passes through each edge of G. The notions of prefix and subwalk are defined for walks in the natural way, e.g., by interpreting a walk to be a string made up by concatenating its edges. In particular, we say that a walk w is a subwalk of a 1 circular walk w if w interpreted as string is a substring of w interpreted as circular string. In 2 1 2 this paper we allow strings and walks to have overlapping extremities when viewed as substrings of 4 a circular string, i.e., when aligned to a circular string (see e.g. the two omnitigs from Figure 2(f) which have an overlapping tail and head). Let (cid:96) be a function labeling the nodes of G and let c be a function giving weights to the edges (intuitively, c should represent the length of overlaps). One can apply the notion of string spelled by a walk w = (v ,e ,v ,e ,...,v ,e ,v ) by defining the string spelled by w as spell(w) = 0 0 1 1 t t t+1 (cid:96)(v ) c(e0)(cid:96)(v ) c(e1) c(et)(cid:96)(v ). When the walk w is circular (thus v = v ), then spell(w) 0 1 t+1 t+1 0 ⊕ ⊕ ···⊕ will be interpreted as the circular string obtained by overlapping the strings (cid:96)(v ) and (cid:96)(v ). 0 t+1 4 Problem formulation There are various theoretical approaches to formulating the assembly problem. Here, we adopt a model that captures the most popular ones: the node-centric de Bruijn graph, the edge-centric de Bruijn graph, and the string graph [26]. We generalize these using the notion of a genome graph: Definition 1 (Genome graph). A graph G with edge-weights given by c and node-labels is a genome graph if and only if (1) for every edge e = (u,v), suf(u,c(e)) = pre(v,c(e)), and (2) for any two walks w and w , w is a subwalk of w if and only if spell(w ) is a substring of spell(w ). 1 2 1 2 1 2 Both node- and edge-centric de Bruijn graphs are genome graphs, directly by their definition. Similarly, the interested reader can verify that string graphs, as commonly defined in [26,27,22,37], are genome graphs. Intuitively, the first condition states that the edge-weights represent the length ofoverlapsbetweenstrings,whilethesecondconditionprohibitsacertainredundancyinthegraph. Itcanbebrokenif,forexample,therearenodeswithduplicatelabels,orifsomelabelsaresubstrings of others. Or, for strings graphs, it can be broken if transitive edges are not removed from the graph [26]. We now augment a genome graph with a rule defining a “genomic reconstruction.” Definition 2 (Graph model). A graph model is defined by G An algorithm that transforms a set of reads R into a genome graph, denoted by (R). • G A rule that determines whether a walk in (R) is a genomic reconstruction. • G Intuitively, a genomic reconstruction spells a genome that could have generated the observed set of reads R. In this paper, we consider two graph models. In the edge-centric model, a genomic reconstruction is a circular edge-covering walk; its underlying genome graph can be e.g. an edge- centric de Bruijn graph. In the node-centric model, a genomic reconstruction is a circular node- covering walk; its underlying genome graph can be a node-centric de Bruijn graph or a string graph. As mentioned in the introduction, we assume, without always explicitly stating it onwards, that (R) contains at least one genomic reconstruction, and for technical reasons—see the proof of G Lemma 1—that (R) is always different from a single cycle. In fact, in this latter case the assembly G problem is trivial. We now define the strings that belong to all genomic reconstructions. Definition 3 (Safe string). Given a set of reads R and a graph model , a string s is said to be G a safe string for (R) if for every genomic reconstruction w of (R), s is a substring of spell(w). G G In particular, for a node-centric (respectively, edge-centric) graph model , a string s is safe if G for every circular node-covering (respectively, edge-covering) walk w, s is a substring of spell(w). It also follows from the definitions (again assuming no gaps in coverage and no errors in the reads) 5 TA GT TA GT TA GT TA GT TA GT TA GT CG TA GT TA GTAC CG AC CG AC CG AC CG AC CG CG ACCG CGGATC TCAG AC AC ACCG CG ACCG ACCG CGCCGGCG GGTGGCGGTGAAAGGCATAACCCAGTCA AC AC CG GT TA AC GATA GGGT GATA GGGT CG CG CGGG GTGATAACACACAC CGCG GTGG TAGA ACAC CG GG GA AC AC CG GT TA AC G(Aa) GraGpAhGGG (bG)GTraGnAsformeGdAGgGraphGGGT (c) MaximalunitigsofG A(Cd) MCaGAxiCAmCGaGlCuCGnGiGtAigGGsGToAfCGGTATA AACC (a)TGArap(ah)GGGrTa(pbh)GGATran(sbfT)oArGTmGreadnsgforarGmpThCeGdGGAgTra(pch) MGGGTax(imcA)aCMluanxCiitGmigasloufnGitigsofG (d) Max(idm)AaMlCuanxCiitmGigasloGufGnGitTigGsAofAGCT AC CG(a)GraphGAC(b)TransformedgraphGT (c)MaximalunitigsofG (d)MaximalunitigsofGT CG GT TA AC TTAA GGTT TTAA GGTATTTTCATAAA CGGGGTGTTT TATTACTAAA GCGGTGGTCTTG ACCGCGGG GA AC AC CG GT TA AC CG GT TA AC A AACC CCGG AACC CCCC(GGaGAGGAA)ACA(CCaGGC)ArGaprhapGCGhCCGGCGGGGGACCG(ACCbCGGCGG)(bT)CGGrCGGaTGTGGTGnAGrsaAACAfAnoCCGTGCTGsrAAAfmAAoremdAAeAAgCCCGCCdCCrCCCCGGGGaCCgpGGGGGrGGhGapGhTG(Tc)A(AcMCC)ACCCaMACCCGGACCGxACCCGGCGGiCaCGGCmxGGaiCGGGmlGTGCGGCGGGCGGuaGTGGTGGGTGGnlTTiutTGGniAAgTGATGTGistAAAATAATigoAAsfAAAGoCCCAAAAAAfCCCCACCAGCCGGGG (dAGAAA)CCCAA(AGdGAMCCC)CGaCMCCxGGGCCCiCamCGGAxGCGGaiAmlGGGGuTaTGGGGnlGTTTiCuGtCniTTGgGiAstAATGTTiGgoAAAsAfAGAAoCfCCAATAGACCCCT GGAA GGGG GGAA GGGGGGGAGAAA GGGGGGGG GGGAGAAA GGGGGGGG AACC CCGG GGGG GGAA AACC CC ACAAACACCCCCGTCCCGGGTGGGGGGGGTTGAGGGAAAACAAACCC A ((aa))GGrraapphhGG ((bb))TTrraannssffoorrmmeeddggr(raa((a(apa)pa))h)hGGGGGrGarrTraaTpapphp(h(hchcG))GGGMCMG(aab(x(x(bb)ibim))m)TGCaTTTraTGlarlrranuauannsnnnfTssiGositffAtfrooiTiogmrgrrsmmsmeodAeoeTefddfCdgAGGrgggarrrpaaCaAphppGChhhGGGTGGTTCT(GGc(((()(ccdcd))M)))GGMMMMaAMGxaaaaaixxmxxxiiiiAmmiGammmClAaaaaaulllllnuuuuuiCnnAtnnnGiiiCiitgtitttiisiiggigggssosssfCoooooGGfffffGGGGGTT G(d(((A)ddd))M)MMaMAxaaaixxmxiiimmamAlCaaaulllnuuuinntniiigitttsiiigggosssfoooGfffTGGGTTT CG GT TA CAGC GTCGTAGGACGACGACGG CGGA AC CG CG GCGG GGAG AGCA CAGC GCTG TGAT ATCA CAGC CG A CG GG GA CAGC GGCGGAGTACTACGACGT CGTA AC CG G C G G (e) Ma(xei)mMalax(oeim)mnMailtaioxgimsmnoafiltGoigmsnoitfigGsofG (f) Th(ef)or(Tifg)hienTaohlreiggoerinnigoaimnlagelegnaenonmdomeitesanacndodnittiitsgscscoonnttiiggss Fig.2.FTigh.e1o.FuTitghp.eu1t.ouTotFfhpeiutghto.e1uo.tftpTthuhhrteee(eoetof)huarttMlephgeueaotxartoilihmgtfrhoetarhmelietoashtmlmhgoronnserieGtioGttianghhlsmgetohsreeiAdtoAhengmdetgs-hecoe-eCCncneettdnhrgteireci-ec(cddfedg)eneetT-BcrBhiecrnreutudroiiiejjrcnnigBdiGggernruraBaGailpjrpnGGguGhhCeijgnGnrGoaAgmpfrrAAfahAeorpmoaGhmnG(Cdfar(CCi)oCftaGrT,mso)bm,cu(obai(nla)utt,)ii,fglbrtsbouumfiilrltottmffhrroeommtchirtethchuecelicarcircrircucululaalrarr sretrdiuncgtFsrietiosrirdngetinunrd.ic(gu1nitfsc.gi)iotn.aTisrtniopwetnEh(nrdipoeiesaun()liemcisac.oge)htpua.EdaiiptopxnaEpcnltpiicoumsrtaeo(lhwetintdicesrdeaohn)itCudconlat.iomocfogGgpuctEtndotapnionitoaehixnnliiosnitcneigtemGCidh(iogdidgetGeTsiadsh)scrl;ta.eCiraaoospu(deGEnwnCdpnerTntdaGdal)ianiiotcAragnweihtaadgldtdhgnnawsehiceo;stdstnoeoCAra(hmndidGtsaCntretah)rhaienoasagaertxsmdnwenhuieairasdmeCssnenlGurCGatGsmotadclGriuathannlrcaislaarenotogxnwctoignimaGrnhdmnngenoGgegnsrtatnrauthialhgaeatshelpodrriteptegacmaGhrgihhsnepneeAwwnosgh-oGGuincahhwtflgeriGTtTeegcGhrniteAsneaThetil;oilgGCsproeGesniiiiflhgcsninsswnGhrtiGhasdnhto(hC;hphoeewe(TeiGhoi()wfnesnBwwe.)ilG)n;ts.phrn(h.iTu;nebasn;i(e(sihor)ncjnilbtsnop)(oiom)tiwecasnh)guhrtmmotne.trholh(rGi;aaoweceo(eemrp)nGuGbrGrn.reChlo)meeam;ae(xxrrdnbmGtareaaiouh)eemxdmxoxcedimfturaimtarhpumecAimomoleoltacrmernpaiuealt,emoAlslnixdenrtuo(a,iiehausaunmtnxdtreci)Tiinhuasgemytta,ecrsiiilabaogtGyetaaiipnsolruurogGaGaGefepsinnurspollestinGaaafiptcioralaipftl;saGgerirlifpocsgeobs(o;aasmlGlcacipebo)(ccpiofplc;riaafpertct)lchGnlbi(huGiuacebcdtlel;aln;aheaa)bcdbGr(er(YialclcrteeeeTG)-)nYchddtauadoTtgeth-gnnhlt-heeaadVoe-dGhe-Ysrc-caGoVYTGo-YsvTtv-eTt-oehrtohir-o-anhVaiV-ngsasVgs teilwdlugosetm-rctwaaowtatwvxoweleaiksmormlskitmnasaohlxgfaaowiGuxftmwainG,lmtaakihal,tlsakneiualgodsnnYufsdio,ontG-wCCfniistogt,aGGehoGnislCga-k;oecV,snsGwa(;cdadnoGGan(anf)TGdonblGdtG)nigenhb,eGotoeeoharTG(cnbniemdaoAAtedtnbh)mGaa(;tcomixeAbanana()iAAiexemenenddMiCC)ocemaboAadbflatnaerCethxooflsaCCibrmemoimooennGGbmnmmeaCoottidlnbhttGaatitioeigxGGfhatCanCmrisiieoGTlogenmwtnGomsdeohifadTattoeyfhGitlGTfrfrghAserAooAAG;sobremomTmiy;nbouAinytatntttAAhhiphhactCCeTetTiuiierhAsrgcotciiCbtsopsruhyatCCcaloephluarGGralfatrleCaibrsGrcctpyrGaiuirbe;cfpalcer(uayieumfcnrl)rsaliamuarterTtcrrtxcuuheiahapniltxitarmeaseigarcotrompsupnimpr,oll.aepieCnaubgr,rlm.rtiCCeteCnta,icuhpatctateiluehaoyurgtlAneasmieyao.eAAnrArnuaoe.ietrmtxaeaslTeatsaoimoaTTloTusnncoptdip.lrcecuiNit,urtscoltauc(htrlodeeane)yrttdhideggadaoesrgt-eecesto-ahcvnloeisosvrotieenrccxgiinoargncmutaplailner CCCCGGGG GGGGTGTG thTGTGeAAAAstAarAAAAisAaFnCCCCssngiusgFssonmut.tiCCoro1CCmgeipGGtfn.AaoGG.etpgo(1suTietsn.ootnhi)iuGGnpnnToGG.eoamGTusnGTht(osaeteespusa)ssutt.roosaAaoCCGTpiGTeomuuCCnsuCCEroGGAAuCCAAtnsemGGtGGnnatnppuGGplacosuytemonpuhsittcfGGAAtoeesatAApsGGeGGiTGCCcotsucGGnrCCaotshooTGTGfeemTGisofsnnenoste:antshsnnptTGCCirla(:aCCesghsOyteTGTGyAAeGGsTGtriGGr(c)AAtAAsOoarieyuseAAhesutieranM)rnsrouemftdAAessaMegaromer:AAaAAlnCCppAAxsgasarCCCCtmea.irwoCCopOtixymcloorrLgdnrieoiivmunaCCtoonsteedlheCCCCsrrGGsavoaglCCti:emaosGGGGtelsGGmemlrmhpw.soOavymemrmLnGGonaeouvo=itGGsGGGTdeknnntGGarevoatGTiGTeeiokGTngrettm(stlnpwcshievhgGTreesmtosst0oeo=hvGTAAGTodhGT,nevsaeAAeAAmeedesoAAekn(ltig0mevsmeheeemAA,d0iste-vvmeCCAAgAA,icphAAaem1eeesCCwCClknCCeip-,n0oemcehp,etlsmeCCssterev1ltnphiieiGGsCCCC1,csmeCCtclelt,dorGGvGGiiGGde(iicimtp2escndfme1ii)s)l(t,maeedefiB(pvTrss)bseeaip2reisltm)herTusi)Blud.ceeesidphrmbajsi(udnoutpelbeteimripus)crojipgadiangiotnrrprtstreiiiaphensigousgotaptudeiaieirpnnirnshoalrgmesiesampnttpgudG.lryihepisargmtgnnH:.opxytefGopiiirprtnH:enamomoothorenfmntipeaermeaaorhrsole,tnntaeee.am(ryusndwat,tetaH:.nyd)ghdn(ew,iH:tteaetgidebhhtie)rot-eegs,ucheiaerbu-stoiec,cbeatslsnobeunt,outwecnintftisnwfrotltthterretvinGihieroigegrtteicfemo;visciroggbuaoetd(busdmtclshnsneaghet)eeaiilheortrttBgBttvihrviahcoggirreeetsietrruuuhaciactstiYituclihjmlhjurghnhgln-acaacotmoutorhgrtgrl-irsaVtirsautrhauphcmpchhmhh FsrtwFsrtwwetwaetirdaigordilgounkil.unkm.csg1mcsgt1t.aoii.aooixfnioTxfnnTinGmihG(mhi(e,esiae,es)aal)a.oala.nopuunEpuudEpntdpntapliaopilticuoeitcniuhengitdhegtdsecso;tcco;tofcoao(fona(dtnnndtnthn)thio)beigobtegdtcmtedhetehhii(teehisor(eehosoaCerbc)CmrbdeG)dmtdteSGuetrMaaeerMaaaaoGidomoliaiaxatlnwainxafauunwonlioGidom.asrtwAFsrtwlsdiexmatlvrgnwweteFsrtwtdFsrtwexmtteuugntobdaaisFsrtwirwyretowddetfidap-iimosgotaowahetiartwAailiirttlerdsmundrtabnnauuanhkgniokygworerfrldiafiul.ilp-hisstgnofuleasumranmkinclkiicotsndgueessg.ng.srau1onoksteortlrmuml.lchttocsgonsgtoniu1t1tlhmkoimucaano.aessosGidomwgmiriittg1gmmiaeiotlmrmotottco.taosnmoe.axotnifxnofntiitumkoiToein.shaomonifsontengiohsxfnxifnimatinstoTGsaGTorbgtotixtfnnt-msmuenkibhebTnhithnoai((fuyGninhpoeGtoxfhmfiiima-imonhsnhletheGrbe,,e(snsr(-imooublaanalthioieihnimrit(wgGn)re)o,toeecime,estmyisaiganfanelaclreevt,eaai.oi.soinaoe)ssnmia)liaaomoannasnnolil,)glsao.rypoapu.uiotouetaaoniggrletvotEEatn.svaosoonddgtslhngptuaaupusuoppnnlnntthshaEerEapeulesuntfdcGidomwhadh0kaanepgEupsvhnfttpnsfltliidto0pueaegoosupnftaiuaitteccaopGrbsarpogdus-clitliir0beebGoaepronhaupngnwoitiethhoirtlai0cimttocnsmthaueoeuggtftaddbaeeietechn;odyiglnfiiuheeshofihbpa-broefeu;dsmngeiiststadogbrhhtpdsnnuineetgeegccioetdpynarrhei;;vttbgiseiasmesncctynhnpfvgrdooooieaceosnnfectsooosn;wtr;teteaate((rcplccowei;sb-ta1oonnvtogspfGnitocifmotpvodda-artatnnnis(opGnoa(cf;gothnntocnattctrnh(csnhh0hdeeidt))1utneeottnnpkchTvmt0mehpniiagde(nfooeitntncpTntaaehbbehieggnn)ogg)iiarbtndbi-ehinnc1ttisobie)dodorhoanvemobsnbeggioeeesgtimmorrihheslera)obetc,teetggtdaifdsleesSosriiocaft1p:ten(medaso.bhrnheehupssysotoeftiteisuehtraherieovtdi,iiba(ati(icsCCeSneeph.esveishrslabsrhbbocostLon.)DnrhmemlehOyeeshaerSfddsoeGCGC:avhteaibrrs1rbpetibta)Cmd)ts.moI(lmrrenodebLerdrdteGM(snehG(crmaaao0arCSi.d(etufuttBfeGebabaeba0aiepweSfectmertOvrmMeaeMgiIiwepay)xxaacGoaegeeaermpn)bwnwa)nsnataaruIaaaa.oaefe)1ntuiantiBaxiiSxeknm.tle.aaaanwtfBnwdidnTewnexuxsmenm.alaagraifatnhonT,nwsralnIi;tniSalmicrlxrtntrdd;daidihexmm.(exmnoIlmgnbryrgnwaotfhnof(dpiveemmmrtguaitra(ngohhtinbdfrdiunirgsi.aos=bdaronemhhdeei1fflofmIlnhptRamoitauamoeiSi)mieesniiuosdrrastarrbrja)aetphefgehgtvsjflr=Ifblaonodntxoeetoiahioiavemlhflngtskxraosram.itheesftteenea)(rhrmsrmmaaSaorofemlaotmgretaroegtoe1armlhalhhooirvovetoin-ghesheneeInmraaimgmoauthim,gorsmodnelrt(gmoSdmmmnmi.anomeonmomgross0atfr0vieremsnyeoonvrfrhvmnittmnoreteunreugaminimipirnmg,s.naeasaaedfrepnnhInsthtsenamn0h�ems0ntirceSotplloisulrmuiiliebnreirpftiaasasutl.ietlinhentInhaett,egerfnga.oaocoahcteohtltml�rauie.liDil0arrhmcsifreikorrel,eitSlieatieesgnhsritnig,tgengnnoomtcekctgtilvx,srgioggenoisdIcitxtuaotgodhGt0tigesgnneiksntim.evttgGoooeBotatfesi,vhseneaihgnuigmhohinenhtvh,htnromuonggerssnfant1fgmttnsooookegnI1gsghsvendhatcetGoienevneoei0hcihhifkeerggfofnioGtfshfGroy,rontaettesgraote1grfrytanncc1rseeeeegfhasraiGmrahmio�hfeleGgGisttrcmgrtgotgroli;trdoi;r(aG,girbaoepbeoepstemramnaaenchpheh1nehnoeGomrnaaueaRgheir;m(darin;yndyukhsesbheptbpvra;dann,neeeonbsegspnmmesaenpfke1rdgpewso;giwainyay0hbhpvrgniao(yn-avonatho)GeetG(i,ecaenitasmtnaiiwkaecwyfaehehaitshna2mas�awsan,t1tc-ah+-ahrvrGiGnreplic-daT)ctGlTtirGilecciea)cedi)hrhsogtkgoeicimianlisamgh,2ehniGhseimkseeet,eeptsconhTincccqsTnriperseonemcaiTitser)pieirlhbiinicnnlbiahgci1tcaueicnbTeanssvruistsoespiretmprppepcqiacspsutiortpueociuililimsgeipcosticanu)1becaisllgtasirsraeccsrupnsplniGwlulefsstvpllraairiupcufgp-iciiraohielcairnuiharrlsnieiapacinclncsaiaclsnpoinalecmtelaudrtlaccoctas(lwGnrtsovrklonaio(Ghuaihr(araaicsinusosirfh,arcrmnilneliliptnwecpdawetcdtc(fe(rrgriohlpaa)oernudnpfti;ntoir(s(taaltro;re)itaia()ufnfcsurceantaet,irwewpieirncBclqocunaeeob.eiwdpe.)sonpg)cobrTg(inruloiaphlci)raga()gucuawrg;ucrkm);astiuulaceninBbnptBrtmeseske.lce.s(in(BmsoTeeimTdsehnon.guslpleer)uethnirthraenllrnb;)r;sbaee,asgsrrdw;rinsva(eea(ldnkmaermepiirhicquhuoa(ruod);mrsrruitretu)anneaneabbutjtgtgrucnrgetbtg,mrruishneuentaexorbetxehrhninniuouoei)heo)cermnetSihumosem,rtejais)jtelaeailergdajefarsesedtu-onwtixonntoxohvoieeaha-anudtntxigte.memcgrmnttnrtamnSoathccpmogrtraxetaiaegercpiresogfGrachohrganmmiGoYoohtriotagihgrciYamg,rmiea.t,eaigooinlpw.giigepmvnpitkeeoieernevnagnntrrioriirdameTnemomoap-rrTnroinil-ecionotltpraisca,aecp,.prbt.ntwnne.aseohgeedteprnonongrerhruenhmamnhslruapopgfoampti,hloetlhp,eaeakreihntlt.o.neiedelbabrtse.nrx-hdnlghxdehiie-ldrpnoaralta,elShaGhhietSa.a,Ve,eertaeeohnvVrasei,feeighnnousoheusrgfxexdgmgdsrdee,hmere.xeett.tdGnGnea.ieetfeGcrtegcirlieerdhShuhraufSganinntehmuoemdayftaastsyehHtarlm,nonw.thueecce.esaeinmeylioclh..ffaslaggyrooofyoytataeacatanrratya-tabenmitrrt,hd,iwinim.smaiefoyunoHetl:hluicrhnfsorolohresnaoaeioeesmt-mannesearbtntmonehndekauiueaetiarnirieechu(lirtrhesicnninnann,enasohessaenaanryrt-tdesreethneaakariei(r(esltimrrtde)li-l(iieacnwanaannansina,geaseagattearae,-hdrRnaattiaro,arrroatlsliilsda)dtm)sariel-aiish)ebsessggbaea,eeetlaetasrg,l,cern,wanoostca.rtisrscsseophoiutsuiiipioastitebbabaoiaetesotengilnbartchfpdirifeciprycrcttsisarppokcouuoslaesuinpibouchtocliictieelcgepiGnfkfpGpaiirirducricfsf-dipyueoirtnvisbclvclosarncccplhrmoroefl-dlc-ttul;lGleeGttavb,u;r0BiaeirGu0cuaSmfeoin-alunnvabcci,vaeobrcibubnbfafrg(esalnlerbf;(;rtacta�le0;�rrBlraavaa0mSlocrebalelcsietiluidaroboebsieuuicaebobergr(gt)(iret)r(rureti�vg�krbdmktmtlclcbdermagssylcituavieeteaoseyutcel)eth)f0gjrost)ggthnbvhvnekveddknhenrbdfaheiivafhaebvttavadbaevd1tatt1g0gljdnoet-gaen-bhhvnnv0ahehnh0n0aScgto0bSheeoncgenm�c�eaegdd1bG1blolehdeGee--YoterrY-����eidregdiccbvn-cguvrua�nccT�cakckGG-cTcooGoY-YescovokYevkkeidtribkaGibvtoitaavvpruaurvrrrornT0vvTakkoh-f-T0vGnbvhi-ecv0ecnibveclviTnhtt-ncvst-22npaaeruruaag1ta1,lVruoovvaV1oh1ehlhuGghieir0srghbinllsnhe--nlnrh�b22-�nniaaaav,aaVcV��nVedmaiggasrgsresrdhneikkgspnkkvravvvG0ahynvv22G0,n22, wwaitsAihstouhuonmtnoo5ppentueteotiosofoCnnfiotshnisatla5assyprsraepusataamCnchtfetsehphcetsbsaeirtosebrirsniazeinansciargn:gtytgesaiO.rotaouiuLnznrueapitnomttriioiwftgoovindsge=naeoneolf(tvoefmvets0snpsa,satkeeptfrle0heels,iilennvlssig1ntgso,irgmsamei:n1aesp,goulvisesmmb2us:)sbtptnosbrldmtiiiecnretiisingatngirigaotusesifdnogsSifustp.Smirgo.piptneioratnnys:.etHdhgeaert-ec,tehwneteruiocnbidsteeigrvBaerlugtihojranittghsrmuacphh AAnnootteeoonnaassssuummppttiioonnDss::eOfiOunurroAiAtAmuamiGtnsoopnnosodnuuodobtemettteulees4lipomlomootntn(ninaaofTlakryknaasoeeshssssmsssusaeausumsfroamemoemspmsppgatnetiteteforeiineoiiocninmomennsgmsa:ssppss::nle.alOiicrOOLdcSyuiiteuutr.ttcrraaImowosfsmmsosmptuu=oodhrmmdodepev(leepplvgellmett0eimmit,eonaoevenoakan0esmekks,n.cse.evesosH1Hts,honisseesemoeortr1emmeni,es,g,oiveemwitw2mi)iaepemmcpsbliooerplpseibcsbclletuisisaiccmetlediiarruttaevrvbn,ssaaeesilisstrtuytsstheihmughudeamampnpittpnrtppersisottoao.uugiibnnpoocc.shnenhle.terssdhtH..mygeHHe:e)rl-eetaec.hrrs,eeetanGw,,ttkwweirt-vieemhcoeebnooedsbbrueeassnroBeevifrrtservviSuegeettihjcattnaohhaltgfnaagotstrrrubaeisspcteauuhhhdccvmhhs0,R aoGidomaoGidomtsuutossuuoasstteasbuyttefibp-suyfip-unsmruunmnruinesnitlnespotitltopsotigtosttgtktthfitokhfiorbo-bohrbn-obmnhneomnoalemosalymrevsyarevarearsveaassvraasc0ruatc0e0utfa0eogfabeogobneeonnsenfassnfetasnbettpvrboctpvrrocei1premei1inpeminnseaSrnvsmgesevmggasee1gnaae1asrao,sSsrraco,d.Sryuf.ytytfiy.IAtLinh.lSiLnnhabSt5IwD(otIbeestIne.evsfoeSn.t5IwD(GidomwfoaoGidomeiSgt0etttsnIvaoGidomlaoGidomtaussuuoIsocm.toGidomwtls,trphsiwttfifim.a0easuuhpsuuwtostfolidomw5htststtaeraheeuubobatCarlsbtI,ssyriu5yaifihfifittni-eerttp-aIeuitsbosaib0hsfnioteuspyttuisiyeeu=ufinnnfip-baCsfeop-mbssse=yrrntu,nhfitiep-usvhunnirsnna0mhmogiyisnrvgIwD(entsesruurhtfiunc-sevgispngtselntlretb,uiinnpiieosvhonhtentsenninein(aeiCttoohoithlre1nstelnset(ahpiggvoptoanotisvvitl0mtaettnttotoeoenseslegosdvtttc,vranigneikgktedonghe1thestltfhtfi,et00ogopdtnhrfitoniottke0tok0vaorbrbhrhheofic,-rt-figobevboko,oeimhfhenhtbifom6rb1o�,oeooenmrbebmtfne-cna-cktebrmeboc�ieahh0nerbhecnfd-orooaabmo,l5oemnshemnmnnoiroeba1eaosba,m-nryi0nootobookoaloceahelrwaerve.vem0mifaakoemtsn,a5mlsysvym,relsatanerm,evtrfevrnr.oy,adrteeaalaIwD(wtiv.ire(vetvsviesv1aetihvrsaarnete.svssrvocearaeihwevt,ris.aaarsva1Oisveccc,hiG010ias(aa1usutte,stsv0eGes1fir0i0erplrtaean.asgeGrcsfiifcwaate0so0,etvulOsgoeogatug,zitvc0efi,nbbn0c0ees,ue,mftioaoafehanainci01ennpbofgecogr0aefenoabbtoeueebsaatesodgzovos0snsppomaffnbt,rd5piens10mao,sasaemnoasennaonopo1infonseettnsbsethtftniaf.,bbonirae,iametasptsnntssppnvvafts,rretm,ooceutaustlaasn,zvoios.bbsvsrttaf,tttftptvnhipeeivii(vra11rvrboceppoecciatmmopvtekd1r,i.rtosnnvhrcptahiouraw2sbkeoioei1iOp1ipepsr2vnpsitesmenbrmicp,peieoei1nv,iinpiinvvnemome)pmirrggelsr,vittinnsnes)oeeeinosr1ltneegpdgeaaeeveptvnunmvcqum11gbngemveaapsiiscieqnveim1+mgbfgbtcagsabgsorraaitote,eooeu1,,bg1cdosgvuageamatSpStgreur,5oln1ue,th.oasrash1raoyyn+aem,oot,ffoeayl1Ss.rSr,eeaitntrnoe,nraiileaw..gvsS..ef)oawyynfwiL,raifvuo.y.n,hnahaey1retadyotnSSfii(iitrnasd.inbbn.terLai.aLsynhgtfeI)wf.Inthrr,SeevceltOSdLtsaleicnhebbi,rb.gSi.oe,lcfoifeeilnt,rItgbiIStSvhltenheieenousioiaSeetapIgcvd.geneusso.boifaoeafoessbgmdInIhSftSt.utitoofiteftanettams.hmSon.igttfpwtftaf+rmre.sgsonhegpsIeIghrh,eg,ShhtantsirrIam.edgmo.rurkeepwiatifsotpwtfnrImdk.Ieert-sf1ieeahaIwhtfauiedieteilieieiur,hdnrm.atfofonaraaecsertfs=,bdIrhgetn)ItnrnaiteaiieecieniI-sneieiawsr,avstfrovislgesiforsir-r=sehii=ntofngIggfgtgcatlretasgtt=sinvigcitiipdatiecicesvrsevSrebfherpwsasohehnrgrenneSgns(eghigttdgmtrefaieothgiioaigrattfeeeergeseioivgm+grvemvtei.ehrilnrhnenne(enee(gn,ege.tgtaennrehetfnogeakag(rdiadineisogav0rn0rv0vdavn1rhneresonaeoeirnghgvrnvlvzo,enanewp.erdantadga,t,twe)p.0kenfta00vfi0drri-omineg�omae�na0gv0eacmtvcilrirrhortdnzegeforr,ceb,etftc0nwfhdgtesnrbmc�tdm,nen�othoodfeeoccohecobiramaraw�r0ihmheitekicikeseerrheftshoogieece�gntnocnnmooo-motgncetn,erarirratop0o0rtoiekeiekeeaa:anrtere0cietvyGklvsennsorettmnoetnmmivneyi,ria,rndinnnmalkeehisgrhad,rrolpmiseeahs1tav1t:vt:nnetmt1tet-vovklG:irvianpGGllmaietg.lvhinhageggantsiscars1ihaeatvg1h,lzooaeto1a1tscat1gnncGnsmncGwaee1oithohsaageeeaGsl:ig.eoeia1aith,mht,Wsinftgueidahaslos-n,Glncnoconncsosmrtma-enpnnreagmnp1dnciarrtiietihelciaiiagneoanceaswWtsncanftinsnoitesordlsmes,miaminapps1t1soasetameefkennkpti1eaoainpiaancinikttavsmtglmirhn,wo,pmaGzcmnneaistnsnamcalacc,angnrk)n-dsaknaaopeieir2iu-dstnvasv.amtrilsritmlmnsiatrvenp-ccpaac.atmiaexiak-nkeik).,eprcliahpetriisd)2n2s,ktsrenhetaniaaton2rinosiennalppapgpicqkscqutiiasnie)ih.iep)rlreebxslGakoiieeyob)ieceeiblrsscnWginfngcpeuaedleuGpupcnLecqniucqetembsirsogepntttcqtuiuosir-sotnbcbdeuicstgs(gyeapulwuutabqglucuegeeeueeeusgeuwi6hdLn.wvtyuveastbtoatdoceRtygosetsneotroedornghaosltuanentlhbaecleahenahaelwaewervoevrelaesahewftW)rvflaeadvrfedtrbcdnanevciyassv,rnbi,rlw)raevlaitttivnhalecs,eirr.ataixenkusrroi0ntwBtanoGia0etneSre0c.oBc6a0,orSt,erioadinilctlneti,riatnrisbltsisonngusnsglubsooigcoiaiarragm�grhynnui�Grhoirlta�nigp�rn)ngtotiweiLininunkeeoitkaienslongssuonugcsdstrdures.hon.cghenxgetkngevghak,doibkipe,vsaakobaggdewhkaeekkeegeiitbiineaenidkeeditiesdelgsesdtli.0tjftehgycl,kd0ej,drtenbixvrtlabvotgdn,beLveannvnaeigeinepsnsgsotsrteiatS,hrlSheagSs1egit1eflrerao1tstfeg1elfielohriaioattsohrefGpmtpaaebe.ererSne.sShnroeahbgtthnercreSg�.h�aSarogafsrfe�g�rsioioa-frnSreahrnSiiotnfeaed.ree.eainien,pdrtf,e.awwipe.wcnt.aooegogri.rtnnaannakkGgenaertnanakkd.oa.nhofet,e,htbrewsp.wp.blvtn,bhadtpnvw.hndaaneapnisahvvnalh,pmemvvrwwndhGmh0titGoreionel0bebtottnhntnehdhd.nhb22ttnnehsde22eshheibt,seietrrr,iehhmbhrthrdtynhsoornenelaenchntodetitreeehaielheeaeieeieo:ernterrtiieSyatyslrftilennnnangadctnidnsafeheehirtaatdceatehg:n:tnaenesiaaelilstaehdhaagslghadsaghgatasoatateac-fantgtarttnrtsaseaieiimsii.sehsihaatcesdsamhhesshtdgnesrssnao--tnnthrtrte-ignikirTttiiaikeic.isiaciaairmnfcesrersnnocnnorneicsrinntcr-deeir-dikgtkesteiaTnaehitraikrarntmidienf-mtncecnn-nannc,arcnhei,nritr-d-derneipr-dnacgtrahntrniatentmtmrseirea-d-gitmadeeebat-,,ebheheerhe7lg,ercslrngeneip-laarnidtiar,tsdiernuaiscai.signuieebeebastaiteeebehlulgeccge,lahdwc.ytbtdiitrirygettTloisuiruiacoshuintothfeteiutugidggaeeTruigdidhfyytnetfdftveybnosgvvtosihabraavnahoshrnnoyhettaeehasa0Bde00fhSfaol0Sinpvafovbbisvnviaivafibyvnabirspbgtvtbcr�vn0ahs0Br�hBr�td00S�S0oBoal0Shoniidu0tuuyb0cbchdlocgbgrorc�ci��vr�akrre�ekvkb�rbekhabifauru�h�teuuuccwartllu0jcoa0duionbvvavkakavneebkgvkvavknbbakiabinifanfinvknkza1blr1tl0rj0lja1or1sl0jonnbbvvvvngwgbevdndnvh0eiabalalnnbganbfervv��rayyza11�1�1llao1o1loctcioedwreaen1ieb1dlbaebinnbgtgntrirgn�t��r�eaklkn��lkwkniaohteeisii�eeo�dredreiievfdr=ipvfattiartlnnndvveakakkikmnvvrtGakkva0oGhhl0innhkkfnvv22=itppvz22aa,pfalne,ovvvvzmmvvwGG0a0ohGtal0vvwnhnhnhr.aG222222,,htnr,22etiIaaosi=aonlnet=llnl with one of its paths being a5wuintihCtigohnaneroaotfcstipteesrllpiiznaagthtasiosbnuebinostgfriasnagufneoiftsiSgt.rniontgssp:eollmingnaitsigubsstring of S. with one of its paths btehienygaawrwueintinthihtoiogtonnaneeoostouffsbiptistsetslprlpiiannatgghthsaosbfsbeuaeibnninsgytgraoiantughunenoitrifitgsiSgan.fnoeotststpsrpeinlellgilni.ngIgnaasfauscubtbs,tsrttirhnignisgoifsofSdS.e.sirable in practice, and moreover, 5 Characterization of safe strings: o6m6nitigs the set oIfnatlhlissasfeectsiotrni,nwgespisrotvhideesaetchoafraalcltesruizbasttiroinngofswoaflkthsethmatasxpimellaslaofensetsr.ings. This characterization 55 CChhaarraacctteerriizzaattiioonnooffIw5s5ns5IaiawnltflfheieCtbliCChlseshsibsthhsttaeerhrasacrieetirnrtahnciaabtogcegicacntsosbstte,:ni:aeesrw,osrroiiowezmiismfzzaepoaaontrnpfuttoiiriooirivtotoouniiviodgnnrgimedsosooenoamfiffatcsnihsasgictaaahfiaregffaaleergcasaotlstscgrertttiroteririrhnzriiiiannmtgzhtaggsimoti:ssnin::oointnooomhfmmoetwnfhnnnewaietliianxkittgltesiikgxgsssttssehtcashtetaicotstnpiso.epnlel.lslasfaefestsrtinrignsg.sT. Thihsicshcahraarcatcetreizraiztaiotinon In this section, we provide a cIwhnDailtrelhafibicsnetesitrethicizoetainotbinao5,sniws(oOeoffpmworaonulvkriitsdoigetmh,anaitcethidsgpagreaeall-lcgctsoeearrnfiitetzhrasmittcrioiinnnmgotsh.ofeTdwenhaleil)skx.stchtsLhaeeractattciosGtpner.eilbzleastaiaofendstirreinctgesd. Tghraisphchaarnadctelertizawtion= In this section, we proAvidneoateDIcInhneoatfinrthahnisciaisttsseisseorecuincztmtiaoi5topninot,(in,woOwnoemefsp:wpnroraiOovtlkviiugdsidre,tehameaacdtohcgdhsaepear-aelrclaclemtcenstaaretifrkrzeieiazcsstatitortiminohnnrgooesfode.fweTwilam)hla.kilpsskLlsctihehcttaiahtrtaaGtascpstsseepburleelimlzslaaasptfaietodfiensoirtnsertcisrnt,iegndasg.ssTg.orThauiphsthilscinhcaeahndradarcaatlcteetrtetizrhwiazetaiote=innodn of wwiillllbbeetthheebbaassiissooffoouutrrhooemmInnniti(wttwrvwiiDgio0g(liil,lvdleallae0bulfi0lb,bgegc,neoeeovt0triri1t,thtioit,hhvitehone1ehe1mbn,.m,bbeaH.1aa5si.i,nsisen.s.iir(s,s.toetOv.hoohf,t,eff,meovweoontuntn,euuer,eeixrvrxottott,toibo+mvgmsms1stn,e+ee)ncnicr1ttiibtev)ititiegodieioggbnngaaet.a.alehgll-wagaogcaorotewilrrtnksiiahttutlhhmikrcnmmihciinGniiannm.stGshttWuo.hhedmeeWenenenpslexee)tatxx.iysottsaneLtysschseeetaccttiahtttoiiranGoowet.nnn..wibeseciaesasoasamdorinrymeitcntitgoietdipigfrgioarfvanaepdnheodvnaeolnnyndlityfhleieftfosrfwiomarlpla=llelst Definition 5 (Omnitig, edD(gve0efi,-enc0ei,tnvito1r,niec15,.m(.O.o,mdvte,nle)itt.,ivgtL,+e1t)edGbgeeba-ecewanatlkrdiiicrnecmtGed.odWgerela)p.shayLaetnthdaGtlewtbeiwsaa=doimrenctietdig gifraapnhdaonndlyleiftfowr a=ll D((vv0e0,fi,een00,i,tvvi1o1,,nee11,5,...(..O.,,vmvtt,,neetitda,t,vievgutst+,+inr11ie)te)diD(dbgD((vbevgvpe0e0e0i,efiran,,efia-onee0cnp0w0,aiewt,i,evnantivavr1olit1l1,tkokenr,y,endeie1i:cin11,5gnt,5,.eh...Gm(-G...a(cO...,Ot.eo,,vmnWvdvtWtm,ttthen,e,reeneleetii)c,tttis.,suv,tiagavvidtny+gyt,tLe++i1,tet)11tietBhh))gedabardGtebbgatueegeliwawgje-baanoc-ewiceriswwsgeiantaarlantahkaallrtkkmpdiroiocniihmiimrcnnoenGcnmGuiGGtmi.ttetoipd.i.bgWogduuWWde6igteiiflefsrle6e)tal6sa.a)opasns.fnnhyaardLdloyyyLetmaothoettnnsanthhdaGltaaltyyfGthtewlieeibwwffstebit(fsefrkoiiowassirarn+aaagadao=sli1ldmoolr.l)iemmrnL-cemitnnecteiiittdettgeiirdwggsigfriigo=affarfpanaah(padnnvhdd0ago,nenaooednlnnn0yod,llyylmvief1ltiiee,ffftoewSffr1oow.,rrav=Il2aalf=)lllltbhee genome is not circular (assumption (i)), then e.g. the last k-mer of S can be v , its first k-mer can 6 6 0 be v1, the string v0 k v61 can appear inside S, bu6t6v0 k v1 k v2 does not have to appear in S. If ⊕ ⊕ ⊕ there are gaps in coverage (assumption (ii)), then both an in-neighbor v and an out-neighbor v (cid:48) (cid:48)(cid:48) of v may be missing from G making w look safe whereas in reality v k v k v may not be a 1 0 1 2 ⊕ ⊕ substring of S. If a read contains a sequencing error (assumption (iii)), then this creates a bubble in G with one of its paths being a unitig not spelling a substring of S. 6 5 Characterization of safe strings: omnitigs In this section, we provide a characterization of walks that spell safe strings (see Figure 3 for an illustration). This characterization will be the basis of our omnitig algorithm in the next section. Definition 5 (Omnitig, edge-centric model). Let G be a directed graph and let w = (v ,e ,v ,e ,...,v ,e ,v ) be a walk in G. We say that w is a omnitig if and only if for all 0 0 1 1 t t t+1 1 (cid:54) i (cid:54) j (cid:54) t, there is no proper v -v path with first edge different from e , and last edge different j i j from e . i 1 − v0 e0 ei−1 vi ei ej−1 vj ej et vt+1 Fig.3. An illustration of the omnitig definition, edge-centric model The following theorem proves that the omnitigs spell all the safe strings, using the help of an intermediary characterization of omnitigs. Theorem 1. Given an edge-centric graph model G = (R) built for a set of reads R, and a string G s, the following three statements are equivalent: (1) s is a safe string for G; (2) s is spelled by a walk w = (v ,e ,v ,e ,...,v ,e ,v ) in G and w is an omnitig; 0 0 1 1 t t t+1 (3) s is spelled by a walk w = (v ,e ,v ,e ,...,v ,e ,v ) in G and for all 1 (cid:54) j (cid:54) t all proper 0 0 1 1 t t t+1 v -v (circular) walks w fulfill at least one of the following conditions: j j (cid:48) (i) the subwalk (v ,e ,...,v ,e ,v ) of w is a prefix w , or j j t t t+1 (cid:48) (ii) the subwalk (v ,e ,...,v ,e ,v ) of w is a suffix of w , or 0 0 j 1 j 1 j (cid:48) − − (iii) w is a subwalk of w . (cid:48) We prove Theorem 1 by proving the cyclical sequence of implications (1) (2) (3) (1). ⇒ ⇒ ⇒ Proof of (1) (2). Assume that s is a safe string for G. By definition of a genome graph, s is ⇒ spelled by a unique walk in G. Let w = (v ,e ,v ,e ,...,v ,e ,v ) be this walk, and let A be a 0 0 1 1 t t t+1 circular edge-covering walk of G; thus A contains w as subwalk, and s is a sub-string of spell(A). Assume for a contradiction that there exist 1 (cid:54) i (cid:54) j (cid:54) t, and a proper v -v path p with first j i edge different from e and last edge different from e . From A, we will construct another circular j i 1 − edge-covering walk B of G which does not contain w as subwalk, and hence, by the definition of a genome graph, also spell(B) does not contain s as sub-string. This will contradict the safeness of s. Whenever A visits node v , then B follows the v -v path p, then follows (v ,e ,...,e ,v ), j j i i i j 1 j − and finally continues as A. To see that w does not appear as a subwalk of B, consider the subwalk w = (v ,e ,v ,e ,...,e ,v ,e ,v ) of w (recall that 1 (cid:54) i (cid:54) j (cid:54) t). Since p is proper, and (cid:48) i 1 i 1 i i j 1 j j j+1 − − − its first edge is different from e and its last edge is different from e , then, by construction, the j i 1 − only way that w can appear in B is as a subwalk of p. However, this implies that both v and v (cid:48) j i appear twice on p, contradicting the fact that p is a path. 7 v0 vp w(cid:48) vj v(cid:96) vt+1 v0 w(cid:48) vj v(cid:96) vt+1 e(cid:48) e(cid:96) e(cid:48) e(cid:96) (cid:96) (cid:96) c c (a) (b) v0 w(cid:48) vr vt+1 w (cid:48)(cid:48) c (c) Fig.4. Illustration of three cases in the proof of the implication (2) ⇒ (3) of Theorem 1. Proof of (2) (3). Suppose that w is an omnitig, and assume for a contradiction that there exists ⇒ a proper v -v walk (for some 1 (cid:54) j (cid:54) t) not satisfying (i)–(iii). Let w be the shortest such walk. j j (cid:48) Since w does not have (v ,e ,...,v ,e ,v ) as prefix, then there exists a first node v on w, (cid:48) j j t t t+1 (cid:96) j (cid:54) (cid:96) (cid:54) t, such that from v , w continues with an edge e = e . Symmetrically, since w does not (cid:96) (cid:48) (cid:48)(cid:96) (cid:54) (cid:96) (cid:48) have (v ,e ,...,v ,e ,v ) as suffix, let v be the last node of w, 1 (cid:54) i (cid:54) j such that before 0 0 j 1 j 1 j i − − entering v , the walk w uses an edge e = e . Let w denote the subwalk of w between e and e (incluisive). If w i(cid:48)s a path, then w(cid:48)i−1i(cid:54)s aip−r1oper v -0(cid:48)v path, 1 (cid:54) i (cid:54) (cid:96) (cid:54) t, wh(cid:48)ose first ed(cid:48)(cid:96)ge e (cid:48)i 1 0(cid:48) (cid:48)(cid:48) (cid:96) i (cid:48)(cid:96) is−different from e , and its last edge e is different from e , which contradicts the fact that w (cid:96) (cid:48)i 1 i 1 is an omnitig. We now prove that w is−in fact a path. − (cid:48) Suppose for a contradiction that it is not, thus that it contains a cycle c, with c = w . Let w (cid:48) (cid:48)(cid:48) (cid:54) be the walk obtained from w by removing the cycle c. Observe that w is still a proper v -v walk. (cid:48) (cid:48)(cid:48) j j We show that w still does not satisfy (i)–(iii), which will contradict the minimality of w . Assume (cid:48)(cid:48) (cid:48) for a contradiction that w satisfies at least one of (i), (ii), or (iii). (cid:48)(cid:48) First, if w satisfies (i), this implies that the edge e out-going from v belongs to c, and after (cid:48)(cid:48) (cid:48)(cid:96) (cid:96) traversingc,thewalkw continuesthrough(v ,e ,...,v ,e ,v )(seeFigures4(a)and4(b)).Letv (cid:48) (cid:96) (cid:96) t t t+1 p be the node of w with greatest index p 0,...,(cid:96) that c visits with an edge e = (v,v ) not on w. (cid:48) p ∈ { } Suchanodeexistsbecausecisacycleanditmustreturntov .Ifp (cid:62) 1(seeFigure4(a)),thencdoes (cid:96) not satisfy (i)–(iii). Since c is proper and passes through v , where 1 (cid:54) (cid:96) (cid:54) t, this contradicts the (cid:96) minimality of w . Therefore, p = 0 (see Figure 4(b)), and thus, the initial v -v walk w (containing (cid:48) j j (cid:48) c as subwalk) visits (v ,e ,...,v ), and then continues through (v ,e ,...,v ,e ,v ). This implies 0 0 (cid:96) (cid:96) (cid:96) t t t+1 that w contains w as subwalk, which contradicts the choice of w . (cid:48) (cid:48) The second case when w satisfies (ii) is entirely symmetric. (cid:48)(cid:48) Third, assume that w contains w as subwalk. Since w is not a subwalk of w , this implies that (cid:48)(cid:48) (cid:48) c is a proper v -v walk, for some 1 (cid:54) r (cid:54) t, not satisfying (i)–(iii), which again contradicts the r r minimality of w (see Figure 4(c)). This completes the proof of (2) (3) (cid:48) ⇒ Proof of (3) (1). Assume w satisfies (3), and let A be a circular edge-covering walk of G. We ⇒ need to show that w is a subwalk of A. Let w = (v ,e ,...,v ,e ,v ) be the longest prefix of j 0 0 j 1 j 1 j − − w that A ever traverses, ending at some v . Since A covers all edges, then it also covers e , and j 0 thus j (cid:62) 1. Suppose for a contradiction that j = t+1. (cid:54) 8 Algorithm 1: Omnitig algorithm to find all safe strings of a graph G. 1 extend(w) 2 Denote w=(v0,e0,v1,e1,...,vt−1,et−1,vt); 3 foreach edge e=(vt,y) out-going from vt do 4 X :=(N−(v1)\{v0})∪···∪(N−(vt)\{vt−1}); 5 let G(cid:48) equal G minus the edge e; 6 if there is no path in G(cid:48) from vt to a node of X then 7 extend((v0,e0,v1,e1,...,vt−1,et−1,vt,e,y)); 8 if w was never extended then 9 W :=W ∪{w}; 10 W :=∅; 11 foreach edge e=(u,v) of G do 12 extend((u,e,v)); 13 remove from W any walk that is a subwalk of another walk in W; 14 return {spell(w) : w∈W}; Since A is circular and covers all edges of G, then after traversing w , the walk A eventually j visits the edge e . The walk A may visit v multiple times before traversing the edge e . Let w j j j (cid:48) denote the subwalk of A between the last two occurrences of v before A traverses the edge e . j j Since w is a proper v -v walk, 1 (cid:54) j (cid:54) t, and w satisfies (3), we have that one of the following (cid:48) j j must hold: the walk (v ,e ,...,v ,e ,v ) is a prefix of w : this contradicts the fact that w is a subwalk j j t t t+1 (cid:48) (cid:48) • of A between v and the immediately next occurrence of e (since in this case w contains e ); j j (cid:48) j the walk (v ,e ,...,v ,e ,v ) is a suffix of w : this implies that (w ,e ,v ) is a longer 0 0 j 1 j 1 j (cid:48) (cid:48) j j+1 • − − prefix of w which is a subwalk of A, contradicting the maximality of w ; j the walk w appears on w : since w is a subwalk of A, this implies that also w is a subwalk of (cid:48) (cid:48) • A, contradicting again the maximality of w . j 6 Omnitig algorithm In this section, we use Theorem 1 to give the omnitig algorithm (Algorithm 1) and prove that it runs in polynomial time (Theorem 2). The algorithm finds all maximal omnitigs of (R), which, by G Theorem 1, are exactly the maximal safe strings of (R). Our algorithm is based on the following G observation, which follows directly from the definition of omnitigs: Observation 1. Consider a walk w = (v ,e ,...,e ,v ,e ,v ) of length at least two, and (cid:48) 0 0 t 1 t t t+1 − consider its subwalk w = (v ,e ,...,e ,v ). Then w is an omnitig if and only if (i) w is an 0 0 t 1 t (cid:48) omnitig and (ii) for all 0 (cid:54) i (cid:54) t 1, th−ere is no proper v -v path with first edge different from e t i t − and last edge different from e . i 1 − The idea of the algorithm is to start an exhaustive traversal of G from every edge (Lines 11-12), which by definition is an omnitig, and to keep traversing edges as long as the current walk is an omnitig. An omnitig w is thus recursively constructed, by possibly extending to the right with each 9 edge e out-going from its last vertex (Lines 3-7). If w extended with e is not an omnitig, then we abandon this extension because Observation 1 tells us that no further extension could be an omnitig. To check if this extension is an omnitig or not, it is enough to check whether condition (ii) of Observation 1 is satisfied. Condition (i) is automatically satisfied because of the structure of the algorithm–we extend only walks that are omnitigs. The omnitigs found are saved in a set W (Line 9), except for those omnitigs that are obviously non-maximal (Line 8). In the final step (Lines 13-14), we remove the non-maximal omnitigs from W and report the rest. To check that condition (ii) is satisfied (Lines 4-6), we take the set X (Line 4) and check if there is a path starting with an edge out-going from v and different from e, and leading to a node of X. t The correctness of this procedure can be seen as follows. If there is no such path, then we know that there is no path satisfying (ii). If we do find a path p from v to some in-neighbor x X of t ∈ some v , and p does not use v , then the path obtained by extending p to v contradicts (ii). If p i i i contains v , then such an extension is not possible, because a path cannot repeat a vertex; however, i we will show that p cannot use v by contradiction. Assume that it does, and observe that after i passing through v , the path p cannot pass again through v . Let v , i (cid:54) j < t, be the first vertex i t j that p visits after v such that from v it continues with an edge e = e . Let p denote the v -x i j (cid:48) j (cid:48) j (cid:54) subpath of p from v until x. We obtained that p followed by v is a proper v -v path with first j (cid:48) i j i edge different from e , last edge different from e , and 1 (cid:54) i (cid:54) j (cid:54) t. This contradicts the fact j i 1 − that w (the walk we are extending) is an omnitig. Next, we show that the algorithm runs in polynomial time. First, we show that the number of omnitigs included in W and their length, prior to removal of non-maximal ones, is polynomial: Lemma 1. Let W be a set of omnitigs in an edge-centric graph model (R), whose genome graph G is different than a single cycle. Furthermore, suppose no omnitig in W is a prefix of another omnitig in W. Then, W (cid:54) nm and the length of any omnitig in W is O(nm). | | Proof. We first show that we can visit the edges of G = (R) with a circular edge-covering walk G C of at most nm nodes. Let e ,...,e be an arbitrary order of the edges of G. Since we assume 0 m 1 − that G admits one genomic reconstruction, then G is strongly connected. Thus, from every end extremity of e there is a path to the start extremity of e , 0 (cid:54) i (cid:54) m 1, of length at i (i+1)modm − most n 1. Therefore, C can be constructed to first visit e , then to follow such a path until e , 0 1 − and so on until e , from where it follows such a path back to e . m 1 0 − By Theorem 1, we have that any omnitig of G is a subwalk of C. We can associate every w W ∈ with all the start positions in C (in terms of nodes) where it is a subwalk. Because W does not contain walks that are prefixes of other walks, a position of C can have at most one walk associated with it. Since C (cid:54) nm, W can contain at most nm walks. | | It remains to prove that the length of any omnitig in W is O(nm). To simplify notation, rename C as(v ,e ,v ,e ,...,v ,e ,v )withv = v .SinceGisdifferentthanasinglecycle,thenthere 0 0 1 1 t t t+1 t+1 0 exist v and v on C, such that e = (v ,v ) is an edge of G, and e / e ,e . Any omnitig (thus j i j i j i 1 ∈ { − } a subwalk of C) cannot contain twice v and v as internal nodes, since otherwise the proper path j i (v ,e,v ) violates the omnitig definition. Thus the length of any omnitig is O(nm). j i (As an aside, it remains open whether the bound on W can be reduced to m, which is the case | | for unitigs; our experiments from Section 8 suggest this may be the case in practice.) Note that Line 8 guarantees that W, prior to removal of subwalks in Line 13, satisfies the prefix condition of Lemma 1. Lemma 1 then implies that reporting one omnitig by our algorithm takes polynomial 10