Do Read Errors Matter for Genome Assembly? Ilan Shomorony Thomas Courtade David Tse UC Berkeley UC Berkeley Stanford University [email protected] [email protected] [email protected] Abstract—While most current high-throughput DNA sequencing length and/or coverage depth requirements? An answer to these technologies generate short reads with low error rates, emerging basic feasibility questions can provide an algorithm-independent sequencing technologies generate long reads with high error rates. framework for evaluating different sequencing technologies. It A basic question of interest is the tradeoff between read length would also settle some speculations in the assembly community and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, on whether read errors have a significant impact in long-read we make progress on this problem by establishing a critical read technologies (see for example [2]). 5length,asafunctionofthegenomeandtheerrorrate,abovewhich Such a framework was initiated in [3] for error-free reads: perfectassemblyisguaranteed.Forseveralrealgenomes,including 1 a feasibility curve relating the read length and coverage depth thosefromtheGAGEdataset,weverifythatthiscriticalreadlength 0 needed to perfectly assemble a genome was characterized in isnotsignificantlygreaterthanthereadlengthrequiredforperfect 2 assembly from reads without errors. terms of the repeat complexity of the genome (see examples in n Fig. 1). Evaluating this curve on several genomes revealed an a I. INTRODUCTION interesting threshold phenomenon: if the read length is below a J CurrentDNAsequencingtechnologiesarebasedonatwo-step certain critical value (cid:96)crit, reconstruction is impossible; a read 5 process. First, tens or hundreds of millions of fragments from length slightly above (cid:96) and a coverage depth close to the 2 crit random locations on the DNA sequence are read via shotgun Lander-Waterman depth c (i.e., just enough reads to cover LW ] sequencing. Second, these fragments, called reads, are merged the whole sequence) is sufficient. The critical read length (cid:96)crit Tto each other based on regions of overlap, using an assembly is given by the length of the longest interleaved repeat in the .Ialgorithm. genome, and coincides with the minimum read length L needed s Roughly speaking, different shotgun sequencing platforms can to uniquely reconstruct the genome given its L-spectrum, i.e. the c [be distinguished from the point of view of three main metrics: set of reads with one length-L read starting at each position of the read length, the read error rate, and the read throughput. the sequence, illustrated in Fig. 2. This minimum read length 1 In the last decade, the so-called next-generation sequencing also appeared in earlier works by Ukkonen and Pevzner [4, 5] v 4platforms have attained considerable success at employing heavy for reconstruction via sequencing by hybridization. 9parallelization in order to achieve high-throughput shotgun se- Giventhisframework,theimpactofreaderrorscanbestudied 1quencing. This allowed a significant reduction in the cost and by asking how much the critical read length (cid:96) increases when crit 6 time of sequencing, causing an explosion in the number of new there are errors. In this paper, we investigate this tradeoff for a 0 sequencing projects and the generation of massive amounts of specific error model: 1) the errors are erasures; 2) the erasures . 1sequencing data. occur at a rate no larger than D/L for each read and for each 0 In order to guarantee low error rates, most of these next- baseinthesequence,butareotherwisearbitrary.Ourmainresult 5 generation technologies are restricted to short read lengths, is the characterization of a critical read length (cid:96)˜ above which 1 crit :shifting some of the burden of sequencing to the assembly step. perfect assembly is always possible. While in the noiseless case vInpractice,thisresultsinveryfragmentedassemblies,withlarge (cid:96) is a function of the sequence repeat structure, (cid:96)˜ depends i crit crit Xgapsandlittlelinkinginformationbetweenfragments[1].Onthe more generally on the error rate and on the approximate repeats rother hand, recent technologies that generate longer reads suffer in the sequence. More concretely, for a sequence s, afrom lower throughput and much higher error rates1. (cid:96)˜ (s,D)= min k+D·M (D,k+1), Given this technology trend, the natural questions to ask are: crit s k≥(cid:96)crit(s) what is the impact of read errors on the performance of assem- blers? Is the negative impact of read errors more than offset by w1h0ere Ms(D,(cid:96)) is the maximu1m0 nu mber of D-approximate 190 9 10 the increase in read lengths in long-read technologies? It is well 98 8 9 kgewpbrnoeehrroroitiawcartshshinknmcertrtedshoe.saauFftstrleoptorsreemecaeixidnxfitaraceanamrndareidpolngoelrfeosudo,rsrhiicmtnanohvaommDetdipeseoa.BlsnesAr-xaiutginhitmnjdyeni.oofiergHrdceeratgoainfewcputshenpivimdboneaaiprnmst,ahettceedhtoneaaftosaslnegslveoiqaoemrsubwistbeshe:slemmytrgivobsigav,nlrtyerainocepaaanhlndas-, malized coverage depth345678234567 malized coverage depth234567 345678 rpreeeacrdobnalsestnerug),cthits,thtahenegreeerrneoonrmoruaegt?ehDainonfdoerramrocarostvioseinrgaingniefithcdaeenprtetlhayd(innducamrteabatseoeruotnhfieqreureaeadlyds nor01201 110000 220000r``eiinnatteedrrllee 33aalvve00eedd00ngth 44L00 00 550000 660000 nor``r7r7eepp00ee0a0a01tt 880000500012 1r0e0a10d00 len1g52t00h`00 iL`nitn eterrllee3aa2vv0e0ed0d00 40205005`0re0pe3a0t00600 `r7ep0e0at 800 (a) S.aureus (b) R.sphaeroides 1One example of a short-read-length technology is Illumina, with reads of length∼200basepairsanderrorratesofabout1%.Incontrast,PacBioreads Fig.1. Thethickblackcurveisafeasibilitylowerboundforanyalgorithm,and canbeseveralthousandbasepairslong,witherrorratesofabout10-15%. thegreenlinerepresentstheperformanceoftheMultibridgingalgorithm[3]. to the error-free L-spectrum of s by RL,0(s). Notice that the starting position i for each read r is unknown to the assembler. i Whilesuchareadmodelwasoriginallyproposedinthecontext ofsequencingbyhybridization[4,5,9],ourmotivationforusing it comes from next-generation sequencing technologies, where Fig.2. ThesequencesanditsL-spectrum,RL,0(s). the high read throughput can provide large coverage depths at low costs, and a dense read regime is not unrealistic. This way, length-(cid:96) repeats in s. Moreover, reminiscent of classical coding we can bypass the question of the necessary coverage depth for theoryresults,weshowthatthesamereadlength(cid:96)˜ issufficient crit assembly, and instead focus on the interplay between read length forassemblyifinsteadoferasuresweconsidersubstitutionerrors anderrorrateinthecontextofassemblyfeasibility.Moreover,as at half of the rate. In order to characterize (cid:96)˜ , we derive a new crit shown in [3] for noiseless reads, the dense-read model provides result about the error correction capability of the L-spectrum. valuableinsightstowardsunderstandingtheinformation-theoretic More precisely, we show that given a noisy version of the L- limits of reconstruction in the more general shotgun read model. spectrum of a sequence, it is possible to obtain the noiseless In the L-spectrum read model, since we have exactly G reads, (k + 1)-spectrum of the same sequence, for any k such that an assembly of the reads R (s)={r ,...,r } can be thought L > k+D·M (D,k+1). When L > (cid:96)˜ , we can obtain the L,0 1 G s crit ofasapermutationσoftheentriesof(1,...,G).Weassumewith- noiseless (k+1) spectrum for some k >(cid:96) , and the noiseless crit outlossofgeneralitythattheidentitypermutationσ =(1,...,G) 0 result from [3] implies that perfect assembly is possible. yields a correct assembly of s. Notice, however, that the index i By evaluating (cid:96)˜ on several real genomes, including those crit of each read r is unknown to the assembler. Notice also that in in the GAGE dataset [6], we verify that (cid:96)˜ is not significantly i crit general, there may be multiple correct assemblies for a sequence largerthan(cid:96) .Infact,inmostcases,(cid:96)˜ ≈(cid:96) +3D.Hence,if crit crit crit s if r =r for some i(cid:54)=j. i j thereadlengthLischosenabovethenoiselessrequirement(cid:96) , crit perfect assembly is robust to errors up to a threshold (roughly B. Adversarial Erasure Model 13(L−(cid:96)crit) erasures per read). As in the classical coding theory literature, we will study the The impact of read errors on the information theoretic limits problemofDNAassemblywithnoisyreadsfromtheperspective of genome assembly has also been studied in the setting of an of an adversarial noise model. Given that actual sequencing i.i.d. genome model and asymptotically long genome length [7], noise profiles are complex (non-i.i.d., asymmetric across bases) buildingonanearlierworkonerror-freereadsinthesamesetting and technology-dependent, this approach avoids the need for a [8].Theresultsaresurprising:aslongastheerrorrateisbelowa probabilistic noise model by instead focusing on a worst-case threshold (which can be as high as 19% for substitution errors), scenario. Moreover, under this model we can hope to obtain noisy reads are as good as noiseless reads; i.e., the requirements deterministicandnon-asymptoticconditionsforperfectassembly, for assembly in terms of read length and coverage depth are the whichcanbemoreeasilyanalyzedintermsofrealgenomedata. same in both cases. While this result seems stronger than the Motivated by the fact that sequencing technologies usually result in the present paper, it is proved under the idealistic and provideaqualityscoreforeachbasethatisread(whichcouldbe unrealisticsettingsofi.i.d.genomestatisticsandi.i.d.errors.The thresholdedinto“good”and”bad”bases),andinordertosimplify present result, on the other hand, is more robust as it applies to the problem, we will consider an erasure model. The reads in R arbitrary genome repeat statistics and error statistics. will be length-L sequences from the alphabet Σ(cid:48) ={a,c,g,t,ε}, where ε corresponds to an erasure. Thus, a read starting at II. PROBLEMSETTING position i from s can be written as r = (r [0],...,r [L−1]), i i i In the DNA assembly problem, the goal is to reconstruct a where either ri[j] = s[i+j] or ri[j] = ε, for 1 ≤ i ≤ G and sequence s=(s[1],...,s[G]) of length G with symbols from the 0≤j ≤L−1. alphabet Σ = {a,c,g,t}. In order to simplify the exposition, For a fixed parameter D, the adversarial erasure model will we assume a circular DNA model; thus, {s[i]}∞ is a periodic be constrained by a maximum error rate of D/L within each i=1 sequence with (minimum) period G. Our results hold in the non- read, and for each base. Since in our read model each base s[i] circular case as well under minor modifications. We will let s(cid:96)i isreadLtimes(ri−(L−1)[L−1],ri−(L−2)[L−2],...,ri[0]),these be the substring of length (cid:96) starting at s[i]; i.e., s(cid:96) =(s[i],s[i+ constraints can be written as follows: i 1],...,s[i+(cid:96)−1]). a) There are at most D erasures per read. The sequencer provides a multiset of N reads R = b) Each base s[i] is erased at most D times across all reads. {r1,...,rN} from s, each of length L. In the noiseless case, We will use RL,D(s) to refer to the L-spectrum of s, RL,0(s), each read is a length-L substring of s with an unknown starting after being corrupted by erasures satisfying (a) and (b). location.Ourfocus,however,willbeonnoisyreadmodels,where Inthecontextofanadversarialnoisemodelwithdeterministic each read may be corrupted by noise. The goal is to design constraints, it makes sense to restrict our attention to potential an assembler, which takes the set of reads R and attempts to sequences ˆs that are consistent with the reads R (s). A L,D reconstruct the sequence s. sequence ˆs is said to be consistent with R (s) if it could L,D havegeneratedthesetofreadsR (s)accordingtotheerasure L,D A. The L-Spectrum Read Model model in (a) and (b). By extension, we will say that an assembly We will consider a “dense-read” model, in which all the reads σ ofR (s)isconsistentifthereexistsasequenceˆs,consistent L,D intheL-spectrumofsareprovided.Moreprecisely,Rwillhave with R (s), that could have generated the reads in R (s) L,D L,D exactly G reads, one from each possible starting position; i.e., according to the positions determined by σ. As illustrated in R = {r ,...,r }, where r = sL for i = 1,...,G. We will refer Fig.3,wenoticethat(b)guaranteesthataconsistentassemblyσ 1 G i i Since (cid:96) (ˆs) can be computed from the assembled sequence crit ˆs, this result means that L > (cid:96) (ˆs) provides a certificate that crit ˆs=s, even without previous knowledge of (cid:96) (s). crit IV. MAINRESULTS In the previous section, we described how Theorem 1 fully characterizes when assembly is possible given the noiseless L- Fig.3. PartofaconsistentassemblyforL=5andD=2.Noticethattherecan spectrum. In this section, we seek a similar characterization in beatmostDerasuresperreadandper“column”oftheassemblyσ.Moreover, the case where reads are noisy. allnon-erasedbasesinacolumnmustagree. Notice that for the erasure setting described in Section II, one defines, up to cyclic shifts, a unique consistent sequence in ΣG, possible erasure pattern is to have the last D bases from each which we will refer to as ˆs(σ). read erased, which effectively results in noiseless reads of length The fundamental feasibility question corresponds to asking L−D. Therefore, the converse part of Theorem 1 implies that, which values of L allow unambiguous reconstruction. Formally, if L≤(cid:96)crit(s)+D, there is a read set RL,D(s) and a sequence it corresponds to the following algorithm-independent question. ˆs (cid:54)= s that is consistent with RL,D(s). But how much larger than (cid:96) (s)+D does the read length L have to be in order to Question 1. Consider a fixed circular sequence s ∈ ΣG. What crit guarantee unambiguous correct reconstruction? In other words, values of L guarantee that, for an arbitrary set of erased reads how do erasures degrade the fundamental limit characterized by R (s), s is the unique sequence consistent with R (s)? L,D L,D Theorem 1? Our main result is the introduction of a new sequence- III. ASSEMBLYINTHENOISELESSCASE dependent quantity, (cid:96)˜ (D,s), such that, if L > (cid:96)˜ (s,D), crit crit The assembly problem in Question 1 was first studied in [4] s is the unique sequence consistent with R (s). In general, L,D in the noiseless setting D =0. Notice that when L=1, R1,0(s) (cid:96) (s)+D < (cid:96)˜ (s,D) for D > 0, and one can construct an crit crit is simply the multi-set {s[1],...,s[G]} and any permutation arbitrary sequence s ∈ ΣG for which the gap between the two σ of (1,...,G) is a consistent assembly. Hence, s cannot be quantities is significant. However, by computing (cid:96) +D and crit reconstructed unambiguously, unless all of its symbols are the (cid:96)˜ for actual genomes, we verify that they are often close, as crit same. On the other hand, when L = G, there is a unique shown in Table I. assembly of R (s) = {s}, and s can always be reconstructed G,0 Rather than being defined in terms of exact repeats, as is the unambiguously. Question 1 is thus equivalent to asking for the case of (cid:96) (s), (cid:96)˜ (s) depends more generally on approximate crit crit threshold (cid:96)th for which s can be reconstructed if and only if repeats.ForasetofsegmentsS ofagivenlength(cid:96);i.e.,S ⊂Σ(cid:96), L>(cid:96) . In [4], this threshold is established as a function of the th we will first define the radius of S to be repeat structure of the sequence s, as we explain next. A repeat of length (cid:96) in s is a subsequence appearing twice ρ(S)= min maxdH(y,x), (1) at some positions t and t (so s(cid:96) and s(cid:96) ) that is maximal; x∈Σ(cid:96) y∈S i.e., s[t1−1](cid:54)=s[t21−1] an2d s[t1+t1(cid:96)](cid:54)=s[tt22+(cid:96)]. Two pairs of where dH(y,x) is the Hamming distance between y and x. repeatss(cid:96) ,s(cid:96) andsk ,sk areinterleavedifa <b ≤a <b . We will say that the segments in S are d-approximate copies if Due to thae1ciarc2ular DbN1Abm2odel, since a subseq1uenc1e s(cid:96) c2an als2o ρ(S)≤d. Intuitively, a sequence s that contains a large set S of t be written as s(cid:96) for any integer m, we additionally require length-(cid:96)segmentswithasmallradiusρ(S)hasmoreambiguityin t+mG that b −a < G. The length of a pair of interleaved repeats termsofassembly.Tocapturethat,wewillletM(d,(cid:96))correspond 2 1 s(cid:96) ,s(cid:96) and sk ,sk is defined to be min((cid:96),k). We let (cid:96) (s) to the maximum number of d-approximate length-(cid:96) segments in bae1thea2length bo1f thbe2 longest pair of interleaved repeats ininsterand s; i.e., set (cid:96) (s) = (cid:96) (s)+1. The results from [4, 5] imply the crit inter M (d,(cid:96))=max{|S|:S ⊂R (s),ρ(S)≤d}. (2) s (cid:96),0 following: Notice that M (d,(cid:96)) is monotonically decreasing in (cid:96). We let s Theorem1. IfL>(cid:96) (s),thensistheuniquesequencethatis crit consistent with RL,0(s). Conversely, if L≤(cid:96)crit(s), there exists (cid:96)˜crit(s,D)= min k+D·Ms(D,k+1). (3) a sequence s(cid:48) (cid:54)=s that is also consistent with R (s). k≥(cid:96)crit(s) L,0 Notice that (cid:96)˜ (s,D)≥(cid:96)˜ (s,0)=(cid:96) (s). Our main result is In other words, Theorem 1 characterizes the threshold on L crit crit crit the following. that fully answers Question 1. We point out that, in the previous literature [3, 4], (cid:96) was defined in terms of the length of pairs Theorem 2. If L > (cid:96)˜ (s,D), then s is the unique sequence crit crit of interleaved repeats (defined in a more restrictive way) and that is consistent with R (s). L,D the length of triple repeats. However, one can verify that by The main tool used to prove Theorem 2 is a result about considering the more general definition of interleaved repeats spectrum error correction. More precisely, we show that from above, triple repeats are included as a special case. a noisy version of the L-spectrum of s R (s), it is possible Notice that, while Theorem 1 characterizes the minimum L L,D to obtain R (s), for some effective read length L(cid:48) < L. This thatguaranteesperfectreconstruction,(cid:96) (s)isafunctionofthe L(cid:48),0 crit result and the proof of Theorem 2 are presented in Section V. groundtruths,andisnotknownapriori.However,thefollowing As in the noiseless case, we point out that (cid:96)˜ (s,D) cannot corollary of Theorem 1 readily follows: crit be computed a priori, since it is a function of the ground truth Corollary 1. If a sequence ˆs is consistent with R (s) and sequence s. However, Theorem 2 can in fact be used to obtain a L,0 L>(cid:96) (ˆs), then ˆs=s. certificateresultanalogoustoCorollary1,allowingonetocertify crit whether an assembly sˆis correct, even without prior knowledge s ATCAGGTA GTACCTAG Vs Vsˆ of (cid:96)˜ (s) and M (D,·). crit s 1 1 Corollary 2. If a sequence ˆs is consistent with R (s) and τ1 τ2 2 2 L,D 3 L>(cid:96)˜crit(ˆs), then ˆs=s. t1 t2 t3 τ1 tt12 Proof: If ˆs is consistent with R (s), by the definition of consistency, R (s) can be viewedLa,Ds a set of reads R (ˆs) sˆ τ2 t3 L,D L,D A T CAGG T ACCTAG A T CAGGTA G G from ˆs, with an erasure pattern satisfying (a) and (b). But from Theorem 2, if L > (cid:96)˜ (ˆs), ˆs is the unique sequence that is consistent with R (cˆsri)t = R (s). Since s must also be Fig.4. Weplaceanedge(u,v)in(Vs,Vˆs,E)ifsku+1=ˆskv+1.Inthisexample, L,D L,D (τ1,t1),(τ2,t2)and(τ1,t3)aresomeoftheedgesin(Vs,Vˆs,E). consistent with R (ˆs), we must have ˆs=s. L,D In Table I, we show the value of (cid:96)˜ (s,D) computed for crit for the set of reads R (s) with assembled sequenceˆs=ˆs(σ). several real genomes. Computing (cid:96)˜ (s,D) is generally imprac- L,D crit The main idea of the proof is to show that (k+1)-blocks in s tical from a computational standpoint, so the values in Table I and ˆs are in one-to-one correspondence; i.e., ˆsk+1 = sk+1 for are based on heuristics implemented by a sequence alignment t τ(t) a bijective mapping τ : {1,...,G} → {1,...,G}, which implies tool called Nucmer [10]. We choose the value of D such that R (s)=R (ˆs). D/(cid:96) ≈ 15%. We point out that the first two genomes, R. k+1,0 k+1,0 crit In order to show the existence of this bijection τ, we consider sphaeroidesandS.aureusarefromtheGAGEdataset[6],which abipartitegraph(V ,V ,E ),whereV =V ={1,...,G}and is used as a benchmark for assemblers. Notice that, with the s ˆs k+1 s ˆs exceptionofE.coli536,inallcases(cid:96)˜ (s,D)=(cid:96) (s)+mD, E = {(u,v) ∈ Vs×Vˆs : sku+1 =ˆskv+1}, as illustrated in Fig. 4. crit crit The existence of the bijective mapping τ is equivalent to the for m∈{2,3,4}. This occurs because, for the genomes consid- existenceofaperfectmatchingin(V ,V ,E).Hence,Theorem3 ered, (cid:96) (s) is already long enough so that there aren’t many s ˆs crit is equivalent to the following: approximate repeats of that length. Claim 1. There exists a perfect matching in (V ,V ,E). s ˆs Genome(s) (cid:96)crit(s) (cid:96)˜crit(s,D) D For a set of nodes U ⊂V , we let δ(U)={v ∈V :(v,u)∈ R.sphaeroides 271 331 30 ˆs s E for u∈U} be the set of neighbors of U. We will show that, S.aureus 1799 2399 200 A.ferrooxidans 2628 3228 300 for any U ⊂ V , |δ(U)| ≥ |U|, and by Hall’s marriage theorem, ˆs E.coli536 3245 4462 450 Claim 1 will follow. We will first state the following lemma, E.coliK-12 1744 2544 200 which establishes |δ(U)|≥|U| for the special case of sets U of the form U ={u∈V :ˆsk+1 =x} for some x∈Σk+1. TABLEI x ˆs u COMPUTED(cid:96)˜crit(s,D)FORD/(cid:96)crit≈15% Lemma 1. For the bipartite graph (V ,V ,E), |δ(U )| ≥ |U |, s ˆs x x While the results in this section were presented for an erasure for any x∈Σk+1. model, they can be extended to a substitution error model. In The proof of Lemma 1 is at the end of this section. Now fact, if instead of D erasures per read and per base, we have consider a general set U ∈V . Let Sk+1 ={sk+1 ∈Σk+1 :u∈ D/2 substitution errors, the proofs of Theorems 2 and 3 can ˆs U u U}. Since two nodes u,u(cid:48) ∈ U with sk+1 (cid:54)= sk+1 cannot be be modified accordingly, and the statements still hold. We will u u(cid:48) connected to the same node v ∈V , we have restrict the discussion to the erasure case for simplicity. s (cid:88) (cid:88) V. SPECTRUMERRORCORRECTION |δ(U)|= |δ(Ux∩U)|= |δ(Ux)| The main result we use to prove Theorem 2 is a statement x∈Sk+1 x∈Sk+1 U U about when it is possible to take a noisy L-spectrum of s and (cid:88) (cid:88) ≥ |U |≥ |U ∩U|=|U|, x x unambiguously construct its noiseless L(cid:48)-spectrum, for L(cid:48) <L. x∈Sk+1 x∈Sk+1 U U Theorem 3. Suppose that, for some k, we have where the first inequality follows from Lemma 1. By applying L>k+D·M (D,k+1). (4) s Hall’s theorem, Claim 1 follows, implying that, R (s) = k+1,0 Then, for any sequence ˆs that is consistent with RL,D(s), Rk+1,0(ˆs). Therefore, to conclude the proof of Theorem 3, we R (ˆs)=R (s). just need to prove Lemma 1. k+1,0 k+1,0 Proof of Lemma 1: Let U = {t ,...,t } ⊂ V , where Theorem 3 says that, by finding a consistent assembly of x 1 q ˆs t ,...,t are distinct and ˆsk+1 = ... = ˆsk+1 = x. Consider one RasLl,oDn(gs)a,swke scaatnisfioebstai(n4)t.hTeh(enroeifsoerlee,ssw)h(ekn+L1)>-sp(cid:96)˜ectr(usm,Do)f, sif, s1uch blqock ˆskt+1, for t ∈t1{t1,...,tq}. Ttqhere are L − k reads crit that cover ˆsk+1 in ˆs, as illustrated in Fig. 5. These are the we let k(cid:63) be the minimizer in (3), we have that L > k(cid:63) +D· t reads given by r , for n = 0,1,...,L − k − 1. Notice M (D,k(cid:63)+1) and, by Theorem 3, any ˆs that is consistent with σ−1(t−n) s that read r was originally obtained from the segment R (s) has the same (k(cid:63)+1)-spectrum R (s). But since, σ−1(t−n) L,D k(cid:63)+1,0 sL fromthetruesequences.Theconsistencyrequirement k(cid:63) + 1 > (cid:96)crit(s), Theorem 1 implies that there is only one σ−1(t−n) sequence that is consistent with Rk(cid:63)+1,0(s), and we must have on σ thus implies that dH(sLσ−1(t−n),sLt−n) ≤ D. Moreover, if ˆs=s. This proves Theorem 2. we just focus on the (k +1)-block corresponding to ˆsk+1, we t Next, we turn to the proof of Theorem 3. Suppose that we have dH(skσ+−11(t−n)+n,skt+1)=dH(skσ+−11(t−n)+n,x)≤D, which pick some k satisfying (4) and that σ is a consistent assembly holds for each t∈{t ,...,t } and n=0,...,L−k−1. 1 q sσk+−11(t−3)+3 sσk+−11(t)=sσk+−11(t−2)+2 sσk+−11(t−1)+1 s s sk+1=x τj sˆ σ stk1+1 =x stk2+ 1 =x ... stkq+ 1 =x sˆ Fig. 6. If |Bτj| ≥ D+1, at least D+1 of the reads that cover one of sˆtk+1 ˆskt1+1,...,ˆsktq+1 inˆsalsocoverskτj+1 ins(inthisexample,D=3).Sincethere Fig. 5. For an arbitrary length-(k+1) block B in sˆ, the L−k reads that areatmostD erasuresperbaseins,wemusthaveskτj+1=x. completely cover B according to the assembly σ are shaded (in this example, L = 6 and k = 2). By mapping these L−k reads back to s, we find the VI. CONCLUDINGREMARKS corresponding (k+1)-blocks in s given by sk+1 for j = 0,1,2,3. σ−1(t−j)+j Our results show that for several actual genomes, if we are in Noticethat,inthisexample,σ−1(t)=σ−1(t−2)+2,becausereadsrσ−1(t) a dense-read model with reads 20-40% longer than the noiseless andrσ−1(t−2)+2 arealignedtoeachotherinthesamewayinsandˆs. requirement (cid:96) (s), perfect assembly feasibility is robust to crit erasures at a rate of about 10%. While this is not as optimistic If we now consider the set of all such (k+1)-blocks in s as the message from [7], we emphasize that we consider an (cid:110) (cid:111) S = sk+1 :n=0,...,L−k−1,i=1,...,q , (5) adversarial error model. When errors instead occur at random σ−1(ti−n)+n locations, it is natural to expect less stringent requirements. since d (y,x) ≤ D for each y ∈ S, we have that ρ(S) ≤ D. Another message provided by our results deals with error H Hence, if we let correction. Most current sequencing technologies employ error correction algorithms based on aligning reads to form clusters T =(cid:8)σ−1(t −n)+n:0≤n≤L−k−1,1≤i≤q(cid:9) i and outputing a cleaned-up read for each cluster. However, the spectrum error correction result from Theorem 3 suggests that be the starting positions of these blocks in s, T must satisfy a “global” approach to generating cleaned-up reads (based on |T|≤M (D,k+1). Now consider the set of (n,i) pairs s finding a consistent assembly and looking at its spectrum) may B ={(n,i):0≤n≤L−k−1,1≤i≤m}. perform better than cluster-based, or local, error correction. A direction for future work is to replace the dense-read model WewilldefineapartitiononBaccordingtothevalueofσ−1(t − i with a shotgun read model. While the L-spectrum approach n)+n. More precisely, we will let is motivated by the high-throughput of current technologies, it B =(cid:8)(n,i)∈B :σ−1(t −n)+n=τ(cid:9), bypasses the question of the actual coverage depth required for τ i assembly. As was the case in [3], we expect the read length for τ ∈T. It is clear that {Bτ}τ∈T is a partition of B. We claim requirementsfromthedense-readmodeltotranslateintobridging that there exist distinct τ1,...,τq ∈ T such that |Bτj| ≥ D+1, conditions in the shotgun model, allowing one to compute the for j = 1,...,q. Suppose by contradiction that this is not the coveragerequiredforperfectreconstructionwithhighprobability. case, and we have at most q − 1 parts B with |B | ≥ D + τ τ ACKNOWLEDGMENT 1. Notice that, since σ : (1,...,G) → (1,...,G) is one-to-one, σ−1(t −n)+n (cid:54)= σ−1(t −n)+n if t (cid:54)= t , and, for any This work is partially supported by the Center for Science i j i j τ, we must have |B | ≤ L − k. Therefore, since (4) implies of Information (CSoI), an NSF Science and Technology Center, τ L−k−1≥D·M (D,k+1), under grant agreement CCF-0939370. s (cid:88) REFERENCES |B |≤(q−1)(L−k)+(|T|−q+1)D τ [1] S.L.Salzberg,“Mindthegaps,”Naturemethods,vol.7,no.2,pp.105–106, τ∈T 2010. ≤(q−1)(L−k)+D·M (D,k+1) [2] E. Myers. (2014, Feb.). [Online]. Available: https://twitter.com/ s thegenemyers/status/437349388676263937 =q(L−k)−1. [3] G.Bresler,M.Bresler,andD.Tse,“Optimalassemblyforhighthroughput shotgunsequencing,”BMCBioinformatics,2013. (cid:80) Butsince τ∈T |Bτ|=|B|=q(L−k),wehaveacontradiction. [4] E. Ukkonen, “Approximate string matching with q-grams and maximal Now consider the segments sk+1 with |B | ≥ D + 1, for matches,”TheoreticalComputerScience,vol.92,no.1,1992. τj τj [5] P. Pevzner, “DNA physical mapping and alternating Eulerian cycles in j = 1,...,q. Since τ ,...,τ are all distinct, these segments start 1 q coloredgraphs,”Algorithmica,vol.13,pp.77–105,1995. at different points in s. Moreover, since |Bτj| ≥ D +1, each [6] S.L.Salzberg,A.M.Phillippy,A.Zimin,D.Puiu,T.Magoc,S.Koren,T.J. sk+1 is covered by D+1 reads from the reads that cover ˆsk+1, Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, G. Marcais, M. Pop, τj ti and J. A. Yorke, “GAGE: a critical evaluation of genome assemblies and i = 1,...q. Notice that these must be distinct reads from the assemblyalgorithms,”GenomeResearch,vol.22,no.3,pp.557–567,2012. multiset R (s). This is because two distinct pairs (n,i) and [7] A.Motahari,K.Ramchandran,D.Tse,andN.Ma,“OptimalDNAshotgun L,D (m,j) in B must have n(cid:54)=m, and the corresponding reads are sequencing: Noisy reads are as good as noiseless reads,” Proc. of IEEE τ InternationalSymposiumonInformationTheory,pp.1640–1644,2013. r = r and r = r , which are distinct σ−1(ti−n) τ−n σ−1(tj−m) τ−m [8] A.Motahari,G.Bresler,andD.Tse,“InformationtheoryofDNAshotgun reads (not necessarily different sequences from ΣL). Finally, as sequencing,”IEEETransactionsonInformationTheory,vol.59,no.10,pp. illustrated in Fig. 6, we note that, since there are at most D 6273–6289,Oct.2013. [9] P. E. C. Compeau, P. Pevzner, and G. Tesler, “How to apply de Bruijn erasures per base in s, we have that skτj+1 = x, for j = 1,...,q. graphstogenomeassembly,”NatureBiotechnology,vol.29,2011. We conclude that |δ(U)|≥q. [10] [Online].Available:http://mummer.sourceforge.net/