ebook img

Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers PDF

0.5 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers

Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers Jarek Duda,∗ Wojciech Szpankowski,† and Ananth Grama‡ Department of Computer Science and Center for Science of Information, Purdue University, USA Nanopore sequencers are emerging as promising new platforms for high-throughput sequencing. As with other technologies, sequencer errors pose a major challenge for their effective use. In this paper, we present a novel information theoretic analysis of the impact of insertion-deletion (indel) errorsinnanoporesequencers. Inparticular,weconsiderthefollowingproblems: (i)forgivenindel error characteristics and rate, what is the probability of accurate reconstruction as a function of sequencelength; (ii)whatisthenumberof‘typical’sequenceswithinthedistortionboundinduced by indel errors; (iii) using replicated extrusion (the process of passing a DNA strand through the nanopore), what is the number of replicas needed to reduce the distortion bound so that only one typical sequence exists within the distortion bound. Our results provide a number of important insights: (i) the maximum length of a sequence that 6 canbeaccuratelyreconstructedinthepresenceofindelandsubstitutionerrorsisrelativelysmall;(ii) 1 thenumberoftypicalsequenceswithinthedistortionboundislarge;and(iii)replicatedextrusionis 0 aneffectivetechniqueforuniquereconstruction. Inparticular,weshowthatthenumberofreplicas 2 isaslowfunction(logarithmic)ofsequencelength–implyingthatthroughreplicatedextrusion,we n can sequence large reads using nanopore sequencers. Our model considers indel and substitution a errors separately. In this sense, it can be viewed as providing (tight) bounds on reconstruction J lengths and repetitions for accurate reconstruction when the two error modes are considered in a 1 single model. 1 PACSnumbers: ] T I I. INTRODUCTION flow is disrupted. This disruption of the current flow is . s monitored, and used to characterize the analyte. This c generalprinciplecanbeusedtocharacterizenucleotides, [ The past few years have seen significant advances in small molecules, and proteins. Complete solutions based sequencing technologies. Sequencing platforms from Il- 1 on this technology are available from Oxford Nanopore lumina, Roche, PacBio and other vendors are commonly v Technologies [11]. In this platform, a DNA strand is ex- 0 available in laboratories. Accompanying these hardware truded through a protein channel in a membrane. The 2 advances, significant progress has been made in statisti- rate of extrusion must be slower than current measure- 4 cal methods, algorithms, and software for tasks ranging 2 from base calling to complete assembly. Among the key ment (sampling) for characterizing each base (or groups 0 of small number of bases, up to four, in the nanopore at distinguishing features of these sequencing platforms are . any point of time). 1 their read lengths and error rates. Short read lengths 0 poseproblemsforsequencinghigh-repeatregions. Higher Inprinciple,nanoporeshaveseveralattractivefeatures 6 errorrates,ontheotherhand,requireoversamplingtoei- – long reads (beyond 100K bases) and minimal sam- 1 thercorrect,ordiscarderroneousreadswithoutadversely ple preparation. However, there are potential challenges : v impacting sequencing/ mapping quality. Significant re- thatmustbeovercome–amongthem, theassociateder- i search efforts have studied tradeoffs of read-length, error ror rate. The extrusion rate of a DNA strand through X rates, andsequencingcomplexity. Anexcellentsurveyof a protein channel is controlled using an enzyme [12]. r a these efforts is provided by Quail et al [14]. This rate is typically modeled as an exponential distri- More recently, nanopores have been proposed as plat- bution. When a number of identical bases pass through formsforsequencing. Nanoporesarefabricatedeitherus- thenanopore, theobserved(non-varying)signalmustbe ing organic channels (pore-forming proteins in a bilayer) parsed to determine the precise number of bases. This or solid-state material (silicon nitride or graphene). An results in one of the dominant error modes for nanopore ionic current is passed through this nanopore by estab- sequencers. Specifically, insertion-deletion errors in such lishing an electrostatic potential. When an analyte si- sequencers are reported to be about 4% [12, 15]. More multaneously passes through the nanopore, the current recently, higher error rates of approximately 12% in ad- dition to a 6% substitution error have been reported [7]. The high error rate can be handled using replicated reads for de-novo assembly, or through algorithmic tech- ∗Electronic address: [email protected]; Institute of Computer niques using reference genomes. The Oxford Nanose- Science, Faculty of Mathematics and Computer Science Jagiel- quencer claims a scalable matrix of pores and associated lonianUniversity,Poland †Electronicaddress: [email protected] sensors using which replicated reads can be generated. ‡Electronicaddress: [email protected] Alternately, other technologies based on bi-directional 2 extrusionhavebeenproposed. Ineithercase, twofunda- by altering the number of repeated bases. If the output mentalquestionsarisefordenovoassembly: (i)forsingle blocksizek(cid:48) isnotthesameastheinputblocksizek, an reads, what is the bound on read length that can be ac- indel error occurs. Specifically, k(cid:48) < k corresponds to a curately reconstructed using a nanopore sequencer with deletion error, and k(cid:48) > k to an insertion error. Please knownerrorrates;and(ii)whatisthenumberofreplicas note that this model does not account for substitution needed to accurately reconstruct the sequence with high errors. probability (analytically defined). In our analysis we as- We model the sequencing process (both the sequencer sume that each fragment is read multiple times. Since it and the associated processing) as a channel. A chan- isnotcurrentlypossibletoexercisesuchfinegraincontrol nel in information theory is a model (traditionally for a over the nanopore to read the same sequence multiple storage or communication device, but in our case, used times, we can achieve this through PCR amplification more generally) for information transfer with certain er- and resulting reads from multiple copies. These reads ror characteristics. The input sequence of blocks is sent canbealignedtoachievethesameeffectasreadingfrag- into this channel. The error characteristics of the chan- ments multiple times. Note that the alignment problem nel transform a block of k >0 characters into a block of is simpler here owing to longer reads. k(cid:48) characters. This transformation is modeled as a dis- Inthispaper,wepresentanovelinformationtheoretic tribution: k(cid:48) =G(k,P), where G is the distribution and analysis of the impact of indel and substitution errors P,theassociatedsetofparameterstunedtothesequenc- in nanopore sequencers. We model the sequencer as a ingplatform. Intypicalscenarios, thedistributionpeaks sticky insertion-deletion channel. The DNA sequence is at k and decays rapidly on either side. The distribution fedintothischannelandtheoutputofthechannelisused may be asymmetric around k depending on relative fre- to reconstruct the input sequence. Using this model, we quency of insertion and deletion errors. We refer to such solve the following problems: (i) for given error char- achannelasastickychannelifitmaintainsthestructure acteristics and rate, what is the probability of accurate of blocks: k(cid:48) >0. reconstructionasafunctionofsequencelength;(ii)what a. Insertion-Deletion Channels In ideal communi- isthenumberof‘typical’sequenceswithinthedistortion cation systems, one often assumes that senders and re- boundinducedbyindelsandsubstitutions;and(iii)what ceivers are perfectly synchronized – i.e., each sent bit is is the number of replicas needed to reduce the distortion read by the receiver. However, in real systems, such per- boundsothatonlyonetypicalsequenceexistswithinthe fect synchronization is often not possible. This leads to distortion bound (unique reconstruction). sentbitsmissedbythereceiver(adeletionerror),orread Ourresultsprovideanumberofimportantinsights: (i) morethanonce(aninsertionerror). Suchcommunication the maximum length of sequence that can be accurately systems are traditionally modeled as insertion-deletion reconstructedinthepresenceoferrorsisrelativelysmall; channels. Formally, an independent insertion channel is (ii)thenumberoftypicalsequenceswithinthedistortion one in which a single bit transmission is accompanied bound induced by errors is large; and (iii) the number with the insertion of a random bit with a probability of replicas required for unique reconstruction is a slow p. An independent deletion channel is one in which a function (logarithmic) of the sequence length – implying transmitted bit can be deleted (omitted from the out- that through replicated extrusion, we can sequence large put stream) with a probability p(cid:48). An insertion-deletion reads using nanopore sequencers. The bounds we derive channelcontainsbothinsertionsanddeletions[5]. Please are fundamental in nature – i.e., they hold for any re- note that a number of basic characteristics of insertion- sequencing/ processing technique. deletion channels, such as their capacity, are as-yet un- known in information theory literature as well. We consider a variant of the independent insertion- II. APPROACH deletion model that is better suited to nanopore se- quencers. In particular, we recognize the primary source In this section, we present our model and the underly- of error in nanopore sequencers is associated with dis- ingconceptsininformationtheorythatprovidethemod- ambiguating the exact number of identical bases pass- eling substrates. We define notions of a channel, recon- ing through the nanopore. We modify the indepen- struction, an insertion-deletion channel, and distortion dent insertion-deletion channel to the sticky insertion- bound. Wethendescribehowtheseconceptsaremapped deletion channel described above (Figure 1). Coinciden- to the problem of sequence reconstruction in nanopore tally, the analysis of this block modification insertion- sequencers. deletion channel is easier – as we demonstrate in this Ourbasicmodelforananoporesequencerisillustrated paper. in Figure 1. A DNA sequence is input to the nanopore b. Typical Sequences. There are 4n distinct nu- sequencer. This sequence is read and suitably processed cleotide sequences of n bases, each generated with a to produce an output sequence. We view the input se- corresponding (idealized) probability of 4−n. For con- quence as a sequence of blocks. Each block is comprised venience, we can also view this as probability as 2−2n, ofavariablenumber(k)ofidenticalbases. Thenanopore withtheunderstandingthatifeachbaseisequallylikely, sequencer potentially introduces errors into each block we would need two bits for each base. However, from 3 FIG. 1: Overview of the proposed channel and its correspondence with a sequencer. asymptotic equipartition property (AEP), we know that rapidly with the number of extrusions, thus enabling ac- there is a typical set such that the probability of gen- curateand/oruniquereconstructionwithrelativelysmall erating a sequence belonging to this set approaches 1. number of replicas. In other words, while there may be sequences outside of this set whose individual probability may be high, their numberissmallenoughthatthetotalprobabilityisdom- III. METHODS inated by the sequences in this typical set. Furthermore, we know that the probability of drawing a sequence of A. Notation and Theoretical Model lengthnfromthissetisasymptoticallygivenby2−H(X)n, whereH(X)istheentropyofthesource. Comparingthis Theinputsequencetothechannel/sequencerisdrawn expression with the sequence probability assuming each from an alphabet A. For DNA sequencing, A = base is equally likely (2−2n), we note the unsurprising {A,T,C,G}. The alphabet size |A| is denoted by m. conclusion that for our idealized case H(X) = 2. How- We assume that an n length input sequence X is inde- ever, a number of studies have shown that the entropy pendent and identically distributed (i.i.d.); i.e., each se- of living DNA is in fact much lower, as low as 1.7 or quence has the same probability distribution as the oth- below [10]. This suggests that the number of typical se- ers and all sequences are mutually independent. Mathe- quences is in fact much less than the number of total se- (cid:80) matically, probability Pr(x =s)=:p , p =1. quences. Weusethisnotionoftypicalsequencesandthe i s s s typical set in our derivation of the performance bounds. c. Distortion Bound and Unique Reconstruction. 1. Blocking Identical Symbols Viewing a DNA sequence passing through a nanopore as a point (in some very high dimensional space), pass- As mentioned, we view input and output sequences as ingitthroughasequencerintroducesanerror. Thiserror sequences of blocks, with each block comprised of one can be viewed as a hypersphere around the original se- or more identical symbols. If sk denotes k ≥ 1 repeats quence. Each point in the hypersphere corresponds to a of symbol s, we can write sequence X as concatenation possible input sequence with a probability that can be of N ≤ n blocks: X = sk1...skN, such that k > 0, analytically quantified. We refer to this hypersphere as 1 N i s (cid:54)=s . For example, a sequence X = “AATATTAA” a distortion ball. Ideally, we want the radius of this dis- i+1 i is represented in the block form as A2TAT2A2. tortion ball to be as small as possible – containing only We initiate our discussion by enumerating basic sta- a single point. A weaker condition is that the distortion tistical properties of block sequences. We see that for ball contains only a single typical sequence. We refer to s(cid:54)=s(cid:48): the former as an accurate reconstruction and the latter as a unique reconstruction. p Pr(s =s(cid:48)|s =s)= s(cid:48) . (1) A single pass through the nanopore induces a distor- i+1 i 1−p s tion ball whose probability profile can be quantified. We showthattheradiusofthisballcanbereducedbyrepli- Since we have a block of symbols s, the block must cated extrusion through the nanopore. One of the key start with at least one appearance of symbol s. The contributions of this paper is that the radius shrinks probabilitythattheblockwillhavelengthk ≥1isgiven 4 by: q = q = 0, the number of blocks and their order re- k0 0l mainthesamewhileapplyingthechannel. However, the Psk :=Pr(ki =k|si =s)=pks−1(1−ps) (2) lengths of blocks may change. The sequencing problem canthenbereducedtothefollowingtask: determinethe (cid:88) original set of block counts (k ) from the follow- satisfying P =1. i i=1..N sk ing two cases under consideration: (i) the single extru- k≥1 sion case – a single read sequence Y: (l ) ; and (ii) i i=1..N The expected length of a block of symbols s is given by: the multiple extrusion case – each sequence is replicated c∈N times: (lj) for j =1,...c. i i=1..N (cid:88) (cid:88) 1 k := kP = kpk−1(1−p )= . s sk s s 1−p s k≥1 k≥0 C. Accurate Reconstruction from Nanopore Sequencers Theexpectednumberofsymbolssintheentiresequence of length n is np . Therefore, the expected number of s blocks of type s, and of all blocks is, respectively, Wenowconsidertheproblemofconstructingatruese- quence (from the channel or the sequencer) from an ob- Ns :=nps/ks =nps(1−ps), (3) served sequence (from the ionic current measurement). This problem is one of reconstructing a true block se- (cid:32) (cid:33) quencefromanobservedblocksequenceattheoutputof N :=(cid:88)N =n 1−(cid:88)p2 . (4) thestickychannel. Sinceweassumethattheblockstruc- s s ture is not changed by the channel, this problem can be s s solved one block at a time. Specifically, we observe a block at the output of the channel of length l and we B. Sticky Insertion-Deletion Channel (indel) must infer the block length of the corresponding input, k. We refer to this as the problem of finding the most We model the sequencer using a sticky insertion- probable reconstruction. deletionchannel. Inthischannel,ablockofkconsecutive identical symbols, with probability q , is transformed kl 1. Single Run Estimation: Inferring k from a Single l into a block of l copies of the symbol: (cid:88) Pr(sl|sk)=q (s) where q (s)=1. (5) Inthefirstinstanceoftheproblem,wedonotconsider kl kl anyreplicas–i.e.,asingleblockpassesthroughthechan- l nel (sequencer) only once. From this single observation Thetermstickyisusedtoimplythattheblockstructure of l, we must infer k. Assuming n→∞, the probability remains unchanged; i.e., q (s)=q (s)=0. k0 0l that an output block of length l was observed from an In this model, {q (s)} specifies the block length change kl input block of length k is given by: probabilities. We can assume that for given k, q (s) has kl amaximumatl=k; i.e.,theprobabilitythatthereisno P q pk−1q Pr(sk|sl)= sk kl = s kl (7) emrarotiroinn. aInblSoeccktieoxnceIeVd,swaenycoontshiedrer(etrwroonsepoeucsi)fictrcahnosifcoers- (cid:80)k(cid:48)Psk(cid:48)qk(cid:48)l (cid:80)k(cid:48)pks(cid:48)−1qk(cid:48)l for function q: an exponential distribution and indepen- where P =Pr(k =k|s =s)=pk−1(1−p ). sk i i s s dentinsertion-deletions. Inthismodel,theprobabilityof In this case, the most likely input block length k for observing an output sequence Y from our indel channel observedoutputblocklengthl istheonethatmaximizes is given by: pk−1q for given l. Let us denote this k by k . We then s kl l have: N (cid:89) Pr(Y =sl11sl22...slNN|X =sk11sk22...skNN)=i=1qkili(si). pskl−1qkll =mkaxpks−1qkl (8) For example, true sequence X = “AAAATTAAA” = We would expect that kl =l. However, for a general dis- A4T2A3 is read by the sequencer as Y = ”AATTTAA” tribution q, this is not necessarily the case. The natural =A2T3A2 with probability q42(A)·q23(T)·q31(A). condition for q, given by maxkqkl =qll, turns out not to For simplicity, we assume that q is identical for different be sufficient for this purpose. Even with this condition, symbols s: asingleinputblocklengthk maycorrespondtomultiple corresponding observed block lengths l (for some l (cid:54)= l(cid:48), q (s)≡q . (6) k =k ),andsomeinputblocklengthskmightnothave kl kl l l(cid:48) correspondingvaluesoflatall(∀ k (cid:54)=k). Consequently, l l This implies that insertion-deletion errors in sequencing we need a stronger condition. It is easy to see that: are independent of the bases. Please note that this as- sumptionisnotalimitationofourframework. Assuming ∀ q <(p )i·q ⇒ k =l (9) l,i,s l−i,l s l,l l 5 Wecannowdeterminetheprobabilitythatanobserved As before, the probability that we accurately infer all block is properly corrected as: block sizes (accurate sequencing) in analogy to (10) de- creases exponentially as: (cid:88) m := q (0 if k cannot be obtained), k kl l:kl=k 2n(cid:80)sps(1−ps)lg((cid:80)kPskmkc). (14) which reduces to m = q if (9) is satisfied. The ex- k kk Unfortunately the problem of finding k is a complex es- pectednumberofblocksofsymbolsisN =np (1−p ). l s s s timation procedure, and therefore finding the required Their expected total length is np . Therefore, the prob- s number of replicas, c, for unique reconstruction appears ability that we accurately correct all blocks is asymptot- difficult. However, as we show next, using tools from in- ically given by: formation theory, we can estimate this replication rate. (cid:32) (cid:33)Ns Moreimportantly, weshowthatthisreplicationrateisa (cid:89) (cid:88)Pskmk =2n(cid:80)sps(1−ps)lg((cid:80)kPskmk) (10) slow function of sequence length. s k It is easy to see that this probability decreases exponen- D. Fundamental Bounds for Unique tially with the length of the sequence. Stated otherwise, Reconstruction this result shows that the probability that we accurately reconstruct the entire sequence decreases exponentially in the length of the sequence. We rely on an information theoretic approach to com- puting the minimum number of replicas c required for accurate reconstruction. We do this by modifying the 2. Multiple runs: estimating k from multiple l original problem somewhat. Recall that the noise model ofthechannelintroducesadistortionballaroundtheout- putsequence. Itispossiblethatmultipleinputsequences We now investigate how reading the same block mul- belong in this distortion ball – leading to the problem of tiple times can help infer the input block length. We identification of the most probable reconstruction. How- assume that each block is read c times. This can be ever, if we could use replicated extrusion of blocks to done by extruding the same sequence through an array shrink the distortion radius to the point where only one of nanopores. In this case, we have c observed values of l (the ith block length), (lj) for j = 1,...c. The sequence belongs in the ball, we have unique reconstruc- i i i=1..N tion. We use this principle to focus on the problem of probability that the corresponding input block length is numberoftypical sequencesthatbelonginthedistortion k is given by: ball, and find the number of replicas c for which this Pr(sk|sl1,..,slc)= Pskqkl1...qklc . (11) number approaches one. Please note that this problem (cid:80) P q ...q is slightly distinct from the problem of most probable k(cid:48) sk(cid:48) k(cid:48)l1 k(cid:48)lc reconstruction. Let us analogously define k , where l := {l1,..,lc}, as l theinputblocklengthk thatmaximizespk−1·q ...q , s kl1 klc given by: 1. Difference Between the Most Probable and Typical pksl−1qkll1...qkllc =mkaxpks−1qkl1...qklc =mkaxpks−1(cid:89)qk˜lii Reconstruction i Webeginourdiscussionbyhighlightingthedifferences (cid:32) (cid:33) between the most probable and typical reconstructions. =exp max (k−1)ln(p )+(cid:88)˜liln(q ) (12) To gain an intuition about the difference between these s ki k two types of reconstructions, let us briefly look at error i correction of the basic binary symmetric channel (BSC): where ˜li = #{j : lj = i} is empirical probability distri- wesendN bits,eachofthemhasindependentprobability bution. The (cid:80) ˜liln(q ) term is equal to minus cross (cid:15) < 1/2 of being flipped. Observe that we could write i ki entropy between observed frequency {˜li} and probabil- this in the formalism we have introduced for blocks as: i ity distribution expected for a given k: {qki}i. The k,l∈{0,1}, q00 =q11 =1−(cid:15), q01 =q10 =(cid:15). (k−1)ln(p )termfavorsshorterblocks. Foranobserved Obtaining from the output sequence Y ∈ {0,1}N, the s empirical distribution {˜li} , we should choose k corre- most probable input sequence X is simple – it is simply i sponding to the closest probability distribution {q } , X = Y. The probability that this input sequence X is ki i as the one maximizing (k−1)ln(p )+(cid:80) ˜liln(q ). the correct sequence is given by: (1−(cid:15))N = 2Nlg(1−(cid:15)). s i ki However, for large N, we expect that approximately (cid:15)N Theprobabilitythatinputblocklengthk isaccurately bits are flipped. There are determined is given by: (cid:88) (cid:18) (cid:19) mkc := qkl1..qklc. (13) N ≈2Nh((cid:15)) l:kl=k (cid:15)N 6 (where h(p)=−plg(p)−(1−p)lg(1−p)) = hx−h(ps). (16) 1−p s differentwaysofdoingit. Thesearealltypicalcorrections - whichare statisticallydominating forlarge N (Asymp- Theentropyofchoosingblocklengthforsymbolsisgiven totic Equipartition Property). In this case, it is easy to by: see that conditional entropy H(X|Y)=N·h((cid:15)). Having (cid:88) noadditionalinformation,allofthesetypicalcorrections hxs ≡H(ki|si =s)=− Pskln(Psk) are equally probable. Consequently, the probability of k≥1 choosingthecorrectoneisgivenby2−H(X|Y) =2−Nh((cid:15)). Conversely, the definition of the channel says that the probability that a given typical correction (flipped (cid:15)N =−(cid:88)pks−1(1−ps)ln(cid:0)pks−1(1−ps)(cid:1) bits) is the correct one is given by: k≥1 (cid:15)N(cid:15)·(1−(cid:15))N(1−(cid:15)) =2−Nh((cid:15)) =2−H(X|Y) p h(p ) =− s ln(p )−ln(1−p )= s . (17) 1−p s s 1−p – exactly the same as before. s s To summarize, typical corrections correspond to a because (cid:80) kzk =z/(1−z)2. Hamming sphere S(Y,(cid:15)N). The most probable correc- k≥0 We can now express (15) in the block framework: tion corresponds to its center and for unique reconstruc- tion,thissphereshouldreducetoapoint;i.e.,H(X|Y)≈ (cid:88) (cid:88) hx 0. H(Xn)= N (h +hx)=n p (1−p ) s s s s s 1−p s s s 2. Entropy in the Framework of Blocks of Identical Symbols (cid:88) =nhx p =nhx. (18) s Returning to our original problem, by definition of s a typical sequence, the number of typical sequences of Finally, the source entropy H(Xn) = n(cid:80) p ln(1/p ) length n, Xn, grows asymptotically as exp(H(Xn)), canbealternativelyviewedasthesumofentrsopsyofsyms- where bol order (h ) and of block lengths (hx). s s (cid:88) H(Xn)=nhx and hx :=− p ln(p ). (15) s s s 3. InDel Channel with a Single Extrusion Wenowderivethisentropyformulaintheblockframe- Assuming the sticky insertion-deletion channel (indel) work for the asymptotic case (large n). This analysis is described above, the sequence of symbols s is unmod- analogoustotheresultofMitzenmacheretal. [5], where i thenon-asymptoticcaseispresented. Wemustdealwith ified: Yn = sl11...sNlN. Only the block lengths are changed in accordance with the noise model (k → l ). two additional considerations here: our alphabet is not i i Analogously, as in the previous section, we can deter- binary; and the distribution among input symbols is not mine the entropy of joint distribution H(Xn,Yn) as: necessarily uniform. Theinformationcontainedintheinputsequenceinthe (cid:88) block framework: Xn =sk11...skNN can be split into two H(Xn)+H(Yn|Xn)=H(Xn,Yn)= Ns(hs+hxsy) parts – the sequence of symbols (s ), and corresponding s i (19) block lengths (k ). H(Xn) is sum of the two entropies. i where hxy is entropy of pair lengths for input (k) and UsingresultsfromSection(IIIA1),theentropyofselect- s output (l) type s blocks: ing the symbol for succeeding block (s (cid:54)= s ) is given i+1 i by: (cid:88) hxy =− P q ln(P q ) s sk kl sk kl (cid:18) (cid:19) h ≡H(s |s =s)=−(cid:88) ps(cid:48) ln ps(cid:48) k,l≥1 s i+1 i 1−p 1−p s s s(cid:48)(cid:54)=s (cid:88) (cid:88) =hx− P q ln(q ) s sk kl kl k≥1 l≥1 1 (cid:88) = p (ln(1−p )−ln(p )) 1−p s(cid:48) s s(cid:48) s s(cid:48)(cid:54)=s (cid:88) =hx+ P hq (20) s sk k k≥1 1 = 1−ps ((1−ps)ln(1−ps)+hx+psln(ps)) and hqk :=−(cid:80)l≥1qklln(qkl). 7 The entropy of output sequence Y and mutual infor- hxyc−hyc =(cid:80) P × s s k≥1 sk mation I(Xn;Yn)= H(Xn)+H(Yn)−H(Xn,Yn) are given by: × (cid:88) q ..q ln(cid:18)1+ (cid:80)k(cid:48)(cid:54)=kPsk(cid:48)qk(cid:48)l1..qk(cid:48)lc(cid:19). H(Yn)=(cid:88)N (h +hy) l1..lc≥1 kl1 klc Pskqkl1..qklc s s s s Grouping q corresponding to the same i by assigning ki ˜li = #{j : lj = i}, using n! ≈ (n/e)n, the distribution     becomes ((cid:80) ˜li =c): (cid:88) (cid:88) (cid:88) i for hys =−  Pskqklln Pskqkl (21) (cid:18) (cid:19) l≥1 k≥1 k≥1 (cid:88) q ..q = (cid:88) c (cid:89)q˜li kl1 klc ˜l1,˜l2,... ki l1..lc≥1 ˜l1,˜l2,... i≥1 I(Xn;Yn)= =(cid:88)Ns((hs+hxs)+(hs+hys)−(hs+hxsy)) ≈(cid:88)(cid:89)(cid:18)˜lci(cid:19)˜liqk˜lii s ˜l1,..i≥1 =(cid:88)Ns(hs+hxs +hys −hxsy) (22) (cid:88)  (cid:88)˜li (cid:32)˜li/c(cid:33) s = exp−c ln . c q ki ˜l1,.. i≥1 H(Xn|Yn)=H(Xn)−I(Xn;Yn) ThesuminbracketistheKullback-Leibler(asymmetric) (cid:16)(cid:110) (cid:111) (cid:17) divergence: D ˜li/c || {q } - the exponent is =(cid:88)N (hxy−hy) KL i ki i s s s asymptotically(largec)dominatedby˜li/c=q distribu- s ki tion. Tofindanapproximationoftheformulahxyc−hyc, s s we focus only on these distributions: (cid:88) =n p (1−p )(hxy−hy). (23) s s s s hxyc−hyc ≈ s s s Asymptotically, the number of typical corrections is given by exp(H(Xn|Yn)), which grows exponentially in  (cid:80) P (cid:16)(cid:81) qqkl(cid:17)c (cid:88) k(cid:48)(cid:54)=k sk(cid:48) l≥1 k(cid:48)l the length of the sequence n. This directly implies the ≈ Pskln1+ (cid:16) (cid:17)c  exponentially decreasing probability of accurate recon- P (cid:81) qqkl k≥1 sk l≥1 kl struction. We now discuss how the number of typical corrections can be reduced (approaching 1) by reading   the input sequence multiple times (c). The goal of this analysisistoestimatethenumberofextrusionsweshould =(cid:88)Pskln1+ (cid:88) PPsk(cid:48)e−c·dkk(cid:48) perform for unique reconstruction. k≥1 k(cid:48)(cid:54)=k sk (cid:88) (cid:88) 4. InDel Channel with Multiple Extrusions ≈ Psk(cid:48)e−c·dkk(cid:48) k≥1k(cid:48)(cid:54)=k We consider the case of c replicas on each block: where d is the Kullback-Leibler divergence: (lj) for j =1,...c. In this case, we have: kk(cid:48) i i=1..N   hxsyc =− (cid:88) Pskqkl1..qklcln(Pskqkl1..qklc) dkk(cid:48) :=−ln(cid:89)(cid:18)qqk(cid:48)l(cid:19)qkl k,l1,..,lc≥1 l≥1 kl (cid:88) =hxs +c Pskhqk =(cid:88)q ln(cid:18)qkl (cid:19)=D ({q } || {q } ) k≥1 kl q KL kl l k(cid:48)l l k(cid:48)l l≥1 (cid:16) (cid:17) hyc =−(cid:80) (cid:80) P q ·..·q × s l1..lc≥1 k≥1 sk kl1 klc Since H(X|Y) = H(X,Y)−H(Y) and P = pk−1(1− sk s p ), we can write:   s ×ln(cid:88)Pskqkl1 ·..·qklc H(Xn|Ync)=n(cid:88)ps(1−ps)(hxsyc−hysc) k≥1 s 8 ≈n(cid:88)(1−p )2(cid:88) (cid:88) pk(cid:48)exp(−c·d ) (24) IV. EXPERIMENTAL RESULTS s s kk(cid:48) s k≥1k(cid:48)(cid:54)=k We present a simulation study of the implications ≈nexp(−c·dmin)·(cid:88)(1−ps)2psk0(cid:48) (25) of our analysis on real-world sequencing experiments. We consider three models for our channel – the first s model is a sticky channel with exponential distribution where dmin :=min d =d is the minimal value k(cid:48)(cid:54)=k kk(cid:48) k0k0(cid:48) and the second, an independent insertion-deletion chan- of d (if the minimum exists). nel. Finally, we also use these general considerations The distance d describes the similarity between the kk(cid:48) for substitution-only case. In each case, we examine the results of reading blocks having original lengths k and boundonlengthofsequenceforaccuratereconstruction, k(cid:48). It quantifies the likelihood of mistakenly identify- the number of replicas needed for larger reconstructions, ing a k length block as a k(cid:48) length block. The smaller and the Kullback-Leibler divergences. The goal of these it is, the faster is the growth of number of typical cor- studies is to demonstrate that for a wide class of chan- rections (exp(H(Xn|Ync))) with k erroneously replaced nelcharacteristics: (i)thelengthofsequencethatcanbe by k(cid:48). The smallest distance (dmin) corresponds to the accurately reconstructed in single read is small; (ii) the mostlikelymistake,anditasymptoticallydominatesthe number of required replicas for longer reconstructions is growth of typical corrections. a slowly growing function (logarithmic); and (iii) even The use of multiple extrusions (c) allows us to re- in the presence of high indel error rates, nanopore se- duce the exponent in the number of typical corrections quencers can accurately reconstruct sequences with re- exp(H(Xn|Ync)). For unique reconstruction, the num- quired number of replicas. ber of typical corrections must approach 1. Conse- We consider a binary equi-probable input – m = 2, quently, we choose c such that n·exp(−c·dmin) is of p = p = 1/2. The behavior of conditional entropy 0 1 order of 1. From this, we see that the number of ex- H(X|Y) is primarily determined by the q distribution. ks trusions should grow logarithmically in sequence length: The results in this section can be naturally generalized c ≈ ln(n)/dmin. This important result establishes the to any alphabet and probability distribution. feasibilityoflow-overheadsequencingusingnanoporese- quencers. a. Analysisofthesubstitution-onlycase Theframe- A. Exponential Insertion-Deletion Error Model work presented above for indel errors can also be ex- tendedtosubstitution-onlyerrors: areferencebases∈A Wewillfirstconsiderastickychannelwithexponential has probability q to be read as base (cid:96) ∈ A, where s(cid:96) distributionfortheerrorprobabilities–forsome0<q < |A| = 4 is the set of 4 bases. The expected number 1: ofoccurrencesofbasesinasequenceoflengthnisgiven by N˜ =np . We analogously have: 1−q s s q =q|k−l| kl 1+q−qk hxyc−hyc = s s The second term in the product is for normalizing the = (cid:88) q ..q ln(cid:18)1+ (cid:80)s(cid:48)(cid:54)=sqs(cid:48)(cid:96)1..qs(cid:48)(cid:96)c(cid:19) probability. s(cid:96)1 s(cid:96)c q ..q Wefirstconsiderthesingleruncase. Equation(23)al- s(cid:96)1 s(cid:96)c (cid:96)1..(cid:96)c≥1 lows us to calculate conditional entropy H(Xn|Yn), de- scribing the growth in the number of typical corrections: ≈ln1+ (cid:88)(cid:32)(cid:89)(cid:18)qs(cid:48)(cid:96)(cid:19)qs(cid:96)(cid:33)c euxeps(oHf(cXonnd|Yitnio)n).alTehnetrleofptyPfaonrelvoafriFouigsuvreal2uepsreosfenqt.svAasl-- q s(cid:48)(cid:54)=s (cid:96)∈A s(cid:96) suming correction procedure as taking a random typi- cal correction, the Right Panel of this figure presents H(Xn|Ync)=(cid:88)N˜ (hxyc−hyc)≈ the probability of obtaining the right correction. This s s s probability drops exponentially with the length of the s sequence, making a single run approach impractical for (cid:88) (cid:88) longer sequences. ≈n p exp(−c·d ). (26) s ss(cid:48) Thislimitationcanbehandledbyperformingmultiple s s(cid:48)(cid:54)=s extrusions of the same sequence. We use Equation (24) where d =D ({q } ||{q } ). to find conditional entropy in this case. This requires ss(cid:48) KL s(cid:96) (cid:96) s(cid:48)(cid:96) (cid:96) Analogously to the indel case, exp(H(Xn|Ync)) is the finding Kullback-Leibler divergences between {q } dis- kl l size of distortion ball - the number of typical reconstruc- tributions for different original block lengths k. Table 1 tions. For unique reconstruction, we need this size to presents some of these values for q =0.5. be of order one. If there exists dmin = min d , the The minimal distance is dmin =d ≈0.193, and cor- s(cid:54)=s(cid:48) ss(cid:48) 21 numberofreadsrequiredforuniquereconstructionisap- responds to misinterpreting original k = 2 sequence as proximatelyc≈ln(n)/dmin;i.e.,logarithmicinsequence k(cid:48) = 1. This intuitively stronger overlap of the first two length. distributioncanbeobservedintheLeftPanelofFigure3, 9 FIG. 2: Left Panel: Conditional entropy for single extrusion between input sequence (input to the nanopore sequencer) and theoutputsequence(observedsequence). ThisisderivedfromEquation(23). RightPanel: Theprobabilitythatweselectthe correct typical correction for q=0.01 and increasing value of n. k(cid:48)→ 1 2 3 4 5 6 7 k(cid:48) → 1 2 3 4 5 6 7 k↓ k↓ 1 0 0.223 0.665 1.123 1.857 2.512 3.194 1 0 1.879 4.247 6.747 9.319 11.94 14.58 2 0.193 0 0.234 0.694 1.270 1.905 2.568 2 ∞ 0 1.394 3.473 5.760 8.160 10.63 3 0.567 0.220 0 0.233 0.696 1.274 1.909 3 ∞ ∞ 0 1.091 2.931 5.032 7.295 4 1.054 0.644 0.227 0 0.232 0.695 1.273 4 ∞ ∞ ∞ 0 0.886 2.219 4.024 5 1.621 1.181 0.671 0.229 0 0.232 0.694 5 ∞ ∞ ∞ ∞ 0 0.739 2.220 TABLE I: Kullback-Leibler divergence for q = 0.5, expo- nentialdistribution,andvariousoriginalblocklengths: k,k(cid:48). TABLEII:Similarityofqkl fordifferentvaluesofkforthein- dependent insertion-deletion channel. Infinities denotes that Thesmallestvalueallowstoconcludethatweasymptotically asymptotic empirical probability e.g. for k = 2 cannot be need c≈ln(n)/0.193 reads for unique reconstruction. misinterpreted as k=1. containing{q } forthefirst15valuesofk. Thedistance kl l between farther neighboring distributions is nearly the same; i.e., misreading a block of length five nucleotides as six, is as likely as an input block of 100 nucleotides being read as 101. The right panel of Figure 3 shows conditional entropy as a function of number of replicas c, for q =0.5. There Figure 4 shows the first 15 distributions (q ) for this kl are important observations drawn from this figure: (i) model. Itisillustrativetonotethatunlikemanyprevious thenearlylinearnatureofthecurveshowsthatthenum- modelsinwhicherrorratesareinvariantonblocklength, ber of replicas is almost logarithmic in the read length larger block lengths have higher insertion-deletion error (H(Xn|Yn)/n asymptotically behaves as exp(−c·dmin); rates in our model. Table 2 presents the approximated and (ii) for realistic reconstruction lengths (say, 100K first values of similarities of q for different values of k, kl bases), the number of replicas is relatively small (less for(cid:15)=0.06,derivedfromJainetal.[7]. Theinfiniteval- than 60). These are important results that establish the ues in the table correspond to difference in support (one feasibility of nanopore sequencers for accurate low-cost of values is zero). For example, the value k =2 can lead construction of long reads. toblocksoflength1,2,3,or4,whilek =1canleadonly to blocks of length 1 or 2. Many reads of length k = 2 will lead to some blocks of length 3 or 4. These are in B. Independent Insertion-Deletion Channel Model additiontooriginalblocksoflengthk =1,whichmaybe misinterpreted in the assumed model. We note that the To demonstrate the robustness of our results we now distributions become closer to their own neighbors as k consider a different channel model – the independent grows: lim d =0, the minimal nonzero distance k→∞ kk+1 insertion-deletion channel. In this model, for each nu- dmin does not exist. In other words, distinguishing be- cleotide, there is probability (cid:15) > 0, that the symbol is tweenk andk+1becomesmoredifficultwithincreasing deleted. There is also an identical probability that the k, and their difference vanishes asymptotically. Conse- symbol is duplicated. It follows that there is a probabil- quently, as we can observe from the right panel of Fig- ity 1−2(cid:15) that the symbol is sequenced without an error. ure 4, the required number of replicas grows slower than For this model, q , for given k, is the convolution of k logarithm of sequence length. Figure 4 shows the con- kl such random variables, additionally truncated to enforce ditional entropy of the input and output sequences for that it is a sticky channel; i.e., q = 0. For large k, different numbers of replicas (c). long sequences (100K k0 qkl approa√ches the Gaussian distribution with standard nucleotides), we need even fewer replicas (less than 20, deviation 2(cid:15)k. in this case, compared to 60 for the exponential model). 10 FIG.3: LeftPanel: {q } distributionsforq=0.5andk=1to15. RightPanel: (24)approximationforq=0.5anddifferent kl l numbers of copies c. For example for n=105 length sequence we need less than 60 reads for unique reconstruction. FIG. 4: Left Panel: The first 15 q distributions for (cid:15) = 0.06 for the independent insertion-deletion channel. Right Panel: Approximation of joint entropy for (cid:15) = 0.1 and different numbers of copies c (from Equation (24). For example for n = 105 length sequence we need less than 20 reads for unique reconstruction. C. Substitution-only case Churchill and Waterman [3] address similar questions to ours for substitution channels. A key differentiating We finally use our approach to the substitution-only aspect of their work is that they reconstruct a sequence case from Section IIID4a. Figure 5 contains probabil- of read values for a given position as a consensus; i.e., ities of substitution from [7]. In this case, the required the most frequent value across replicates. In contrast, number of reads is given by: wefocusofthemostprobablereconstruction. Thisleads tocomparisonofobtainedempiricalprobabilitydistribu- (cid:88) (cid:88) c≈n p exp(−c·d ) tion of reads for given value, with probability distribu- s kk(cid:48) tions expected for different reference values (using cross s s(cid:48)(cid:54)=s entropy, equation (12)). The final formulae for recon- what is c≈ln(n)/2.89 in the assumed case. structionboundscontainKullback-Leiblerdivergencebe- tweenprobabilitydistributionsfordifferentreferenceval- uesquantitativelydescribingthedifficultyofdistinguish- V. RELATED RESEARCH ing them. Error characteristics and models for nanopore se- Technologies underlying nanopore sequencers quencers have been recently studied by O’Donnell et have been investigated for over a decade [1, 4]. al [12]. In this study, the authors investigate error char- Commercial platforms based on these technolo- acteristics, andbuildastatisticalmodelforerrors. They gies have only recently been announced – with usethismodeltoshow, throughasimulationstudy, that Oxford Nano being the leading platform. An ex- replicatedextrusioncanbeusedtoimproveerrorcharac- cellent introduction to this platform is available teristics. Inparticular,theyshowthatusingtheirmodel, at: https://www.nanoporetech.com/technology/ it is possible to achieve 99.99% accuracy by replicating analytes-and-applications-dna-rna-proteins/ the read 140 times. This empirical study provides ex- dna-an-introduction-to-nanopore-sequencing. cellent context for our analytical study, which provides There have been preliminary efforts aimed at char- rigorous bounds and required replication rates. acterizing the performance of nanopore sequencing platforms in terms of error rate, error classification, and There have been a number of efforts aimed at ana- run lengths [6, 11, 12, 15]. A consensus emerges from lyzing the error characteristics of current generation of these studies that the primary error mode in nanopore sequencing technologies, including the 454 and PacBio sequencers is deletion errors and that the error rate is sequencers [2, 9, 13]. These efforts are primarily aimed approximately4%withareadlengthofover150Kbases. attheproblemofalignmentofrelativelyshortfragments These studies provide important data that is used to - of localizing their positions in a longer sequence. In build our insertion-deletion channel. contrast, due to longer reads obtained by nanopore se-

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.