
Multiple Sequence Alignment System for Pyrosequencing Reads

Fahad Saeed¹, Ashfaq Khokhar¹, Osvaldo Zagordi², and Niko Beerenwinkel²

¹ Department of Electrical and Computer Engineering, University of Illinois at Chicago, IL, USA
² Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

arXiv:0901.2753v1 [q-bio.GN] 19 Jan 2009

Abstract. Pyrosequencing is among the emerging sequencing techniques, capable of generating up to 100,000 overlapping reads in a single run. This technique is much faster and cheaper than existing state-of-the-art sequencing techniques such as Sanger. However, the reads generated by pyrosequencing are short in size and contain numerous errors. Furthermore, each read has a specific position in the reference genome. In order to use these reads for any subsequent analysis, the reads must be aligned. Existing multiple sequence alignment methods cannot be used, as they do not take into account the specific positions of the sequences with respect to the genome and are highly inefficient for large numbers of sequences. Therefore, the common practice has been to use either simple pair-wise alignment, despite its poor accuracy for error-prone pyroreads, or computationally expensive techniques based on sequential gap propagation. In this paper, we develop a computationally efficient method based on domain decomposition, referred to as pyro-align, to align such large numbers of reads. The proposed alignment algorithm accurately aligns the erroneous reads in a short period of time, which is orders of magnitude faster than any existing method. The accuracy of the alignment is confirmed from the consensus obtained from the multiple alignments.

1 Introduction

Pyrosequencing is among the emerging sequencing techniques developed for determining the sequences of DNA bases from a genome. It is capable of generating up to 100,000 overlapping reads in a single run.
However, a multitude of factors, such as relatively short read lengths (as of 2008, an average of 100-250 nt compared to 800-1000 nt for Sanger sequencing), lack of a paired-end protocol, and limited accuracy of individual reads for repetitive DNA, particularly in the case of homopolymer repeats, present many computational challenges [14] that must be addressed to make pyrosequencing useful for biology and bioinformatics applications.

For more than a decade, Sanger sequencing has been the cornerstone of genome sequencing, including that of microbial genomes. Improvements in DNA sequencing techniques, advances in data storage and analysis, and developments in bioinformatics have reduced the cost to a mere USD 8,000-10,000 per megabase of high-quality genome draft sequence. However, the need for more efficient and cost-effective approaches has led to the development of new sequencing technologies such as the 454 GS20 sequencing platform. It is a non-cloning, pyrosequencing-based platform that is several orders of magnitude faster than the Sanger machines. However, despite its enormous advantage in terms of time and money, the new technology will not be able to replace the current Sanger technology unless the reads it generates are properly aligned with respect to the reference genome.

The key issues associated with the use of the pyrosequencing technique are as follows:

Read Length: The read length is expected to be on the order of 100-250 bp on average. This is much shorter than the reads of state-of-the-art Sanger machines, which consistently produce read lengths on the order of 800-900 bp or more.

Orientation: This issue is common to most sequencing technologies. Each DNA helix is broken into the original strand and its Watson-Crick complement. These are further broken up into pieces, and there is generally no way to tell from which of the two strands a given piece originated. The problem is more severe in, and is usually encountered during, genome reconstruction.

Errors: Each individual DNA sequence or read is likely to have errors in the form of insertions and deletions.
It may also carry mutations, and the pyrosequencer may itself make errors. These errors correspond to homopolymer effects, including extensions (insertions), incomplete extensions (deletions), and carry-forward errors (insertions and substitutions). Insertions are considered the most common type of error (36% of errors), followed by deletions (27%), ambiguous bases, Ns (21%), and substitutions (16%) [28].

Fig. 1. Pairwise alignment of the reads with the reference genome is shown

For most practical purposes, pyroreads without any post-processing are of limited use. One of the most widely required preprocessing steps for many applications, including haplotype reconstruction [12][13], microbial community analysis [3], and analysis of genes for diseases [2], is the alignment of these reads with the wildtype. For important applications such as viral population estimation or haplotype reconstruction of various viruses, e.g., HIV in a population, scientists usually have information about the wildtype genome of the virus. While for other sequencing technologies, such as Sanger, simple pair-wise alignment with the wildtype may produce a reasonable multiple alignment, in the case of pyrosequencing the variation in the haplotype population, compounded with the errors introduced in the reads, does not allow a feasible multiple alignment by simple pair-wise alignment. Fig. 1 depicts simple pair-wise alignment of pyrosequencing reads with a reference genome. We assert that an accurate and workable multiple alignment is often necessary for a variety of applications and statistical packages to work with these pyroreads, as demonstrated in [12][13][3][2].

In theory, alignment of multiple sequences can be achieved using pair-wise alignment, with each pair receiving an alignment score. For an optimal alignment, however, the sum of all the pair-wise alignment scores needs to be maximized, which is an NP-complete problem [15].
Towards this end, dynamic-programming-based solutions of $O(L^N)$ complexity have been pursued, where $N$ is the number of sequences and $L$ is the average length of a sequence. Such exact optimizations are not practical for a large number of sequences, as is the case in pyrosequencing, which makes heuristic algorithms the only feasible option. The literature on these heuristics is vast and includes widely used works such as Notredame et al. [16], Edgar [18], Thompson et al. [17], Do et al. [22], and Morgenstern et al. [20]. These heuristics are complex combinations of ad hoc procedures with some flavor of dynamic programming. Despite their usefulness, these widely used heuristics scale very poorly with an increasing number of sequences.

For multiple alignment of pyroreads, 'out of the box' use of these heuristics is not feasible for two main reasons: 1) the pyrosequencing reads can be very large in number (up to 100,000 usable reads in a single run with a Roche GS20 platform), and 2) the heuristics do not take into account the positions of the reads with respect to the reference genome. Additional factors, such as short lengths and errors, and the fact that these reads have preceding or trailing 'gaps', pose further alignment challenges. In [12], an alignment technique based on sequential gap propagation has been used. This technique is computationally expensive, and its alignment quality decreases as the mutation rate increases.

In this paper, we present a computationally efficient algorithm, pyro-align, specifically designed for multiple alignment of DNA reads obtained from pyrosequencing. The proposed algorithm is based on a novel domain decomposition concept and is therefore capable of aligning a very large number of pyrosequences. It takes into account the position of the reads with respect to the reference genome, and assigns weights to the leading and trailing gaps of the reads.

The objective of our work is to develop a multiple alignment system for short, error-prone reads, such that the errors in the alignment are 'highlighted' and the system is able to handle a large number of reads, as may be expected from pyrosequencing.
We assume that the reads may be generated from one or many genomes, with 'forward' orientation. We also assume that the reference genome (or its wildtype) from which the reads are generated is available, as is generally the case for haplotype reconstruction. In our experiments, we have used the pol gene of the HIV virus as the reference genome (with a length of 1970 bp) and the simulator ReadSim [11] to generate the reads. The algorithm uses concepts from domain decomposition and parallel multiple alignment techniques [1, 21].

For the sake of completeness, let us first formally define the Multiple Sequence Alignment problem in its generic form, without going into issues such as scoring functions. Let $N$ sequences be presented as a set $S = \{S_1, S_2, S_3, \cdots, S_N\}$ and let $S' = \{S'_1, S'_2, S'_3, \cdots, S'_N\}$ be the aligned sequence set, such that all the sequences in $S'$ are of equal length, have maximum overlap, and the score of the global map is maximum according to some scoring mechanism suitable for the application.

A perfect multiple alignment for pyroreads would be one in which the reads are aligned with each other such that the position of the reads with respect to the reference genome is conserved, and the reads have maximum overlap and are of equal length after the alignment, including leading and trailing gaps.

The intuitive idea behind the proposed pyro-align algorithm is to first place the reads in the correct orientation with respect to the reference genome and then use progressive alignment to achieve the final alignment. For efficient progressive alignment, the correctly placed reads are reordered according to their starting positions, and a similarity metric of low computational complexity is extracted from this ordering. The similarity metric is then used to align pairs of aligned reads using a hierarchical decomposition strategy.
The proposed multiple alignment algorithm takes advantage of the characteristics of pyroreads and brings in techniques from data structures and parallel computing to realize a solution of low complexity in terms of time and memory.

The proposed alignment algorithm, pyro-align, consists of the following two main components:

1. Semi-global alignment
2. Hierarchical progressive alignment
   (a) Reordering of reads to generate the guidance tree
   (b) Pair-wise and profile-profile alignment

Each component is designed with the characteristics of pyroreads in mind and is described in the following sections along with its justification.

1.1 Semi-Global Alignment

The first step is to determine the position of each read with respect to the reference genome. If this step is omitted, there are a number of alignments that would be valid in isolation but inaccurate when analyzed in the global context. A read that is not constrained in terms of position may give the same score (SP score) for the multiple alignment but would be incorrect in the context of the reference. To accomplish the task of 'placing' the reads in the correct context with respect to the reference genome, we employ a semi-global alignment procedure.

Semi-global alignment is also referred to as overlapping alignment because the sequences are globally aligned ignoring the start and end gaps. For semi-global alignment we use a modified version of the Needleman-Wunsch algorithm [5].

The modification of the basic Needleman-Wunsch algorithm is required to handle the leading and trailing gaps of the reads when aligning them to the reference genome. If the leading and trailing gaps are not ignored, then, given the short length of the reads, the alignment scores would be dominated by these gaps, giving an inaccurate alignment with respect to the genome.

Let the two sequences to be aligned be $s$ and $t$, and let $M(i,j)$ denote the score of the optimal alignment. Since we do not wish to penalize the starting gaps, we modify the dynamic programming matrix by initializing the first row and first column to zero.
The gaps at the end are also not to be penalized. Let $M(i,j)$ represent the optimal score of aligning $s_1, \cdots, s_i$ with $t_1, \cdots, t_j$, and let $m$ and $n$ be the lengths of $s$ and $t$. Then $M(m,j)$ is the score of optimally aligning $s$ with $t_1, \cdots, t_j$. The optimal alignment is therefore detected as the maximum value on the last row or last column. The best score is thus

$$M(i^{*},j^{*}) = \max_{k,l} \big( M(k,n),\, M(m,l) \big),$$

and the alignment can be obtained by tracing the path back from $M(i^{*},j^{*})$ to $M(0,0)$. For additional details on semi-global alignment we refer the reader to [8].

Once each read has been semi-globally aligned with the reference genome, we obtain reads with leading and trailing gaps, where the first character after the leading gaps marks the starting position of the read with respect to the reference genome. The information from these alignments is stored in hash tables that are later used to reorder the reads for alignment.

2 Hierarchical Progressive Alignment

Generally, multiple sequence alignment (MSA) procedures are either based on iterative methods or employ progressive techniques. Although progressive techniques are more efficient than iterative techniques, they are not suitable when the sequences are relatively diverse or the number of sequences is very large. Considering the fact that the pyroreads are highly similar, we develop a hierarchical progressive alignment procedure that is also computationally efficient for a large number of reads.

Progressive alignment techniques develop the final MSA by combining pair-wise alignments, beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a tree, called a guide tree, and a second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. In the following, we describe the low-complexity components of pyro-align.
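Before turning to those components, the semi-global placement step of Section 1.1 can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the implementation used in the paper: the function name `semi_global_align` and the scoring values (match +1, mismatch -1, linear gap -2) are hypothetical choices, since the scoring scheme is not specified in the text.

```python
def semi_global_align(read, ref, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch with free leading and trailing gaps (overlap alignment).

    Scoring values are illustrative assumptions, not taken from the paper.
    Returns (score, start, aligned_read, aligned_ref), where `start` is the
    number of leading gaps placed in front of the read.
    """
    m, n = len(read), len(ref)
    # First row and first column stay zero: leading gaps are not penalized.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sub,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)

    # Trailing gaps are free: take the maximum over the last row and column.
    ends = [(M[m][j], m, j) for j in range(n + 1)]
    ends += [(M[i][n], i, n) for i in range(m + 1)]
    score, i, j = max(ends)

    # Build both aligned strings backwards, starting with the free overhang.
    out_read = list(reversed(read[i:] + "-" * (n - j)))
    out_ref = list(reversed("-" * (m - i) + ref[j:]))
    while i > 0 and j > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + sub:
            out_read.append(read[i - 1]); out_ref.append(ref[j - 1])
            i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            out_read.append(read[i - 1]); out_ref.append("-")
            i -= 1
        else:
            out_read.append("-"); out_ref.append(ref[j - 1])
            j -= 1
    # Whatever remains on the first row or column is a free leading overhang.
    out_read.extend(reversed("-" * j + read[:i]))
    out_ref.extend(reversed(ref[:j] + "-" * i))

    aligned_read = "".join(reversed(out_read))
    aligned_ref = "".join(reversed(out_ref))
    start = len(aligned_read) - len(aligned_read.lstrip("-"))
    return score, start, aligned_read, aligned_ref


score, start, aligned_read, aligned_ref = semi_global_align("GATTA", "CCGATTACAGG")
print(start, aligned_read)
```

The returned `start` value plays the role of the read's starting position, which is the quantity the reordering step relies on.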
2.1 Reordering Reads

Most progressive multiple alignment algorithms first compute a quick similarity measure based on k-mer counting [4] or some other heuristic mechanism. These pair-wise similarity measures (distances) are tabulated in matrix form, and a tree is constructed from this distance matrix using UPGMA or neighbor joining. The progressive alignment is then built following the branching order of the tree, giving a multiple alignment. These steps require $O(N^2)$ time each, where $N$ is the number of reads. To reduce this complexity, we exploit the fact that the reads come from the same, or nearly the same, reference. This in turn implies that reads starting from the same or nearly the same 'starting' point with respect to the reference genome are likely to be similar to each other. Therefore, we already have the ordering information, or the 'guide tree', from the first step of the algorithm. Our guide tree, that is, the order in which sequences will be aligned in the progressive alignment, is derived from the starting positions of the reads obtained in the first stage. Of course, the decomposition of the reads (the subtree of the profiles that we build) does not render the reads in the same order as in traditional progressive alignment, but the order is nevertheless more or less the same when the profiles of these reads are aligned.

Let there be $N$ reads $R = R_1, R_2, \cdots, R_N$ generated by the pyrosequencing technique from a reference genome of length $L_g$. Also, let the length of the $p$-th read be denoted by $L(R)_p$. After executing semi-global alignment using the algorithm discussed in the previous section, let each read be represented by $R_{pq}$, where the $p$-th read has $q$ leading gaps and $L_g - q - L(R)_p$ trailing gaps. The reordering algorithm then reorders the reads using the information from the leading gaps such that the read $R_{pq}$ comes 'before' $R_{p(q+1)}$ in the ordering, $\forall p, q \in L_g$; that is, the reads are sorted by increasing number of leading gaps.

To execute the reordering in an efficient manner, we employ hash tables that speed up the search process.
We create two hash tables: hash table 1 uses the FASTA sequence tag as the hash key and stores the corresponding starting position of the read; hash table 2 stores the read names (FASTA sequence tags) and the DNA sequences they are associated with. Using these tables, the reads are reordered in the database in linear time.

2.2 Pair-wise and Profile-Profile Alignments

The ordering of the reads determined in the preceding step is now used to conduct the progressive alignment. Traditional progressive alignment requires that the sequences most similar to each other are aligned first. Thereafter, sequences are added one by one to the multiple alignment, in an order determined by some similarity metric. This sequential addition of sequences is not suitable for a large number of sequences. In order to devise a low-complexity system, we design a hierarchical progressive alignment procedure that is based on domain decomposition [1], as described below and depicted in Figure 2.

First, pair-wise alignment using the standard Needleman-Wunsch algorithm is executed on each overlapping pair of reads (the ordering is still the same as discussed in the previous section). After this stage, the reads are aligned in pairs such that we have $N/2$ pairs of aligned reads. These $N/2$ pairs of reads are then used for profile alignments as discussed below.

Profile-profile alignments are used to re-align two or more existing alignments (in our case, the pairs of aligned reads). They are useful for two reasons: first, the user may want to add sequences gradually; second, the user may want to keep one high-quality profile fixed and keep adding sequences aligned to that fixed profile [17]. We take advantage of both of these properties in our domain decomposition.
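The reordering of Section 2.1 and the initial pairing of neighbouring reads can be sketched as follows. The two dictionaries mirror the two hash tables described above; the function names `reorder_reads` and `adjacent_pairs` are hypothetical, and the use of bucketing by start position is an assumption about one way the stated linear running time could be reached, not a description of the authors' code.

```python
def reorder_reads(start_by_tag, seq_by_tag, genome_length):
    """Order reads by start position using one bucket per genome position.

    start_by_tag: hash table 1, FASTA tag -> start position (leading gaps)
    seq_by_tag:   hash table 2, FASTA tag -> DNA sequence
    Runs in O(N + L_g) time for N reads and genome length L_g.
    """
    buckets = [[] for _ in range(genome_length + 1)]
    for tag, start in start_by_tag.items():
        buckets[start].append(tag)
    # Reads sharing a start position keep their original relative order.
    return [(tag, seq_by_tag[tag]) for bucket in buckets for tag in bucket]


def adjacent_pairs(ordered_reads):
    """Group consecutive reads into pairs; a final unpaired read stays alone."""
    return [ordered_reads[k:k + 2] for k in range(0, len(ordered_reads), 2)]


start_by_tag = {"r1": 7, "r2": 0, "r3": 3, "r4": 3}
seq_by_tag = {"r1": "TTGCA", "r2": "ACGTA", "r3": "TACCG", "r4": "TACGG"}

ordered = reorder_reads(start_by_tag, seq_by_tag, genome_length=12)
pairs = adjacent_pairs(ordered)
print([tag for tag, _ in ordered])
```

Because the ordering already serves as the guide tree, no pair-wise distance matrix has to be computed before the pairs are handed to the alignment step.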
In this stage of the algorithm, the $N/2$ pairs of aligned reads have to be combined to get a multiple alignment. We have shown in [21] that the decomposition of the profiles gives a fair time advantage even on a single processor. Therefore, a hierarchical model similar to [1] is implemented (see Fig. 2). The model requires that, instead of combining the profiles in a sequential manner (one by one), a binary tree is built such that the profiles to be aligned are the leaves of the tree.

Fig. 2. Hierarchical profile-profile alignments for pyro-align are shown

Fig. 3. Two profiles (X and Y) are aligned under the column constraints, producing profile Z

In order to apply pair-wise alignment functions to profiles, a scoring function must be defined, similar to the substitution methods defined for pair-wise alignments. One of the most commonly used profile functions is the sequence-weighted sum of substitution matrix scores for each pair of amino acid letters. Let $i$ and $j$ be amino acid types, $p_i$ the background probability of $i$, $p_{ij}$ the joint probability of $i$ and $j$ aligned to each other, $S_{ij}$ the substitution matrix being used, $f^x_i$ the observed frequency of $i$ in column $x$ of the first profile, and $x_G$ the observed frequency of gaps in that column. The same attributes are assumed for the profile $y$. Profile sum of pairs (PSP) is the function used in ClustalW [17], MAFFT [23] and MUSCLE [19] to maximize the sum-of-pairs (SP) score, which in turn maximizes the alignment score such that the columns in the profiles are preserved, as depicted in Fig. 3. The PSP score can be defined as in [24] and [19]:

$$S_{ij} = \log\left(\frac{p_{ij}}{p_i\, p_j}\right) \tag{1}$$

$$PSP^{xy} = \sum_i \sum_j f^x_i\, f^y_j \log\left(\frac{p_{ij}}{p_i\, p_j}\right) \tag{2}$$

Fig. 4. The final alignment of the reads

For our purposes, we take advantage of PSP functions based on the 200 PAM matrix [25] and the 240 PAM VTML matrix [26]. Some multiple alignment methods implement different scoring functions, such as log-expectation (LE) functions, but for our purposes PSP scoring suffices. Profile functions have evolved to be quite complex, and a good discussion of them can be found in [19] and [27].
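As an illustration of equation (2) and of the binary-tree combination of profiles, the following sketch computes a PSP column score and merges a list of profiles level by level. Everything concrete in it is an assumption made for the example: the toy nucleotide probabilities stand in for the PAM and VTML matrices cited above, gap frequencies are ignored, and `merge` is a placeholder for a real profile-profile aligner.

```python
import math


def psp_score(fx, fy, p_joint, p_bg):
    """PSP score of two profile columns, following equation (2).

    fx, fy:  dicts mapping a letter to its observed frequency in the column
    p_joint: dict mapping (i, j) to the joint probability p_ij
    p_bg:    dict mapping a letter to its background probability p_i
    """
    total = 0.0
    for i, fxi in fx.items():
        for j, fyj in fy.items():
            total += fxi * fyj * math.log(p_joint[(i, j)] / (p_bg[i] * p_bg[j]))
    return total


def merge_hierarchically(profiles, merge):
    """Combine profiles pairwise, level by level, as in a binary tree
    whose leaves are the input profiles."""
    level = list(profiles)
    if not level:
        raise ValueError("at least one profile is required")
    while len(level) > 1:
        merged = [merge(level[k], level[k + 1])
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd one out moves up unchanged
            merged.append(level[-1])
        level = merged
    return level[0]


# Toy model (assumed values): uniform background, matches favoured.
letters = "ACGT"
p_bg = {a: 0.25 for a in letters}
p_joint = {(a, b): (0.2 if a == b else 0.2 / 12) for a in letters for b in letters}

same = psp_score({"A": 1.0}, {"A": 1.0}, p_joint, p_bg)
diff = psp_score({"A": 1.0}, {"C": 1.0}, p_joint, p_bg)
order = merge_hierarchically(["a", "b", "c", "d", "e"], lambda x, y: f"({x}{y})")
print(same > 0 > diff, order)
```

With $k$ leaf profiles the tree has about $\log_2 k$ levels, and the merges within one level are independent of each other, which is what makes the decomposition attractive for parallel execution.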
We use the profile functions from the ClustalW system. The final alignment from the pyro-align algorithm can be seen in Fig. 4. The different steps of the proposed pyro-align algorithm are outlined below.

    Input: Reads generated from the pyrosequencing procedure and the reference genome
    Output: A multiple alignment of the reads is returned
    // Calculate the overlap of each of the reads with respect to the reference genome
    for (i = 1; i ≤ N; i++) do
        Overlapped-Reads ← Semi-Global-Alignment(R_i, Genome);
    end
    Reordered-Reads ← Reordering(Overlapped-Reads);
    // Pairwise alignment using standard Needleman-Wunsch is executed for pairs of ordered reads
    Pair-wise-aligned ← Needleman-Wunsch(Reordered-Reads);
    // Profile-profile alignment is obtained using the Sample-align-D strategy
    Final-Alignment ← Profile-Profile-alignment(Pair-wise-aligned);
    return Final-Alignment;

Algorithm 1: Steps of the proposed multiple sequence alignment algorithm pyro-align

3 Performance Analysis

As discussed earlier in the paper, the exact solution for multiple alignment is not feasible and heuristics are employed. Most of these heuristics perform well in practice, but there is generally no theoretical justification possible for them [9]. For pyro-align it can be shown that the semi-global alignment of the reads with the reference genome is analogous to center star alignment. The center star alignment is shown to give results within a factor of 2 of the optimal alignment in the worst case [9], and the same can be expected from the semi-global alignment of reads with the reference genome. The accuracy of the later stages is confirmed by the rigorous quality assessment procedure described in the section below.

3.1 Experimental Setup and Quality Assessment

The performance evaluation of the algorithm has been carried out on a single desktop system with 2x Quad Core Intel 5355 2.66 GHz processors, 2x 4 MB cache and 16 GB of RAM. The operating system on the desktop is Red Hat Linux with kernel 2.6.18-92.1.13.el5. The software uses libraries from BioJava [7] and is built using Java version "1.6.0", Java(TM) SE Runtime Environment, IBM J9 VM.
To investigate the quality of the alignment produced by the algorithm, we used the ReadSim simulator [11] to generate the reads. The quality assessment of multiple alignments is generally carried out using benchmarks such as Prefab [18] or BAliBASE [6]. However, these benchmarks are not designed to assess the quality of aligned reads produced from pyrosequencing, and there are no benchmarks available specifically for these reads. Therefore, a system has to be developed to assess the quality of the aligned reads. The experimental setup for the quality assessment of the alignment procedure is shown in Fig. 5 and is explained below.

Our quality assessment has two objectives: (1) to assess the quality of the alignment produced by pyro-align with respect to the original genome, and (2) to ensure that the system is able to handle reads from multiple haplotypes.

To achieve these objectives, we set up the quality assessment system as shown in Fig. 5. We used the pol gene of the HIV virus, with a length of 1970 bp, as the wildtype for the experiments. The wildtype is then used to produce 4 sets of genomes, randomly mutated at different rates: Dist-003, Dist-005, Dist-007 and Dist-010, with mutation rates of 3%, 5%, 7% and 10%, respectively. Using the mutated genomes, 2000 and 5000 reads were generated with ReadSim using standard ReadSim parameters and forward orientation.

The reads generated from these mutated genomes were then aligned with the wildtype. This procedure is adopted because scientists generally have only a wildtype of the microbial genome available, and it therefore depicts a more practical scenario.

After the alignment, a majority consensus of the reads is obtained. A distance-based similarity is then calculated between the consensus obtained from the aligned reads and the original genome from which the reads were generated. The alignment results and the accuracy of the consensus thus obtained are shown in Fig. 6 and Fig. 7 for 2000 and 5000 reads, respectively.
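The consensus-based check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text does not define the distance-based similarity, so simple per-position identity is used here, gap characters are treated as missing coverage rather than as votes, and the function names are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length aligned reads.

    Gap characters do not vote; a column covered by no read yields a gap.
    Ties are broken by first occurrence, which is an arbitrary choice.
    """
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for col in range(length):
        counts = Counter(r[col] for r in aligned_reads if r[col] != gap)
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


reads = ["ACGT--",
         "-CGTA-",
         "--GTAC",
         "ACCT--"]          # one read carries an error in column 3
consensus = majority_consensus(reads)
print(consensus, identity(consensus, "ACGTAC"))
```

In this toy case the single erroneous base is outvoted, which is the effect the quality assessment relies on: higher coverage gives each column more votes and hence a more reliable consensus.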
Fig. 5. The experimental setup for the quality assessment of the multiple alignment program

We compare the accuracy of the algorithm with two different methods. The first is simple pair-wise alignment of the reads with the reference genome. The second is a sequential gap propagation method used in recent pyrosequencing systems [12]. Simply put, the gap propagation method builds a multiple alignment from pair-wise alignments by sequentially 'propagating' the gaps from each pair-wise alignment to all the reads in the system. Propagation of gaps is carried out for every position where at least one read has an inserted base: a gap is inserted in the reference genome and, consequently, in all reads that overlap the genome at that position. The complexity of the procedure is of the order of $O(N^2)$.

The accuracy of the consensus obtained using just the pair-wise alignment is less than 55%, whereas that obtained from pyro-align is always greater than 96%. An even better alignment quality is achieved for a greater number of reads, because more reads provide better coverage for a genome of given length. The accuracy of the gap propagation procedure is comparable to that of pyro-align for small mutation rates, but as the mutation rate increases the accuracy of the gap-propagation-based method decreases.

To illustrate that the alignment system also works with a 'mixture' of reads from different haplotypes, we use the mutated reads from Dist-003, Dist-005 and Dist-007 to generate a new set of reads. The new set contains an equal number of reads from the mutated sets, e.g., 2000 reads from each mutated genome for the results shown. The reads are then aligned by the pyro-align algorithm using the wildtype as the reference genome. The results of alignment for this mixture set are shown in Fig. 8 for the Dist-003/Dist-005 and Dist-005/Dist-007 mixtures. It must be noted here that we don't have a 'ground