ebook img

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition PDF

17 Pages·2015·1.57 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition Koslicki, D., Chatterjee, S., Shahrivar, D., Walker, A. W., Francis, S. C., Fraser, L. J., ... & Corander, J. (2015). ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PLoS ONE, 10(10), e0140644. doi:10.1371/journal.pone.0140644 10.1371/journal.pone.0140644 Public Library of Science Version of Record http://cdss.library.oregonstate.edu/sa-termsofuse RESEARCHARTICLE ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition DavidKoslicki1,SaikatChatterjee2*,DamonShahrivar2,AlanW.Walker3,Suzanna C.Francis4,LouiseJ.Fraser5,MikkoVehkaperä6,YuehengLan7,JukkaCorander8 1DeptofMathematics,OregonStateUniversity,Corvallis,UnitedStatesofAmerica,2Deptof CommunicationTheory,KTHRoyalInstituteofTechnology,Stockholm,Sweden,3MicrobiologyGroup, RowettInstituteofNutritionandHealth,UniversityofAberdeen,Aberdeen,UnitedKingdom,4MRCTropical a11111 EpidemiologyGroup,LondonSchoolofHygieneandTropicalMedicine,London,UnitedKingdom,5Illumina CambridgeLtd.,ChesterfordResearchPark,Essex,UnitedKingdom,6DeptofElectronicandElectrical Engineering,UniversityofSheffield,Sheffield,UnitedKingdom,7DeptofPhysics,TsinghuaUniversity, Beijing,China,8DeptofMathematicsandStatistics,UniversityofHelsinki,Helsinki,Finland *[email protected] OPENACCESS Abstract Citation:KoslickiD,ChatterjeeS,ShahrivarD, WalkerAW,FrancisSC,FraserLJ,etal.(2015)ARK: AggregationofReadsbyK-MeansforEstimationof BacterialCommunityComposition.PLoSONE10 Motivation (10):e0140644.doi:10.1371/journal.pone.0140644 Estimationofbacterialcommunitycompositionfromhigh-throughputsequenced16SrRNA Editor:JonathanH.Badger,NationalCancer geneampliconsisakeytaskinmicrobialecology.Sincethesequencedatafromeachsam- Institute,UNITEDSTATES pletypicallyconsistofalargenumberofreadsandareadverselyimpactedbydifferentlev- Received:April20,2015 elsofbiologicalandtechnicalnoise,accurateanalysisofsuchlargedatasetsis Accepted:September28,2015 challenging. Published:October23,2015 Results Copyright:Thisisanopenaccessarticle,freeofall copyright,andmaybefreelyreproduced,distributed, Therehasbeenarecentsurgeofinterestinusingcompressedsensinginspiredandcon- transmitted,modified,builtupon,orotherwiseused vex-optimizationbasedmethodstosolvetheestimationproblemforbacterialcommunity byanyoneforanylawfulpurpose.Theworkismade composition.Thesemethodstypicallyrelyonsummarizingthesequencedatabyfrequen- availableundertheCreativeCommonsCC0public domaindedication. ciesoflow-orderk-mersandmatchingthisinformationstatisticallywithataxonomically structureddatabase.Hereweshowthattheaccuracyoftheresultingcommunitycomposi- DataAvailabilityStatement:Thedataand programsareavailablefromGitHub(https://github. tionestimatescanbesubstantiallyimprovedbyaggregatingthereadsfromasamplewith com/dkoslicki/ARK)orfromhere(http://www.ee.kth. anunsupervisedmachinelearningapproachpriortotheestimationphase.Theaggregation se/ctsoftware).Furtherrealbiologicaldatathatwe ofreadsisapre-processingapproachwhereweuseastandardK-meansclusteringalgo- usedforexperimenthavebeensubmittedtothe EuropeanNucleotideArchiveusingtheaccession rithmthatpartitionsalargesetofreadsintosubsetswithreasonablecomputationalcostto numberPRJEB9828. provideseveralvectorsoffirstorderstatisticsinsteadofonlysinglestatisticalsummariza- Funding:ThisworkwassupportedbytheSwedish tionintermsofk-merfrequencies.Theoutputoftheclusteringisthenprocessedfurtherto ResearchCouncilLinnaeusCentreACCESS(S.C.), obtainthefinalestimateforeachsample.TheresultingmethodiscalledAggregationof ERCgrant239784(J.C.),theAcademyofFinland ReadsbyK-means(ARK),anditisbasedonastatisticalargumentviamixturedensityfor- CenterofExcellenceCOIN(J.C.),theAcademyof Finland(M.V.),theScottishGovernment’sRuraland mulation.ARKisfoundtoimprovethefidelityandrobustnessofseveralrecentlyintroduced EnvironmentScienceandAnalyticalServices methods,withonlyamodestincreaseincomputationalcomplexity. PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 1/16 ARK Division(RESAS)(A.W.W),andtheUKMRC/DFID Availability grantG1002369(S.C.F).L.J.F.receivedfundingin Anopensource,platform-independentimplementationofthemethodintheJuliaprogram- theformofsalaryfromIlluminaCambridgeLtd.The fundershadnoroleinstudydesign,datacollection minglanguageisfreelyavailableathttps://github.com/dkoslicki/ARK.AMatlabimplementa- andanalysis,decisiontopublish,orpreparationof tionisavailableathttp://www.ee.kth.se/ctsoftware. themanuscript. CompetingInterests:L.J.F.receivedfundinginthe formofsalaryfromIlluminaCambridgeLtd.Thisdoes notaltertheauthors’adherencetoallthePLOSONE policiesonsharingdataandmaterials. Introduction Theadventofhigh-throughputsequencingtechnologieshasenableddetectionofbacterial communitycompositionatanunprecedentedlevelofdetail.Atechnologicalapproachisto produceforeachsamplealargenumberofreadsfromampliconsofthe16SrRNAgene,which enablesanidentificationandcomparisonoftherelativefrequenciesofdifferenttaxonomic unitspresentacrosssamples.Therapidlyincreasingnumberofreadsproducedpersample resultsintheneedforfasttaxonomicclassificationofsamples.Thisproblemhasattractedcon- siderablerecentattention[1–5]. Manyexistingapproachestothebacterialcommunitycompositionestimationproblemuse 16SrRNAgeneampliconsequencingwherealargeamountofmoderatelengthreads(around 250–500bp)areproducedfromeachsampleandthengenerallyeitherclusteredorclassifiedto obtainacompositionestimateoftaxonomicunits.Intheclusteringapproach,readsare groupedintotaxonomicunitsbyeitherdistance-basedorprobabilisticmethods[6–8],such thattheactualtaxonomiclabelsareassignedtotheclustersafterwardsbymatchingtheircon- sensussequencestoareferencedatabase.Incontrasttotheclusteringmethods,theclassifica- tionapproachisbasedonusingareferencedatabasedirectlytoassignreadstomeaningful biologicalunits.Methodsfortheclassificationofreadshavebeenbasedeitheronhomology usingsequencesimilarity,orongenomicsignaturesintermsofk-mercomposition.Examples ofhomology-basedmethodsincludeMEGAN[9,10]andphylogeneticanalysis[11].Another popularapproachistouseaBayesianclassifier[1,12,13].Onesuchmethod,theRibosomal DatabaseProject’s(RDP)naïveBayesianclassifier(NBC)[1],assignsalabelexplicitlytoeach readproducedforaparticularsample.DespitethemethodologicalsimplicityofNBC,theRDP classifiermaystillrequireseveraldaystoprocessalargedatasetinadesktopenvironmentdue totheread-by-readclassificationapproach.Giventhischallenge,considerablyfasterestimation methodsbasedonmixturesofk-mercountshavebeendeveloped,forexample,Taxy[2], Quikr[3]andtherecentlyproposedSEK[14].Taxyisaconvex-optimizationbasedmethod. SEKandQuikraresparsesignalprocessingbasedmethods(inspiredbycompressedsensing andconvex-optimization),andSEKwasshowntoperformbetterthanQuikrandTaxyin[14]. Taxy,QuikrandSEKalluseastheirmaininputa(statistical)meanvectorofsamplek-mer countscomputedfromthereadsobtainedforasample.Thek-mercounts(alsocalledk-mers) arefeaturevectorsextractedfromrawsequencedata.Thenecessarymodelingassumptionis thatthesamplemeanvectorofk-mercounts(thatmeansfirstorderstatistics)issufficiently informativeaboutthesamplecomposition.Thesethreemethodsdonotusethereadsinany additionalwayoncethemeanvectorofk-mersiscomputed.Weproposehereanalternative basisofinformationaggregationthatremainscomputationallytractabletoallowprocessingof largesetsofreads.Borrowingideasfromsourcecodinginsignalprocessing[15,16],clustering inmachinelearningandsourcecoding[17],fusioninsignalestimation[18]anddivide-and- conquerbasedshotgunsequenceassembly[19],ournovelapproachfirstsegregatesthefullset ofreadsintosubsets(inthek-mersfeaturespace),computesthemeanvectorforeachsubset, employsastandardmethod(suchasTaxy,QuikrorSEK)toestimatecompositionforeach PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 2/16 ARK subset,andfinallyfusestheseestimatesintoacompositionestimatejointlyforallthereads.To segregatethereadsintosubsets,wechoosetoemploytheK-meansclusteringalgorithm[20]. SincetheK-meansclusteringalgorithmissimpleandcomputationallyinexpensiveforarea- sonablenumberQofclusters(subsets),itcanbeusedtopartitionevenfairlylargesetsofreads intomore(intra)homogeneoussubsets.Byitsveryalgorithmicnature,K-meansclustering partitionsthefeaturespaceintoQnon-overlappingregionsandprovidesasetofcorrespond- ingmeanvectors.Thisiscalledcodebookgenerationinvectorquantization[15],originally fromsignalprocessing,codingandclustering.OurnewmethodistermedasAggregationof ReadsbyK-means(ARK).Fromthestatisticalperspective,theoreticaljustificationofARK stemsfromamodelingframeworkwithamixtureofdensities. Methods Summarizingreadsequencedatabysinglemeank-mercounts Inthemethoddescription,wedenotethenon-negativereallinebyR andstatisticalexpecta- + tionoperatorbyE[.].First,wedescribethepreviouslypublishedapproachofusingsinglek- mersummariesforeachsample.Letx2R4k andC denoterandomk-merfeaturevectorsand þ m mthtaxonomicunit,respectively.Givenatestsetofk-mers(computedfromreads),thedistri- butionofthetestsetismodeledas X M pðxÞ¼ pðC Þ pðxjC Þ; ð1Þ m m m¼1 wherewedenoteprobabilityfortaxonomicunitm(orclassweight)byp(C ),satisfying P m M pðC Þ¼1.NotethatfpðC ÞgM isthecompositionoftaxonomicunitsinthegiventest m¼1 m m m¼1 set(reads).Theinferencetaskistoestimatep(C )asaccuratelyaspossiblewithareasonable m computationalresource.Letusderivethemeanvector Z Z Z X X M M E½x(cid:2)¼ xpðxÞdx¼ x pðC Þ pðxjC Þdx¼ pðC Þ x pðxjC Þdx: ð2Þ m m m m m¼1 m¼1 ThemeanE[x]containsinformationaboutp(C )inthisprobabilisticformulation.Inpractice, m theinformationsummaryisobtainedbycomputingthesamplemeanfromthecompletesetof readsavailableforasample.Letusdenotethesamplemeanofk-mersfeaturevectorsofreads byμ2R4k withtheassumptionthatμ(cid:3)E[x].Severalmethods,suchasTaxy[2],Quikr[3], þ andSEK[14]usethesamplemeanμdirectlyasthemaininputtocomputethecompositionp (C ). m AggregationofreadsbyK-means(ARK) Fortheabove-describedprincipleofinformationaggregationfromthereadsbythemeanvec- torofk-mercounts,computationofthesamplemeanvectorisstraightforward.Thisconse- quentlyenableshandlingofaverylargeamountofreadswithlowcomputationalcost. However,wehypothesizethatthesamplemeanvectorcomputedfromthefullsetofreadsis notsufficientintermsofinformationcontenttofacilitateaccurateestimationofp(C ).Indeed, m sincetypicallythenumberoftrainingtaxonomicunitsMismuchlargerthanthenumberofk- mers(forexamplek=6),thesetofk-mervectorsforfC gM isnotlinearlyindependent,and m m¼1 soweriskreconstructingamixtureoftaxonomicunitsasasingletaxonomicunit.Hence,we segregatethereadsintoseveralsubsetsandcomputeasamplemeanvectorseparatelyforeach subset,assumingthatasetofsamplemeanvectorsismoreinformativethanasinglemean PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 3/16 ARK vector.Notethatinthecasewheretheresultingreadsubsetswerenotinpracticedistinctfrom eachotherintermsoftheirk-mercounts,thesubsequentcompositionestimatewouldeffec- tivelybeidenticaltotheestimateobtainedwithasingledatasummarydescribedinEqs(1)and (2). Letuspartitionthek-mersfeaturespaceR4k intoQnon-overlappingregionsR suchthat þ q [Q R ¼R4k and8q,r,q6¼r,R \R =;.SuchpartitionscanbeformedbyastandardK- q¼1 q þ q r meansalgorithmthattypicallyusesanearestneighborclassificationrulebasedonsquare Euclideandistancemeasure.Thenon-overlappingregionsR arecalledVoronoiregions.We P q defineP ≜Pr(x2R )satisfying Q P ¼1.Inpractice,P iscomputedas q q q¼1 q q number of feature vectors in R P ¼ q: ð3Þ q total number of feature vectors Itisremindedthatthefeaturevectorsarek-mers.Thedistributionofthefulltestsetandsub- setscanbewrittenas P pðxÞ¼ Q P pðxjx2R Þ; q¼1 q q P ð4Þ pðxjx2R Þ¼ M pðC jx2R Þ pðxjC ;x2R Þ; q m¼1 m q m q wherethefirstequationfollowsastandardmixturedensityframework.Now,ifwecanestimate p(C jx2R ),thenthefinalquantityofinterestp(C )canbeestimatedas m q m X Q pðC Þ¼ P pðC jx2R Þ: ð5Þ m q m q q¼1 Theestimationofp(C )inEq(5)isajudiciousfusionofp(C jx2R )throughalinearcombi- m m q nation.LetusnowderivethemeanvectorforR ,whichisaconditionalmeanvector q E½xjx2R (cid:2) R q ¼ x pðxjx2R Þdx ð6Þ P q R ¼ M pðC jx2R Þ x pðxjC ;x2R Þ dx: m¼1 m q m q ThemeanE[xjx2R ]containsinformationaboutp(C jx2R ).Inpracticeweusethesam- q m q plemeandenotedbyμ withtheassumptionthatμ (cid:3)E[xjx2R ].ComparingEqs(2)and q q q (6),fortheqthVoronoiregionR wecanestimatecompositionp(C jx2R )byusingan q m q appropriatecompositionestimationmethod,suchasTaxy,QuikrorSEK. Algorithms TheARKalgorithmcanbeimplementedbyfollowingsteps. 1. Dividethefulltestdatasetofk-mersintoQsubsets.TheregionR correspondstotheqth q subset. 2. Fortheqthsubset,computeP andthesamplemeanμ . q q 3. Fortheqthsubset,applyacompositionestimationmethodthatusestheinputμ ;estimatep q (C jx2R ). m q P 4. Estimatep(Cm)bypðCmÞ¼ Qq¼1Pq pðCm jx2RqÞ. TheARKmethodisdescribedusingaflow-chartinFig1.Theflow-chartshowsthemain componentsoftheoverallsystemandtheassociatedoff-lineandon-linecomputations.The PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 4/16 ARK Fig1.Aflow-chartoftheARKmethod. doi:10.1371/journal.pone.0140644.g001 crucialcomputational/statisticalchallengesrelatedtotheARKalgorithmoutlinedaboveareas follows: 1. WhatisanappropriatenumberofsubsetsQ? 2. HowshouldoneformthesubsetsR ? q Theabovepointsareinherenttoanysubsetformingalgorithm,andmoregenerallytoany clusteringalgorithm.Furthermore,findingoptimalregions(orclusters)requiresalternative optimizationtechniques.Givenapre-definedQ,typicallyaK-meansalgorithmperformstwo PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 5/16 ARK alternatingoptimizationsteps.Theseare:(1)givenasetofrepresentationvectorsfμ gQ (also q q¼1 calledcodevectors)formnewclustersfR gQ byanearestneighborrule(orformnewsubsets q q¼1 fromthefulldataset),(2)findthesetofclusterrepresentationvectorsgiventheassignmentof dataintoclusters.TheoptimalrepresentationvectoristhemeanvectorifsquaredEuclidean distanceisusedforthenearestneighborrule.TheK-meansalgorithminitializeswithasetof representativevectorsandrunsalternatingoptimizationuntilconvergenceinthesensethat theaveragesquaredEuclideandistanceisnolongerreduced.Inthepresentpaperweperform theclusteringusingapopularvectorquantizationmethodcalledtheLinde-Buzo-Gray(LBG) algorithm[15](orsourcecodingliterature).ThereareseveralvariantsoftheLBGavailable.In onevariant,thealgorithmstartswithQ=1andthenslowlysplitsthedenseandhighprobabil- ityclusterstoendupwithahighQ,suchthatitdoesnotdeviatesignificantlyfromanexponen- tiallydecayingbitrateversuscodingdistortion(rate-distortion)curve. InARK,weusethefollowingtwostrategiestosolvethetwochallengeslistedabove. 1. Optimal/deterministicstrategy:StartwithQ=1,whichcorrespondstotheprevious approachwithasinglemeanvectorasthedatasummary.ThensetQ=2forLBGalgorithm thatusessquareEuclideandistanceasthedistortionmeasure;theLBGalgorithmminimizes meanofsquareEuclideandistance(alsocalledmeansquareerror).Initializationisdoneby astandardsplitapproachwherethemeanvectorisperturbed.UsingQ=2,fRqg2q¼1is formedandweestimatep(C ).Subsequently,Qisincreasedbyoneuntilaconvergencecri- m terionismet.ForQ(cid:4)3,wealwayssplitthehighestrankingclusterintotwosubclustersand usetheLBGalgorithmtofindtheoptimalclusters.ThenumberofclustersQisnolonger increasediftheestimatedvaluesofp(C )differnegligiblyforQand(Q−1).Inpractice,the m stoppingconditionweuseisthatthevariationaldistancebetweenp(Cm)jQandp(Cm)j(Q−1) islessthanapredeterminedthreshold.Thisconditioncanbewrittenas P (cid:2) (cid:3) Mm¼1abs pðCmÞjQ(cid:5)pðCmÞjðQ(cid:5)1Þ <Z,withauserdefinedchoiceofthethresholdη.Note thatη2(0,1]providesanallowablelimitasascaledvariationaldistance(VD)betweentwo probabilitymassfunctions;atypicalchoiceofηcanbe0.01.Thisstrategyistypicallyfound toprovideconsistentperformanceimprovementinthesenseofestimatingp(C )withthe m increaseinQbythestepofone,butwithoutabsoluteguaranteeasthetargetoptimization strategyminimizesmeansquareerror.Furthermore,weallowanincrementinthenumber ofclustersuptoapre-definedmaximumlimitQ .TypicallyQ ispreferablychosenas max max anintegerpoweroftwo.AtypicalchoiceofQ canbebetween16to256. max 2. Non-optimal/randomstrategy:Forverylargetestsets,weuseapre-determinedQanda randomchoiceoftheQrepresentationvectors.ThenthefulltestsetisdividedintoQsub- setsbyanearestneighborruleandwecomputethesetofQmeanvectors{μ },andcluster q probabilities{P }.Eventhoughthisnon-optimalstrategydoesnotuseanalternatingopti- q mization(suchasLBGalgorithm)toformoptimalclusters,itdividesthefulltestsetinto sub-sets,resultinginasetofQlocalizedmeanvectorsacrossthefulltestset. FinallywementionthattheuseofK-meansisfullymotivatedbyitssimplicityandcompu- tationalease.UseofstatisticalK-meansintheformofexpectation-maximizationbasedmix- turemodeling(forexample,Gaussianmixturemodel)couldhavebeeninvestigated,but requiresmorecomputationtohandlealargedatasetofreads. Syntheticdatagenerationformethodevaluation ToevaluatetheperformanceoftheARKmethod,weconductedexperimentsforsimulated dataasdescribedbelow.Forthese,andallcomputationsreportedintheremainderofthe PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 6/16 ARK paper,weusedMatlabversionR2013b(withsomeinstancesofCcode),onadesktopworksta- tionwithanIntelCorei74930Kprocessorand64GbofRAM. Testdatasets(Reads). Wesimulated18016SrRNAgene454-likedatasetsusingtheRDP trainingset7andtheGrinderreadsimulator[21]targetingtheV1–V2andV3–V5variable regionswithreadlengthsfixedat250bpornormallydistributedwithameanof450bpand variance50bp.Readdepthswerechosentobeeither10K,100Kor250K,whilethreedifferent readdistributionswereused:powerlaw,uniform,andlinear.Diversitywassetateither50, 100,or500taxaandchimerapercentagesweresetto5%or35%.TheBalzermodel[22]was chosenforhomopolymererrors,andcopybiaswasincludedwhilelengthbiaswasexcluded. Trainingdataset(Reference). InourARKexperimentsweusedQuikr[3]andSEK[14] toestimatep(C jx2R ).TheRDPtrainingset7wasusedasthebasereferencedatabasefor m q bothQuikrandSEK.NotethatthisisthesameasdatabaseD utilizedin[3].Whileinthe small mainmanuscriptweusethesamedataforbothtrainingandtestingthebasemethods(Quikr andSEK),inS1Fileweincluderesultsobtainedwhenthetestdatasetshavetaxaabsentfrom thetrainingdatabase(thatis,sistertaxahavebeenexcludedfromthetrainingdatabase).As expected,allmethodsexperiencealossinreconstructionaccuracywhensistertaxaareabsent, butARKQuikrandARKSEKarestillmoreaccuratethanRDP’sNBC. Realbiologicaldata TofurtherevaluateARK,wealsoutilized28IlluminaMiSeq16SrRNAgenehumanbody-site associatedsamples,plusonenegativecontrolsample.Therealdataconsistofatotalofover5.7 Mreadsdistributedoverthreevariableregions(V1–V2,V3–V4,andV3–V5)aswellastwo bodysites(vaginaandfeces). ForeachofthesesamplesDNAwasextractedusingtheFastDNASPINKitforSoilwitha FastPrepmachine(MPBiomedicals)followingthemanufacturer’sprotocol.16SrRNAgene ampliconsweregeneratedfromtheDNAextractionsusingtheprimercombinationslistedin Section5ofS1File.TheQ5High-fidelitypolymerasekit(NewEnglandBiolabs)wasusedto amplifythe16SrRNAgenes,andPCRconditionswereasfollows:98°Cfor2minutes,followed by20cyclesof98°Cfor30seconds,50°Cfor30secondsand72°Cfor1minute30seconds,fol- lowedbyafinalextensionstepat72°Cfor5minutes.FollowingPCR,theampliconswerethen purifiedusingtheWizardSVGelandPCRClean-Upkit(Promega,UK).Sequencingof16S rRNAgeneampliconswascarriedoutbyIlluminaInc.(LittleChesterford,UK)usingaMiSeq instrumentrunfor2x250(V1–V2),300+200(V3–V4)and400+200(V3–V5)cycles.These datahavebeensubmittedtotheEuropeanNucleotideArchiveusingtheaccessionnumber PRJEB9828. Aftertrimming20bpofprimeroffeachread,thesequencesweretrimmedfromtheright untilallbaseshadaqualityscoregreaterthan27.Thisreducedthetotalnumberofreadsto approximately4M,andreducedthemeanreadlengthfrom315bpto257bp.Wethenutilized allresultingunpairedreads(bothforwardandreverse)includinganyduplicatesequences.We includeinS1Fileresultsforanalternativeerror-correctionprotocol,aswellasresultsfor assemblingpaired-endreads(FigsEandFinS1File). EthicsStatement Forhumanbody-siteassociatedsamples,thefaecalsamplesusedwerenotpartofaclinical studysothereisnocorrespondingethicalapprovalorwrittenconsent.Therewerenoclinical records.Thesamplesareanonymisedandde-identified.Further,vaginalsampleswerecol- lectedaspartofanobservationalmicrobicidefeasibilitystudy.Thestudywasapprovedbythe EthicsCommitteesoftheNationalInstituteforMedicalResearchinTanzaniaandLondon PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 7/16 ARK SchoolofHygieneandTropicalMedicine,andallparticipantsgavewritteninformedconsent. Allrecordswereanonymizedandde-identifiedpriortothisretrospectiveanalysis. Results Performancemeasureandrelevantmethods Asaquantitativeperformancemeasure,weusevariationaldistance(VD)tocomparebetween knownproportionsoftaxonomicunitsp=[p(C ),p(C ),...,p(C )]tandtheestimatedpropor- 1 2 M tionsp^ ¼½p^ðC Þ; p^ðC Þ;...;p^ðC Þ(cid:2)t.TheVDisdefinedas 1 2 M VD¼0:5(cid:6)kp(cid:5)p^k 2½0;1(cid:2): 1 AlowVDindicatesmoresatisfactoryperformance. ForARK,weusedbothSEKandQuikrastheunderlyingestimationmethodsappliedto eachcluster.Theserecentmethodswerechosenasappropriaterepresentativesoffastandaccu- ratesparsesignalprocessingapproaches.Ak-mersizeofk=6wasusedforbothQuikrand SEK. AspartoftheSEKpipeline,sequencesinagivendatabasearesplitintosubsequences.We selectedfromthe10,046sequencesintheRDPtrainingset7allsequenceslongerthan700bp inlength,andthensplitthesequencesintosubsequencesoflength400bpwith100bpofover- lap.ThiscorrespondstosettingL =400andL =100asspecifiedin[14].WeusedtheSEK w p algorithmOMPþ;1withparametersasin[14]. sek ResultsforSimulatedData Effectofincreasingnumberofclusters. Wefirstinvestigatehowanincreaseinthenum- berofclustersQaffectsthecompositionreconstructionfidelityandalgorithmexecutiontime forthesimulateddata.Onlythenon-optimal/randomstrategyofK-meansclusteringwasuti- lizedaswefoundthattheperformanceimprovementforoptimal/deterministicstrategywas insignificantgiventheresultingincreaseinexecutiontime(resultsnotshown).Averagingthe VDerroratthegenusleveloverall180simulatedexperiments,itwasfoundthatcombining ARKwithbothSEKandQuikrresultedinapowerlawkindofdecayofVDerrorasafunction ofthenumberofclusters(Fig2).ARKcausesasubstantialincreaseinreconstructionfidelity whichcanbeseensinceusingARKSEKorARKQuikrwithoneclusterisequivalenttorun- ningSEKorQuikrwithnomodification. Sincetheunderlyingalgorithm(SEKorQuikr)mustbeexecutedoneachclusterformedby theK-meansclustering,weexpectthetotalalgorithmexecutiontimetoincreasebyafactor equaltothenumberofchosenclusters.AsseeninFig3,bothalgorithmsexperiencean increaseinexecutiontimeroughlyproportionaltothenumberofclusters. Fixednumberofclusters. Asseenabove,giventhedecreaseinVDasafunctionofthe numberofclusters,wealsofixedthenumberofclustersQto75tocomparetheperformance oftheunderlyingalgorithmswithandwithoutARK.Therewasasignificantdecreaseinthe VDerror(asseeninFig4)atthecostofanincreaseinexecutiontime(asseeninFig5).How- ever,giventhespeedofbothQuikrandSEK,weexpecttheadditionofARKwillnotresultin prohibitivelylongexecutiontimes.Indeed,asseenabove,onrealbiologicaldatabothARK QuikrandARKSEKarestillseveralhoursfasterthantheRibosomalDatabaseProject’sNaïve BayesianClassifier(RDP’sNBC)[1],evenwhenusing75clusters. PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 8/16 ARK Fig2.ResultsfortherandomK-meansclusteringonthesimulateddata.MeanVDerroratthegenus levelasafunctionofthenumberofclusters.NotetheimprovementthatARKcontributestoeachmethod. doi:10.1371/journal.pone.0140644.g002 Fig3.ResultsfortherandomK-meansclusteringonthesimulateddata.Meanexecutiontimeincrease (factorgivenincomparisontorunningSEKorQuikrintheabsenceofARK)asafunctionofnumberof clusters.Thedashedlinerepresentsalinewithslope1. doi:10.1371/journal.pone.0140644.g003 PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 9/16

Description:
els of biological and technical noise, accurate analysis of such large using sequence similarity, or on genomic signatures in terms of k-mer SEK and Quikr are sparse signal processing based methods (inspired by .. Ethics Committees of the National Institute for Medical Research in Tanzania and
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.