ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition Koslicki, D., Chatterjee, S., Shahrivar, D., Walker, A. W., Francis, S. C., Fraser, L. J., ... & Corander, J. (2015). ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PLoS ONE, 10(10), e0140644. doi:10.1371/journal.pone.0140644 10.1371/journal.pone.0140644 Public Library of Science Version of Record http://cdss.library.oregonstate.edu/sa-termsofuse RESEARCHARTICLE ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition DavidKoslicki1,SaikatChatterjee2*,DamonShahrivar2,AlanW.Walker3,Suzanna C.Francis4,LouiseJ.Fraser5,MikkoVehkaperä6,YuehengLan7,JukkaCorander8 1DeptofMathematics,OregonStateUniversity,Corvallis,UnitedStatesofAmerica,2Deptof CommunicationTheory,KTHRoyalInstituteofTechnology,Stockholm,Sweden,3MicrobiologyGroup, RowettInstituteofNutritionandHealth,UniversityofAberdeen,Aberdeen,UnitedKingdom,4MRCTropical a11111 EpidemiologyGroup,LondonSchoolofHygieneandTropicalMedicine,London,UnitedKingdom,5Illumina CambridgeLtd.,ChesterfordResearchPark,Essex,UnitedKingdom,6DeptofElectronicandElectrical Engineering,UniversityofSheffield,Sheffield,UnitedKingdom,7DeptofPhysics,TsinghuaUniversity, Beijing,China,8DeptofMathematicsandStatistics,UniversityofHelsinki,Helsinki,Finland *[email protected] OPENACCESS Abstract Citation:KoslickiD,ChatterjeeS,ShahrivarD, WalkerAW,FrancisSC,FraserLJ,etal.(2015)ARK: AggregationofReadsbyK-MeansforEstimationof BacterialCommunityComposition.PLoSONE10 Motivation (10):e0140644.doi:10.1371/journal.pone.0140644 Estimationofbacterialcommunitycompositionfromhigh-throughputsequenced16SrRNA Editor:JonathanH.Badger,NationalCancer geneampliconsisakeytaskinmicrobialecology.Sincethesequencedatafromeachsam- Institute,UNITEDSTATES pletypicallyconsistofalargenumberofreadsandareadverselyimpactedbydifferentlev- Received:April20,2015 elsofbiologicalandtechnicalnoise,accurateanalysisofsuchlargedatasetsis Accepted:September28,2015 challenging. Published:October23,2015 Results Copyright:Thisisanopenaccessarticle,freeofall copyright,andmaybefreelyreproduced,distributed, Therehasbeenarecentsurgeofinterestinusingcompressedsensinginspiredandcon- transmitted,modified,builtupon,orotherwiseused vex-optimizationbasedmethodstosolvetheestimationproblemforbacterialcommunity byanyoneforanylawfulpurpose.Theworkismade composition.Thesemethodstypicallyrelyonsummarizingthesequencedatabyfrequen- availableundertheCreativeCommonsCC0public domaindedication. ciesoflow-orderk-mersandmatchingthisinformationstatisticallywithataxonomically structureddatabase.Hereweshowthattheaccuracyoftheresultingcommunitycomposi- DataAvailabilityStatement:Thedataand programsareavailablefromGitHub(https://github. tionestimatescanbesubstantiallyimprovedbyaggregatingthereadsfromasamplewith com/dkoslicki/ARK)orfromhere(http://www.ee.kth. anunsupervisedmachinelearningapproachpriortotheestimationphase.Theaggregation se/ctsoftware).Furtherrealbiologicaldatathatwe ofreadsisapre-processingapproachwhereweuseastandardK-meansclusteringalgo- usedforexperimenthavebeensubmittedtothe EuropeanNucleotideArchiveusingtheaccession rithmthatpartitionsalargesetofreadsintosubsetswithreasonablecomputationalcostto numberPRJEB9828. provideseveralvectorsoffirstorderstatisticsinsteadofonlysinglestatisticalsummariza- Funding:ThisworkwassupportedbytheSwedish tionintermsofk-merfrequencies.Theoutputoftheclusteringisthenprocessedfurtherto ResearchCouncilLinnaeusCentreACCESS(S.C.), obtainthefinalestimateforeachsample.TheresultingmethodiscalledAggregationof ERCgrant239784(J.C.),theAcademyofFinland ReadsbyK-means(ARK),anditisbasedonastatisticalargumentviamixturedensityfor- CenterofExcellenceCOIN(J.C.),theAcademyof Finland(M.V.),theScottishGovernment’sRuraland mulation.ARKisfoundtoimprovethefidelityandrobustnessofseveralrecentlyintroduced EnvironmentScienceandAnalyticalServices methods,withonlyamodestincreaseincomputationalcomplexity. PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 1/16 ARK Division(RESAS)(A.W.W),andtheUKMRC/DFID Availability grantG1002369(S.C.F).L.J.F.receivedfundingin Anopensource,platform-independentimplementationofthemethodintheJuliaprogram- theformofsalaryfromIlluminaCambridgeLtd.The fundershadnoroleinstudydesign,datacollection minglanguageisfreelyavailableathttps://github.com/dkoslicki/ARK.AMatlabimplementa- andanalysis,decisiontopublish,orpreparationof tionisavailableathttp://www.ee.kth.se/ctsoftware. themanuscript. CompetingInterests:L.J.F.receivedfundinginthe formofsalaryfromIlluminaCambridgeLtd.Thisdoes notaltertheauthors’adherencetoallthePLOSONE policiesonsharingdataandmaterials. Introduction Theadventofhigh-throughputsequencingtechnologieshasenableddetectionofbacterial communitycompositionatanunprecedentedlevelofdetail.Atechnologicalapproachisto produceforeachsamplealargenumberofreadsfromampliconsofthe16SrRNAgene,which enablesanidentificationandcomparisonoftherelativefrequenciesofdifferenttaxonomic unitspresentacrosssamples.Therapidlyincreasingnumberofreadsproducedpersample resultsintheneedforfasttaxonomicclassificationofsamples.Thisproblemhasattractedcon- siderablerecentattention[1–5]. Manyexistingapproachestothebacterialcommunitycompositionestimationproblemuse 16SrRNAgeneampliconsequencingwherealargeamountofmoderatelengthreads(around 250–500bp)areproducedfromeachsampleandthengenerallyeitherclusteredorclassifiedto obtainacompositionestimateoftaxonomicunits.Intheclusteringapproach,readsare groupedintotaxonomicunitsbyeitherdistance-basedorprobabilisticmethods[6–8],such thattheactualtaxonomiclabelsareassignedtotheclustersafterwardsbymatchingtheircon- sensussequencestoareferencedatabase.Incontrasttotheclusteringmethods,theclassifica- tionapproachisbasedonusingareferencedatabasedirectlytoassignreadstomeaningful biologicalunits.Methodsfortheclassificationofreadshavebeenbasedeitheronhomology usingsequencesimilarity,orongenomicsignaturesintermsofk-mercomposition.Examples ofhomology-basedmethodsincludeMEGAN[9,10]andphylogeneticanalysis[11].Another popularapproachistouseaBayesianclassifier[1,12,13].Onesuchmethod,theRibosomal DatabaseProject’s(RDP)naïveBayesianclassifier(NBC)[1],assignsalabelexplicitlytoeach readproducedforaparticularsample.DespitethemethodologicalsimplicityofNBC,theRDP classifiermaystillrequireseveraldaystoprocessalargedatasetinadesktopenvironmentdue totheread-by-readclassificationapproach.Giventhischallenge,considerablyfasterestimation methodsbasedonmixturesofk-mercountshavebeendeveloped,forexample,Taxy[2], Quikr[3]andtherecentlyproposedSEK[14].Taxyisaconvex-optimizationbasedmethod. SEKandQuikraresparsesignalprocessingbasedmethods(inspiredbycompressedsensing andconvex-optimization),andSEKwasshowntoperformbetterthanQuikrandTaxyin[14]. Taxy,QuikrandSEKalluseastheirmaininputa(statistical)meanvectorofsamplek-mer countscomputedfromthereadsobtainedforasample.Thek-mercounts(alsocalledk-mers) arefeaturevectorsextractedfromrawsequencedata.Thenecessarymodelingassumptionis thatthesamplemeanvectorofk-mercounts(thatmeansfirstorderstatistics)issufficiently informativeaboutthesamplecomposition.Thesethreemethodsdonotusethereadsinany additionalwayoncethemeanvectorofk-mersiscomputed.Weproposehereanalternative basisofinformationaggregationthatremainscomputationallytractabletoallowprocessingof largesetsofreads.Borrowingideasfromsourcecodinginsignalprocessing[15,16],clustering inmachinelearningandsourcecoding[17],fusioninsignalestimation[18]anddivide-and- conquerbasedshotgunsequenceassembly[19],ournovelapproachfirstsegregatesthefullset ofreadsintosubsets(inthek-mersfeaturespace),computesthemeanvectorforeachsubset, employsastandardmethod(suchasTaxy,QuikrorSEK)toestimatecompositionforeach PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 2/16 ARK subset,andfinallyfusestheseestimatesintoacompositionestimatejointlyforallthereads.To segregatethereadsintosubsets,wechoosetoemploytheK-meansclusteringalgorithm[20]. SincetheK-meansclusteringalgorithmissimpleandcomputationallyinexpensiveforarea- sonablenumberQofclusters(subsets),itcanbeusedtopartitionevenfairlylargesetsofreads intomore(intra)homogeneoussubsets.Byitsveryalgorithmicnature,K-meansclustering partitionsthefeaturespaceintoQnon-overlappingregionsandprovidesasetofcorrespond- ingmeanvectors.Thisiscalledcodebookgenerationinvectorquantization[15],originally fromsignalprocessing,codingandclustering.OurnewmethodistermedasAggregationof ReadsbyK-means(ARK).Fromthestatisticalperspective,theoreticaljustificationofARK stemsfromamodelingframeworkwithamixtureofdensities. Methods Summarizingreadsequencedatabysinglemeank-mercounts Inthemethoddescription,wedenotethenon-negativereallinebyR andstatisticalexpecta- + tionoperatorbyE[.].First,wedescribethepreviouslypublishedapproachofusingsinglek- mersummariesforeachsample.Letx2R4k andC denoterandomk-merfeaturevectorsand þ m mthtaxonomicunit,respectively.Givenatestsetofk-mers(computedfromreads),thedistri- butionofthetestsetismodeledas X M pðxÞ¼ pðC Þ pðxjC Þ; ð1Þ m m m¼1 wherewedenoteprobabilityfortaxonomicunitm(orclassweight)byp(C ),satisfying P m M pðC Þ¼1.NotethatfpðC ÞgM isthecompositionoftaxonomicunitsinthegiventest m¼1 m m m¼1 set(reads).Theinferencetaskistoestimatep(C )asaccuratelyaspossiblewithareasonable m computationalresource.Letusderivethemeanvector Z Z Z X X M M E½x(cid:2)¼ xpðxÞdx¼ x pðC Þ pðxjC Þdx¼ pðC Þ x pðxjC Þdx: ð2Þ m m m m m¼1 m¼1 ThemeanE[x]containsinformationaboutp(C )inthisprobabilisticformulation.Inpractice, m theinformationsummaryisobtainedbycomputingthesamplemeanfromthecompletesetof readsavailableforasample.Letusdenotethesamplemeanofk-mersfeaturevectorsofreads byμ2R4k withtheassumptionthatμ(cid:3)E[x].Severalmethods,suchasTaxy[2],Quikr[3], þ andSEK[14]usethesamplemeanμdirectlyasthemaininputtocomputethecompositionp (C ). m AggregationofreadsbyK-means(ARK) Fortheabove-describedprincipleofinformationaggregationfromthereadsbythemeanvec- torofk-mercounts,computationofthesamplemeanvectorisstraightforward.Thisconse- quentlyenableshandlingofaverylargeamountofreadswithlowcomputationalcost. However,wehypothesizethatthesamplemeanvectorcomputedfromthefullsetofreadsis notsufficientintermsofinformationcontenttofacilitateaccurateestimationofp(C ).Indeed, m sincetypicallythenumberoftrainingtaxonomicunitsMismuchlargerthanthenumberofk- mers(forexamplek=6),thesetofk-mervectorsforfC gM isnotlinearlyindependent,and m m¼1 soweriskreconstructingamixtureoftaxonomicunitsasasingletaxonomicunit.Hence,we segregatethereadsintoseveralsubsetsandcomputeasamplemeanvectorseparatelyforeach subset,assumingthatasetofsamplemeanvectorsismoreinformativethanasinglemean PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 3/16 ARK vector.Notethatinthecasewheretheresultingreadsubsetswerenotinpracticedistinctfrom eachotherintermsoftheirk-mercounts,thesubsequentcompositionestimatewouldeffec- tivelybeidenticaltotheestimateobtainedwithasingledatasummarydescribedinEqs(1)and (2). Letuspartitionthek-mersfeaturespaceR4k intoQnon-overlappingregionsR suchthat þ q [Q R ¼R4k and8q,r,q6¼r,R \R =;.SuchpartitionscanbeformedbyastandardK- q¼1 q þ q r meansalgorithmthattypicallyusesanearestneighborclassificationrulebasedonsquare Euclideandistancemeasure.Thenon-overlappingregionsR arecalledVoronoiregions.We P q defineP ≜Pr(x2R )satisfying Q P ¼1.Inpractice,P iscomputedas q q q¼1 q q number of feature vectors in R P ¼ q: ð3Þ q total number of feature vectors Itisremindedthatthefeaturevectorsarek-mers.Thedistributionofthefulltestsetandsub- setscanbewrittenas P pðxÞ¼ Q P pðxjx2R Þ; q¼1 q q P ð4Þ pðxjx2R Þ¼ M pðC jx2R Þ pðxjC ;x2R Þ; q m¼1 m q m q wherethefirstequationfollowsastandardmixturedensityframework.Now,ifwecanestimate p(C jx2R ),thenthefinalquantityofinterestp(C )canbeestimatedas m q m X Q pðC Þ¼ P pðC jx2R Þ: ð5Þ m q m q q¼1 Theestimationofp(C )inEq(5)isajudiciousfusionofp(C jx2R )throughalinearcombi- m m q nation.LetusnowderivethemeanvectorforR ,whichisaconditionalmeanvector q E½xjx2R (cid:2) R q ¼ x pðxjx2R Þdx ð6Þ P q R ¼ M pðC jx2R Þ x pðxjC ;x2R Þ dx: m¼1 m q m q ThemeanE[xjx2R ]containsinformationaboutp(C jx2R ).Inpracticeweusethesam- q m q plemeandenotedbyμ withtheassumptionthatμ (cid:3)E[xjx2R ].ComparingEqs(2)and q q q (6),fortheqthVoronoiregionR wecanestimatecompositionp(C jx2R )byusingan q m q appropriatecompositionestimationmethod,suchasTaxy,QuikrorSEK. Algorithms TheARKalgorithmcanbeimplementedbyfollowingsteps. 1. Dividethefulltestdatasetofk-mersintoQsubsets.TheregionR correspondstotheqth q subset. 2. Fortheqthsubset,computeP andthesamplemeanμ . q q 3. Fortheqthsubset,applyacompositionestimationmethodthatusestheinputμ ;estimatep q (C jx2R ). m q P 4. Estimatep(Cm)bypðCmÞ¼ Qq¼1Pq pðCm jx2RqÞ. TheARKmethodisdescribedusingaflow-chartinFig1.Theflow-chartshowsthemain componentsoftheoverallsystemandtheassociatedoff-lineandon-linecomputations.The PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 4/16 ARK Fig1.Aflow-chartoftheARKmethod. doi:10.1371/journal.pone.0140644.g001 crucialcomputational/statisticalchallengesrelatedtotheARKalgorithmoutlinedaboveareas follows: 1. WhatisanappropriatenumberofsubsetsQ? 2. HowshouldoneformthesubsetsR ? q Theabovepointsareinherenttoanysubsetformingalgorithm,andmoregenerallytoany clusteringalgorithm.Furthermore,findingoptimalregions(orclusters)requiresalternative optimizationtechniques.Givenapre-definedQ,typicallyaK-meansalgorithmperformstwo PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 5/16 ARK alternatingoptimizationsteps.Theseare:(1)givenasetofrepresentationvectorsfμ gQ (also q q¼1 calledcodevectors)formnewclustersfR gQ byanearestneighborrule(orformnewsubsets q q¼1 fromthefulldataset),(2)findthesetofclusterrepresentationvectorsgiventheassignmentof dataintoclusters.TheoptimalrepresentationvectoristhemeanvectorifsquaredEuclidean distanceisusedforthenearestneighborrule.TheK-meansalgorithminitializeswithasetof representativevectorsandrunsalternatingoptimizationuntilconvergenceinthesensethat theaveragesquaredEuclideandistanceisnolongerreduced.Inthepresentpaperweperform theclusteringusingapopularvectorquantizationmethodcalledtheLinde-Buzo-Gray(LBG) algorithm[15](orsourcecodingliterature).ThereareseveralvariantsoftheLBGavailable.In onevariant,thealgorithmstartswithQ=1andthenslowlysplitsthedenseandhighprobabil- ityclusterstoendupwithahighQ,suchthatitdoesnotdeviatesignificantlyfromanexponen- tiallydecayingbitrateversuscodingdistortion(rate-distortion)curve. InARK,weusethefollowingtwostrategiestosolvethetwochallengeslistedabove. 1. Optimal/deterministicstrategy:StartwithQ=1,whichcorrespondstotheprevious approachwithasinglemeanvectorasthedatasummary.ThensetQ=2forLBGalgorithm thatusessquareEuclideandistanceasthedistortionmeasure;theLBGalgorithmminimizes meanofsquareEuclideandistance(alsocalledmeansquareerror).Initializationisdoneby astandardsplitapproachwherethemeanvectorisperturbed.UsingQ=2,fRqg2q¼1is formedandweestimatep(C ).Subsequently,Qisincreasedbyoneuntilaconvergencecri- m terionismet.ForQ(cid:4)3,wealwayssplitthehighestrankingclusterintotwosubclustersand usetheLBGalgorithmtofindtheoptimalclusters.ThenumberofclustersQisnolonger increasediftheestimatedvaluesofp(C )differnegligiblyforQand(Q−1).Inpractice,the m stoppingconditionweuseisthatthevariationaldistancebetweenp(Cm)jQandp(Cm)j(Q−1) islessthanapredeterminedthreshold.Thisconditioncanbewrittenas P (cid:2) (cid:3) Mm¼1abs pðCmÞjQ(cid:5)pðCmÞjðQ(cid:5)1Þ <Z,withauserdefinedchoiceofthethresholdη.Note thatη2(0,1]providesanallowablelimitasascaledvariationaldistance(VD)betweentwo probabilitymassfunctions;atypicalchoiceofηcanbe0.01.Thisstrategyistypicallyfound toprovideconsistentperformanceimprovementinthesenseofestimatingp(C )withthe m increaseinQbythestepofone,butwithoutabsoluteguaranteeasthetargetoptimization strategyminimizesmeansquareerror.Furthermore,weallowanincrementinthenumber ofclustersuptoapre-definedmaximumlimitQ .TypicallyQ ispreferablychosenas max max anintegerpoweroftwo.AtypicalchoiceofQ canbebetween16to256. max 2. Non-optimal/randomstrategy:Forverylargetestsets,weuseapre-determinedQanda randomchoiceoftheQrepresentationvectors.ThenthefulltestsetisdividedintoQsub- setsbyanearestneighborruleandwecomputethesetofQmeanvectors{μ },andcluster q probabilities{P }.Eventhoughthisnon-optimalstrategydoesnotuseanalternatingopti- q mization(suchasLBGalgorithm)toformoptimalclusters,itdividesthefulltestsetinto sub-sets,resultinginasetofQlocalizedmeanvectorsacrossthefulltestset. FinallywementionthattheuseofK-meansisfullymotivatedbyitssimplicityandcompu- tationalease.UseofstatisticalK-meansintheformofexpectation-maximizationbasedmix- turemodeling(forexample,Gaussianmixturemodel)couldhavebeeninvestigated,but requiresmorecomputationtohandlealargedatasetofreads. Syntheticdatagenerationformethodevaluation ToevaluatetheperformanceoftheARKmethod,weconductedexperimentsforsimulated dataasdescribedbelow.Forthese,andallcomputationsreportedintheremainderofthe PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 6/16 ARK paper,weusedMatlabversionR2013b(withsomeinstancesofCcode),onadesktopworksta- tionwithanIntelCorei74930Kprocessorand64GbofRAM. Testdatasets(Reads). Wesimulated18016SrRNAgene454-likedatasetsusingtheRDP trainingset7andtheGrinderreadsimulator[21]targetingtheV1–V2andV3–V5variable regionswithreadlengthsfixedat250bpornormallydistributedwithameanof450bpand variance50bp.Readdepthswerechosentobeeither10K,100Kor250K,whilethreedifferent readdistributionswereused:powerlaw,uniform,andlinear.Diversitywassetateither50, 100,or500taxaandchimerapercentagesweresetto5%or35%.TheBalzermodel[22]was chosenforhomopolymererrors,andcopybiaswasincludedwhilelengthbiaswasexcluded. Trainingdataset(Reference). InourARKexperimentsweusedQuikr[3]andSEK[14] toestimatep(C jx2R ).TheRDPtrainingset7wasusedasthebasereferencedatabasefor m q bothQuikrandSEK.NotethatthisisthesameasdatabaseD utilizedin[3].Whileinthe small mainmanuscriptweusethesamedataforbothtrainingandtestingthebasemethods(Quikr andSEK),inS1Fileweincluderesultsobtainedwhenthetestdatasetshavetaxaabsentfrom thetrainingdatabase(thatis,sistertaxahavebeenexcludedfromthetrainingdatabase).As expected,allmethodsexperiencealossinreconstructionaccuracywhensistertaxaareabsent, butARKQuikrandARKSEKarestillmoreaccuratethanRDP’sNBC. Realbiologicaldata TofurtherevaluateARK,wealsoutilized28IlluminaMiSeq16SrRNAgenehumanbody-site associatedsamples,plusonenegativecontrolsample.Therealdataconsistofatotalofover5.7 Mreadsdistributedoverthreevariableregions(V1–V2,V3–V4,andV3–V5)aswellastwo bodysites(vaginaandfeces). ForeachofthesesamplesDNAwasextractedusingtheFastDNASPINKitforSoilwitha FastPrepmachine(MPBiomedicals)followingthemanufacturer’sprotocol.16SrRNAgene ampliconsweregeneratedfromtheDNAextractionsusingtheprimercombinationslistedin Section5ofS1File.TheQ5High-fidelitypolymerasekit(NewEnglandBiolabs)wasusedto amplifythe16SrRNAgenes,andPCRconditionswereasfollows:98°Cfor2minutes,followed by20cyclesof98°Cfor30seconds,50°Cfor30secondsand72°Cfor1minute30seconds,fol- lowedbyafinalextensionstepat72°Cfor5minutes.FollowingPCR,theampliconswerethen purifiedusingtheWizardSVGelandPCRClean-Upkit(Promega,UK).Sequencingof16S rRNAgeneampliconswascarriedoutbyIlluminaInc.(LittleChesterford,UK)usingaMiSeq instrumentrunfor2x250(V1–V2),300+200(V3–V4)and400+200(V3–V5)cycles.These datahavebeensubmittedtotheEuropeanNucleotideArchiveusingtheaccessionnumber PRJEB9828. Aftertrimming20bpofprimeroffeachread,thesequencesweretrimmedfromtheright untilallbaseshadaqualityscoregreaterthan27.Thisreducedthetotalnumberofreadsto approximately4M,andreducedthemeanreadlengthfrom315bpto257bp.Wethenutilized allresultingunpairedreads(bothforwardandreverse)includinganyduplicatesequences.We includeinS1Fileresultsforanalternativeerror-correctionprotocol,aswellasresultsfor assemblingpaired-endreads(FigsEandFinS1File). EthicsStatement Forhumanbody-siteassociatedsamples,thefaecalsamplesusedwerenotpartofaclinical studysothereisnocorrespondingethicalapprovalorwrittenconsent.Therewerenoclinical records.Thesamplesareanonymisedandde-identified.Further,vaginalsampleswerecol- lectedaspartofanobservationalmicrobicidefeasibilitystudy.Thestudywasapprovedbythe EthicsCommitteesoftheNationalInstituteforMedicalResearchinTanzaniaandLondon PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 7/16 ARK SchoolofHygieneandTropicalMedicine,andallparticipantsgavewritteninformedconsent. Allrecordswereanonymizedandde-identifiedpriortothisretrospectiveanalysis. Results Performancemeasureandrelevantmethods Asaquantitativeperformancemeasure,weusevariationaldistance(VD)tocomparebetween knownproportionsoftaxonomicunitsp=[p(C ),p(C ),...,p(C )]tandtheestimatedpropor- 1 2 M tionsp^ ¼½p^ðC Þ; p^ðC Þ;...;p^ðC Þ(cid:2)t.TheVDisdefinedas 1 2 M VD¼0:5(cid:6)kp(cid:5)p^k 2½0;1(cid:2): 1 AlowVDindicatesmoresatisfactoryperformance. ForARK,weusedbothSEKandQuikrastheunderlyingestimationmethodsappliedto eachcluster.Theserecentmethodswerechosenasappropriaterepresentativesoffastandaccu- ratesparsesignalprocessingapproaches.Ak-mersizeofk=6wasusedforbothQuikrand SEK. AspartoftheSEKpipeline,sequencesinagivendatabasearesplitintosubsequences.We selectedfromthe10,046sequencesintheRDPtrainingset7allsequenceslongerthan700bp inlength,andthensplitthesequencesintosubsequencesoflength400bpwith100bpofover- lap.ThiscorrespondstosettingL =400andL =100asspecifiedin[14].WeusedtheSEK w p algorithmOMPþ;1withparametersasin[14]. sek ResultsforSimulatedData Effectofincreasingnumberofclusters. Wefirstinvestigatehowanincreaseinthenum- berofclustersQaffectsthecompositionreconstructionfidelityandalgorithmexecutiontime forthesimulateddata.Onlythenon-optimal/randomstrategyofK-meansclusteringwasuti- lizedaswefoundthattheperformanceimprovementforoptimal/deterministicstrategywas insignificantgiventheresultingincreaseinexecutiontime(resultsnotshown).Averagingthe VDerroratthegenusleveloverall180simulatedexperiments,itwasfoundthatcombining ARKwithbothSEKandQuikrresultedinapowerlawkindofdecayofVDerrorasafunction ofthenumberofclusters(Fig2).ARKcausesasubstantialincreaseinreconstructionfidelity whichcanbeseensinceusingARKSEKorARKQuikrwithoneclusterisequivalenttorun- ningSEKorQuikrwithnomodification. Sincetheunderlyingalgorithm(SEKorQuikr)mustbeexecutedoneachclusterformedby theK-meansclustering,weexpectthetotalalgorithmexecutiontimetoincreasebyafactor equaltothenumberofchosenclusters.AsseeninFig3,bothalgorithmsexperiencean increaseinexecutiontimeroughlyproportionaltothenumberofclusters. Fixednumberofclusters. Asseenabove,giventhedecreaseinVDasafunctionofthe numberofclusters,wealsofixedthenumberofclustersQto75tocomparetheperformance oftheunderlyingalgorithmswithandwithoutARK.Therewasasignificantdecreaseinthe VDerror(asseeninFig4)atthecostofanincreaseinexecutiontime(asseeninFig5).How- ever,giventhespeedofbothQuikrandSEK,weexpecttheadditionofARKwillnotresultin prohibitivelylongexecutiontimes.Indeed,asseenabove,onrealbiologicaldatabothARK QuikrandARKSEKarestillseveralhoursfasterthantheRibosomalDatabaseProject’sNaïve BayesianClassifier(RDP’sNBC)[1],evenwhenusing75clusters. PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 8/16 ARK Fig2.ResultsfortherandomK-meansclusteringonthesimulateddata.MeanVDerroratthegenus levelasafunctionofthenumberofclusters.NotetheimprovementthatARKcontributestoeachmethod. doi:10.1371/journal.pone.0140644.g002 Fig3.ResultsfortherandomK-meansclusteringonthesimulateddata.Meanexecutiontimeincrease (factorgivenincomparisontorunningSEKorQuikrintheabsenceofARK)asafunctionofnumberof clusters.Thedashedlinerepresentsalinewithslope1. doi:10.1371/journal.pone.0140644.g003 PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 9/16
Description: