MolecularEcologyResources(2013)13,337–340 doi:10.1111/1755-0998.12063 Pool-hmm: a Python program for estimating the allele frequency spectrum and detecting selective sweeps from next generation sequencing of pooled samples SIMON BOITARD,* ROBERT KOFLER,† PIERRE FRANC(cid:1)OISE,* DAVID ROBELIN,* CHRISTIAN € SCHLOTTERER† andANDREAS FUTSCHIK‡ *LaboratoiredeG(cid:2)en(cid:2)etiqueCellulaire,INRA,24ChemindeBordeRouge,AuzevilleCS52627,CastanetTolosanCedex31326, France,†Institutfu€rPopulationsgenetik,VetmeduniVienna,Veterin€arplatz1,WienA-1210,Austria,‡InstituteofStatisticsand OperationsResearch,UniversityofVienna,Universit€atsstrasse5/9,WienA-1010,Austria Abstract Due to its cost effectiveness, next generation sequencing of pools of individuals (Pool-Seq) is becoming a popular strategy for genome-wide estimation of allele frequencies in population samples. As the allele frequency spectrum provides information about past episodes of selection, Pool-seq is also a promising design for genomic scans for selection. However, no software tool has yet been developed for selection scans based on Pool-Seq data. We intro- ducePool-hmm,aPythonprogramfortheestimationofallelefrequenciesandthedetectionofselectivesweepsina Pool-Seq sample. Pool-hmm includes several options that allow a flexible analysis of Pool-Seq data, and can be run in parallel on several processors. Source code and documentation for Pool-hmm is freely available at https://qgsp.jouy.inra.fr/. Keywords: allele frequency spectrum, hidden Markov models, next generation sequencing, pooled DNA, selective sweeps. Received10July2012;revisionreceived26November2012;accepted29November2012 of the reference genome arise from a random sampling Introduction amongthe pooledchromosomes,soobservations canbe Thedetectionofgenomicregionsthatevolvedundernat- redundant. Second, sequencing error probabilities are ural selection is an important topic in population genet- larger than with classic Sanger sequencing (Luo et al. ics. The case of hard sweeps, where a new mutant goes 2011)andarevariableamongandwithinreads. tofixationinapopulationduetostrongdirectionalselec- Recently,Boitardet al.(2012)proposedahiddenMar- tion, has received particular attention (Kim & Stephan kov model (HMM) for detecting sweeps based on Pool- 2002; Nielsen et al. 2005; Jensen et al. 2007; Boitard et al. Seqdata.Thismethodinvolvescomputingthelikelihood 2009;Alachiotiset al.2012). of the observed read information conditional on allele Theadventofnextgenerationsequencing(NGS)tech- counts in the pool, for each genome position. Down- nologies provides a new dimension to such genome stream analyses—estimation of the background allele scansforselection.Inspiteoftheconsiderablereduction frequency spectrum (AFS) and detection of selective in the cost of sequencing, the sequencing of individuals sweeps—are then based on these likelihoods. Uncer- onapopulationscaleremainsexpensive.However,hard tainty concerning the true allele frequencies in the pool, sweepscanbedetectedusingonlythesampleallelefre- whichmighttypicallybehigherforsiteswithlowcover- quencies along the genome, and this information can be age or bad quality scores, is thus taken into account in obtainedbysequencingDNAfromapoolofindividuals the analyses. Possible biases arising from unequal DNA (Pool-Seq). Although Pool-Seq is considerably cheaper concentration or quality among individuals are not thanthesequencingofindividuals,therearesomemeth- accounted for by this method, but these effects are odological challenges associated with the analysis ofthe expected to be limited for large sample sizes (Futschik resulting data. First, the reads covering a given position et al.2010). Inthiswork,weproposeaPythonprogram,denoted Correspondence:SimonBoitard,Fax:+33561285308; Pool-hmm, that implements the method of Boitard et al. E-mail:[email protected] (2012). The two main applications of this program are ©2013BlackwellPublishingLtd 338 S. BOITARD ET AL. AFS estimation and detection of selective sweeps, in a frequency estimation. In addition, the estimated allele given region. These two applications are implemented frequency at genomic positions with low coverage is independently, so it is possible, for instance, to detect essentiallydeterminedbythepriorAFS. selective sweeps based on a background AFS that is Our likelihood-based approach for estimating the specifiedbytheuser.Inaddition,Pool-hmmprovidesan AFS in a region or allele frequencies at each genomic estimationofallelefrequenciesateachgenomicposition, positionisanalternativetodiscardingbasecallsorgeno- whichcanbeusedinotherpopulationgeneticssoftware. micpositionsbasedonarbitrarythresholds,andhasthe These model-based estimations are preferable to na€ıve advantage of using the available information in a more estimates obtained by computing the ratio of allele continuous way. One important implication is that we countsataposition,asdiscussedbelow. canestimatewithoutbiastheproportionofsingletonsor otherlowallelecounts,evenatlow(downto0.5 9 )per chromosome coverage (Boitard et al. 2012), which is Inputdata clearly not possible when thresholding genomic posi- The main input of the program is the Pool-Seq data, tionsbasedonthenumberofalternativealleles. which must be provided in the SAMtools pileup format Pool-hmm can be also used to compare the AFS of (Li et al. 2009). Any alignment file in BAM or SAM for- genomic regions with different annotations (e.g. introns mat can easily be converted to the pileup format using vs. exons). We provide a script that filters the input thesamtoolsmpileupcommand(without-uor-goptions), pileup file for any feature present in an annotation .gtf independently of the software used to align or prepro- file. Pool-hmm then infers the AFS based on the filtered cessthereads. pileup. Note that the allele frequencies considered by Pool- hmmaresampleallelefrequencies(from0ton)andnot Allelefrequencyestimation population allele frequencies (from 0 to 1), see for The method assumes an infinite sites model, where at instance the AFS in Fig. 1. Population and sample most two alleles can be observed at each genomic posi- frequencies are closely related and essentially provide tion, the ancestral allele and the derived allele. By the same information, but inference based on coalescent default, the ancestral allele is considered unknown, and theory,asthederivationsofNielsenet al.(2005)thatare Pool-hmm focuses on folded (rather thanderived) allele used in our sweep detection model, naturally involve frequencies.Twootherstrategiesmightalsobespecified sampleallelefrequencies. (option —ancestral-allele). First, the reference allele pro- videdinthepileupcanbeconsideredtobetheancestral Detectionofselectivesweeps allele. Second, ancestral alleles at each genomic position can be provided in an additional column of the pileup In the HMM of Boitard et al. (2012), each genomic posi- file. tion is assumed to have a hidden state, which can take For a given genomic region, Pool-hmm estimates the oneofthethreefollowingvalues:‘Selection’,forthesites derived or folded AFS using an expectation maximiza- thatareveryclosetoasweptsite,‘Neutral’,forthesites tion(EM)algorithm.ThestartingvalueforthisEMisthe thatarefarawayfromanysweptsiteand‘Intermediate’ expected AFS under a model with constant population forthesitesinbetween.Thesethreevaluesareassociated size and scaled mutation rate h = 4 Nl = 0.005, which withdifferentAFS.The‘Neutral’AFScorrespondstothe canbemodifiedbytheuser(option–theta). background (whole genome) AFS of the population. It Pool-hmm can also estimate the derived (or minor) can either be estimated from the Pool-Seq data or pro- allele frequency at each position in a specified region vided using option —spectrum-file. The ‘Intermediate’ (option—estim). For this estimation, the previously esti- and‘Selection’AFSarethendeducedfromthe‘Neutral’ mated AFS (or any other AFS provided by the user) is AFSusingthederivationsinNielsenet al.(2005),andare consideredasaprior.Atagivenposition,itiscombined typically more skewed towards low and high allele fre- withthelikelihoodoftheobservedreadstoobtainapos- quencies. The hidden states form a Markov chain along terior distribution of the derived (or minor) allele fre- the genome with a per-site probability q of switching quency. The estimated frequency is the one maximizing state(argument–k).Theobservedvariableateachgeno- this posterior distribution. This estimation procedure is mic position is a vector summarizing the information morereliable thanadirect estimation basedon theratio providedbyreadsatthisposition. ofallelecountsatapositionbecauseitaccountsforaddi- In the HMM described above, a selective sweep is tional properties of the read data, namely the coverage detected if the hidden state ‘Selection’ is inferred for and the base qualities at a given position. For instance, a window of sites. Using Pool-hmm, this inference low-quality base calls have less influence on the allele (option —pred) relies on two different criteria. First, the ©2013BlackwellPublishingLtd POOL-HMM: A PROGRAM FOR ANALYSING POOLED NGS DATA 339 Fig.1 AFSinaquailsampleofn=20chromosomes,computedfromarandomsampleofgenomicpositions(emptycircles),genomic positionswithinexons(fullcircles,left panel),genomicpositionswithinsweepwindow1(emptytriangles,rightpanel)orgenomic positionswithinsweepwindow2(plus,rightpanel).Probabilitiesof0-and20-derivedallelesarenotshownbecausetheyarenotatthe samescale.Thelargeprobabilityobservedfor19-derivedallelesmaybeduetothemisspecificationoftheancestralalleleatasmall proportionofsegregatingsites.Sucherrorsareexpectedifthereissharedpolymorphismbetweenquailandchicken. sequence of hidden states maximizing the likelihood of Example the HMM is computed using the Viterbi algorithm (Ra- biner 1989) and returned in a file with suffix .pred. A WeappliedPool-hmmtoasampleof10quailsthatwere summaryofthesweepwindowsdetectedfromthisalgo- sequenced ina single pool at 20 9coverage. Reads were rithm (genomic regions with predicted hidden state aligned against the chicken genome release WUGSC2.1 ‘Selection’)isalsoreturnedinafilewithsuffix.stat.Sec- usingglint(Courcelleet al.2008)andconvertedtopileup ond, the posterior probability of hidden state ‘Selection’ format using samtools (Li et al. 2009). We focused on is computed for each genomic position using the for- chromosome1((cid:1)200-Mblong,with20millionobserved ward–backwardalgorithm,andisreturnedinafilewith segregating sites) and conducted the analyses described suffix.post. above. We used option —a ‘reference’, thereby assuming thatquailancestralallelesarethosethatarefoundonthe chicken reference genome. The execution time of all Parallelization Pool-hmmcommandswithone,fouroreightprocessors Analysis of whole genome NGS data can be very time- isgiveninTable 1.Asexpected,itdecreasessignificantly demanding.TospeeduptheexecutionofPool-hmm,we when the number of processors increases, although not parallelizedthepartsofthecodewherethelikelihoodof linearly because some parts of the code, as for instance the observed read information conditional on the num- thequeuemanagement,arenotparallelized.Usingeight ber of derived alleles in the pool is computed. These cores, a standard analysis involving AFS estimation and computations represent the largest computational cost sweepdetectiontookabout5 h.Additionalsweepanaly- (otheralgorithms,suchasEMorViterbi,areveryfastin ses with a different sensitivity (parameter -k) are then comparison), but they can be performed independently very fast because they can use intermediate results (the for each genomic position. Taking advantage of this HMM emission probabilities at each observed segregat- property, our strategy is thus to build a queue of 10 kb ingsite)thatarestoredfromthefirstanalysis.Notealso blocks that are distributed for analysis to different pro- that AFS estimation is much faster than allele frequency cessors. The management of this queue and the coordi- estimationbecauseitisbasedononly2%ofthegenomic nation of all processors are implemented using the positionsofchromosome1(chosenatrandom).Thepro- Pythonmultiprocessinglibrary. portion of genomic positions used for AFS estimation This parallelization strategy enables an optimized use canbedefinedusingoption—ratio. ofmultipleprocessorsonasinglemachine.Whenrunning TheAFSonchromosome1isshowninFig. 1.Forcom- Pool-hmmonacomputercluster,wealsorecommendcut- parison, Fig. 1 also shows the AFS obtained using only ting the whole genome data into large regions (typically genomicpositionslocatedwithinexons(asthenumberof chromosomes)andanalysingtheseregionsindependently thesepositionsismuchsmaller,weusedallofthemrather ondifferentnodes.Thiscanbecombinedwithparalleliza- thanonly2%).WefilteredthepileupwithPool-hmm,using tionwithineachnode,asdescribedabove. agtffilecorrespondingtothelatestEnsemblannotationof ©2013BlackwellPublishingLtd 340 S. BOITARD ET AL. Table1 ExecutiontimeofPool-hmmfortheanalysisofchromosome1inaquailsampleofn=20chromosomes.Resultsareprovided forseveraltypesofanalysesandforone,fouroreightavailableprocessorsonacomputingcluster.Pool-hmmcommandscorrespond- ingtotheseanalysesinthecaseofoneavailableprocessorarelistedbelowthetable. Numberofprocessors AFSestimation* Firstsweepprediction† Additionalsweepprediction‡ Allelefrequencyestimation§ 1 6h4min9s 12h21min36s 0h10min14s 31h7min47s 4 1h57min46s 4h32min57s 0h09min21s 7h47min9s 8 1h33min30s 3h15min53s 0h07min06s 4h17min14s *Pythonpool-hmm.py–input-filequail-n20-a‘reference’–only-spectrum–theta0.005–ratio50. †Pythonpool-hmm.py–input-filequail-n20-a‘reference’–pred–spectrum-filequail–k0.0000000001. ‡Pythonpool-hmm.py–input-filequail-n20-a‘reference’–pred–emit-file–k0.0000000001. §Pythonpool-hmm.py–input-filequail-n20-a‘reference’–estim–spectrum-filequail. thechickenassemblyusedforthealignment(url:ftp://ftp. Acknowledgements ensembl.org/pub/release-68/gtf/gallus_gallus/).Exonic regions have an overall deficit of segregating sites, but We thank ChristineLeterrierfor providingthe quail data, and apartfromthattheshapeoftheAFSintheseregionsisclose AndreaRauforhercarefulreadingofthemanuscript.Christian tothatobtainedfromrandomregions. Schlo€ttererissupportedbygrantsoftheAustrianScienceFund (FWF, P19467). Travels between France and Austria were Seventy-four sweep windows were detected on chro- fundedbythePHCAmadeusgrant25154QH. mosome 1. The .stat file reporting these regions is pro- vided as supporting information. The evidence for each sweep window can be assessed using the third column References ofthisfile,whichrepresentsthemaximumoftheposter- ior probability of hidden state ‘Selection’ along the win- AlachiotisN,StamatakisA,PavlidisP(2012)Omegaplus:ascalabletool dow (in log scale). To illustrate the specificity of the forrapiddetectionofselectivesweepsinwhole-genomedatasets.Bio- informatics,28,2274–2275. sweepwindowsdetectedbyourapproach,weestimated BoitardS,Schlo€ttererC,FutschikA(2009)Detectingselectivesweeps:a theAFSinthesweepwindowscorrespondingtothetwo newapproachbasedonhiddenMarkovmodels.Genetics,181,1567– first lines of the .stat file (Fig. 1, right panel). Sweep 1578. window 1 was characterized by an excess of low- and BoitardS,Schlo€ttererC,NolteV,PandeyR,FutschikA(2012)Detecting selective sweeps from pooled next generation sequencing samples. high-frequency alleles, whereas sweep window 2 was MolecularBiologyandEvolution,29,2177–2186. characterized by a general deficit of segregating sites. CourcelleE,BeausseY,LetortSetal.(2008)Narcisse:amirrorviewof Indeed,thedetectionmethodimplementedinPool-hmm conservedsyntenies.NucleicAcidsResearch,36,D485–D490. makes use of both the density of segregating sites and Futschik A, Schlo€tterer , C (2010) Massively parrallel sequencing of pooled DNA samples – the next generation of molecular markers. the allele frequency pattern among segregating sites to Genetics,186,207–218. distinguish sweep regions from neutral regions. Further JensenJ,ThorntonK,BustamanteC,AquadroC(2007)Ontheutilityof details on this point and comparisons with alternative linkagedisequilibriumasastatisticforidentifyingtargetsofpositive approachescanbefoundinBoitardet al.(2009,2012). selectioninnonequilibriumpopulations.Genetics,176,2371–2379. KimY,StephanW(2002)Detectingalocalsignatureofgenetichitchhik- ingalongarecombiningchromosome.Genetics,160,765–777. Li H, Handsaker B, Wysoker A etal. (2009) The sequence alignment/ Conclusion map(sam)formatandsamtools.Bioinformatics,25,2078–2079. LuoL,BoerwinkleE,XiongM(2011)Associationstudiesfornext-genera- Pool-hmmisthefirstsoftwaretoolforPool-Seqdatathat tionsequencing.GenomeResearch,21,1099–1108. provides a probabilistic allele frequency estimation and Nielsen R, Williamson L, Kim Y, Hubisz M, Clark A, Bustamante C detects selective sweeps on a genomic scale. The imple- (2005) Genomic scans for selective sweeps using SNP data. Genome mented statistical algorithms account for two important Research,15,1566–1575. RabinerL(1989)Atutorialonhiddenmarkovmodelsandselectedappli- features of Pool-Seq data, the random sampling among cationsinspeechrecognition.ProceedingsoftheIEE,77,257–287. chromosomes within the pool and sequencing errors. Pool-hmm includes several options that allow a flexible analysisofPool-Seqdata. S.B.,D.R.,C.S.andA.F.designedthesoftware.S.B.,R.K. and P.F. implemented the software. S.B., R.K., A.F. and C.S.wrotethemanuscript. Softwareavailability Source code and documentation for Pool-hmm is freely available at https://qgp.jouy.inra.fr/. Several test data setsarealsoprovided. ©2013BlackwellPublishingLtd