Almeidaetal.BMCGenomics2014,15:1101 http://www.biomedcentral.com/1471-2164/15/1101 RESEARCH ARTICLE Open Access Construction of a dairy microbial genome catalog opens new perspectives for the metagenomic analysis of dairy fermented products Mathieu Almeida1,2,3,9, Agnès Hébert4, Anne-Laure Abraham1,2, Simon Rasmussen5, Christophe Monnet4,6, Nicolas Pons3, Céline Delbès7, Valentin Loux8, Jean-Michel Batto3, Pierre Leonard3, Sean Kennedy3, Stanislas Dusko Ehrlich3, Mihai Pop9, Marie-Christine Montel7, Françoise Irlinger4,6 and Pierre Renault1,2* Abstract Background: Microbial communities oftraditionalcheeses are complex and insufficiently characterized. The origin, safety and functional rolein cheese making of thesemicrobialcommunities are still not well understood. Metagenomic analysis of these communities by high throughput shotgun sequencing is a promising approach to characterize their genomic and functional profiles.Such analyses, however, critically depend onthe availability of appropriate reference genomedatabases against which thesequencing reads can be aligned. Results: We built a reference genomecatalogsuitable for shortreadmetagenomic analysis using a low-cost sequencing strategy. We selected 142 bacteriaisolated from dairy products belonging to 137 different species and 67 genera, and succeeded to reconstruct thedraft genome of 117 of them ata standard or highqualitylevel, including isolates from thegenera Kluyvera, Luteococcus and Marinilactibacillus,still missing from public database. Todemonstrate thepotentialof this catalog, weanalysed themicrobialcomposition ofthesurface oftwo smear cheeses and one blue-veined cheese, and showed that a significant partof themicrobiota ofthese traditional cheeses was composed of microorganisms newly sequenced inour study. Conclusions: Our study provides data, which combined withpubliclyavailablegenome references, represents the most expansive catalog to date of cheese-associated bacteria. Using this extended dairy catalog, we revealed the presence intraditionalcheese of dominant microorganisms not deliberately inoculated, mainly Gram-negative genera such as Pseudoalteromonas haloplanktis or Psychrobacter immobilis,that may contribute to the characteristics ofcheeseproduced through traditional methods. Keywords: Genomic libraries, Genome sequencing, Sequence assembly, Next-generation sequencing, Comparative genomics, Metagenomics, Food bacteria, Dairyecosystems Background 1010cellspergramanditisgenerallyacceptedthatmostof Cheesesharbouradiversemicrobialcommunity,composed them are cultivable in laboratory growth media [6-9]. An ofaresident“houseflora”,thatinteractswithstrainsdelib- inventory of microorganisms with a history of use in food erately inoculated as starter or adjunct cultures [1-5]. The fermentationswasestablishedrecently[10].Itcontains195 cheese microorganisms mainly consist of Firmicutes (lactic bacterial species (30 genera) and 71 yeast and mould acid bacteria, staphylococci), Actinobacteria (coryneform species(35genera).Amongthesebacterialspecies,only80 bacteria), Proteobacteria, Bacteroidetes, yeasts and moulds. (41%) comprise at least one isolate for which a genome Theirconcentrationinthefinalproductsometimesexceeds sequenceisolatedfromfoodisavailable,withalmosthalfof them within Lactobacillus species (NCBI database, May *Correspondence:[email protected] 2014). Furthermore, this list cannot be considered as ex- 1InstitutNationaldelaRechercheAgronomique,UMR1319MICALIS,78352 haustiveofthecheesemicrobialdiversity,sinceoccurrences Jouy-en-Josas,France 2AgroParisTech,UMRMICALIS,78352Jouy-en-Josas,France of species previously undetected in milk and cheese are Fulllistofauthorinformationisavailableattheendofthearticle ©2014Almeidaetal.;licenseeBioMedCentral.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative CommonsAttributionLicense(http://creativecommons.org/licenses/by/4.0),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginalworkisproperlycredited.TheCreativeCommonsPublicDomain Dedicationwaiver(http://creativecommons.org/publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle, unlessotherwisestated. Almeidaetal.BMCGenomics2014,15:1101 Page2of16 http://www.biomedcentral.com/1471-2164/15/1101 periodicallyreported [11-13]and isolates affiliatedto novel relevance of these new genomes to the understanding of taxa characterized [4]. Several species, such as Corynebac- food microbiota, we used the newly created catalog to terium casei, Microbacterium gubbeenense, Arthrobacter analysethemicrobiotaofthree cheesesurfacessequenced arilaitensis, Arthrobacter bergerei, Agrococcus casei, Myce- throughwholemetagenomicsequencing. tocola reblochoni andVibrio caseiappear to beendemicin the cheese habitat and the environment of cheese manu- Results facturing [14-17]. Environmental reservoirs of cheese mi- Creationofadairyreferencegenomecatalog crobial diversity such as milk, cow teat, human skin, brine After bibliographic investigation for bacterial species oc- baths, ripening room air, wooden vessels and shelves on curring in dairy products, we collected 142 dairy bacteria which the cheese rests during ripening, have been identi- of variousorigins.The origin of theisolates and their tax- fied, but their microorganism content remains largely onomy are shown in Additional file 1: Table S1. The col- uncharacterised [18,19]. Microbial communities of cheeses lection comprised 36% Gram negative bacteria, 35% low and dairy environments also represent largely unexplored GC Gram positive bacteria and 29% high GC Gram posi- reservoirs of genetic and metabolic diversity with potential tive bacteria. Among the 67 corresponding genera, four beneficial use for fermented food production. An increase aregeneraforwhichnogenomesequencewasavailablein inthenumberofgenomesequencesofdairybacteriaisalso NCBIdatabases(May2014release):Kluyvera,Luteococcus, useful for a better understanding of the genetic determi- Marinilactibacillus, and Mycetocola. The distribution of nants involved in the adaptation to the dairy habitat and thestrainsaccordingtothetypeofdairyproductandtheir thegenerationoffunctionalproperties[20-23]. geographicoriginisshowninFigure1. In recent years, high-throughput sequencing technolo- We designed a low-cost library sequencing strategy, by gies and information technologies have allowed the devel- poolingthebacterialgenomestogetherinacontrolledway opment of new approaches for studying the genetic priortosequencinginordertoreducelibraryconstruction diversityofmicrobialcommunities.Amongthese,metage- costs. Unlike other approaches that rely on barcoding, nomics is a powerful tool for assessing the phylogenetic we de-convolve the individual genomes by using a co- diversity of complex microbial assemblages present in abundanceapproachasdescribedinNielsenetal.[35]and samples such as soil, sediment, food products or water LeChatelieretal.[36](seeAdditionalfile2supplementary [24] and for exploring the functional properties of their document for an example of the clustering procedure). dominant populations. The characterization of metage- This strategy involves a two-step procedure, as follows. In nomicdatasetsreliesontheuseofreferencedatabasesthat thefirststep,theDNAofthe142strainswasmixedinfive contain sequences of known origin and phenotype. Many pools of about 30 strains each. To reduce the risk of chi- ofthesestudiesarecarriedoutbypyrosequencingofsingle meras each pool contained a mix of genomes from diver- target genes, such as 16S rDNA sequences, that provide gent genera (see Additional file 3: Table S2). The pools information restricted to the phylogenetic composition of were sequenced using Illumina paired-end sequencing, the samples [25-31]. On the other hand, shotgun sequen- andthenassembledtoproducefiveseparatecollectionsof cing of whole community DNA provides additional infor- contigs. In a second step, the genomes were redistributed mation about the functions performed by the microbial into six pools of ~90 strains each, which were sequenced community [32,33]. The length of the reads generated by using SOLiD sequencing, and the resulting reads were current high throughput sequencing technologies is too mapped to the contigs generated in the first step in order short to allow accurate comparative analyses against dis- to estimate the coverage of each contig within the stage 2 tantly related genomes, thus requiring the availability of pools. We used SOLiD sequencing due to the availability reference genomes closely related to organisms from the ofthisplatformatourinstitutionandthelowercostofse- environment being studied. Currently, the international quencing.However,otherlow-costapproachesforestimat- genome databases are biased towards model organisms ingthecoverageofstage1contigswithinthestage2pools and pathogens, and, according to Huson et al. [34], up couldbeused,suchas,e.g.,shortrunsonIlluminainstru- to 90% of the sequences of a metagenomic dataset may ments.Theresultingcontigcoveragematrixwasthenused remain unidentified due to the lack of adequate refer- to cluster together the contigs with correlated coverage ence sequences. The sequencing of several hundred profiles, each cluster corresponding to one of the original genomes is no longer a technical issue. However, this strains(see co-abundance clusteringmethodsection).The process still remains costly, mainly due to the cost of stage 1 pool assemblies contained each less than 20,000 the construction of individual libraries for each genome contigs with a mean contig size of 40 kbp (see Additional being sequenced. file 3: Table S2, Illumina assembly pool sheet). After the In the present study, we selected and sequenced 142 clustering procedure, more than 80% of the Illumina bacterial strains of dairy origin that belong to 137 differ- contigs comprising more than 96% of the total length of ent species and 67 genera. In order to exemplify the contigscouldbeattributedtoindividualstrains. Almeidaetal.BMCGenomics2014,15:1101 Page3of16 http://www.biomedcentral.com/1471-2164/15/1101 Figure1Originofthe142selecteddairybacterialisolatesinfunctionofthetypeofdairyproduct(A)andthegeographicarea(B). We assessed the quality of the clustering procedure by increase contig size and recover eventual missing parts of mapping the Illumina contigs to the closest NCBI genome the genomes (see Methods section). Interestingly, the opti- sequences(usingBLASTN[37],identitythreshold>=90%). mized re-assembly process halved the number of contigs Furthermore,oneorganisminourcollection-Arthrobacter and increased slightly the total contig size of genomes for arilaitensis Re117 – had already been sequenced and was which close references were available, such as genomes of addedtoonesequencingpoolinordertovalidatetheclus- the genera Leuconostoc and Streptococcus. In order to fur- tering procedure. We evaluated the correctness of contig therassessthequalityofthedraftgenomes(includingthose clustersbycomputingtwomeasures:thedominantgenus- without near-neighbors in public databases), we relied the percentage of the contigs that could be mapped to on the six quality submission criteria established by the relatedgenomesbelongingtothesamegenusastheorgan- Human Microbiome Project (HMP) [38], plus two add- ismrepresentedbythe cluster;andthereferencecoverage- itional criteria that identify potential miss-assignment percentageofthetotalcontigsizeofthepoolthatcouldbe events during the clustering step: phylogenetic marker re- mapped to a genome from the dominant genus. We dundancy and tetranucleotide homogeneity. For the HMP performedthisanalysisforthe53draftgenomesforwhich draft genome quality criteria, 5 criteria correspond to con- we could identify at least five genomes belonging to the tigandscaffoldassemblylengthandcoverage(seeMethods same genus in the NCBI database. For these genomes, section).ThelastHMPqualitycriteriachecksthepresence themeandominantgenusassignmentandreferencecover- of 99 bacterial essential genes [39], which gives an indica- age percentage were of 97.7% and 89.5%, respectively (see tionoftheproportionofthegenomethathasbeenassem- Additional file 4: Table S3). The dominant genus assign- bled (see Method section for the threshold used and ment indicated that only 2.3% of the total length of the supplementary information for the additional criteria). The genomes may have been mis-assigned by the clustering phylogenetic marker redundancy tests the redundancy of procedure. The reference coverage indicated that on aver- 40proteinmarkersexpectedtobeconservedinallbacteria, age 10.5% of the length of each genome may be missing. not laterally transferable and not duplicated within a gen- The potentially missing information is likely present in the ome[40].Thetetranucleotidehomogeneityteststhehomo- 4% of the fragments that were not assigned and contain geneity of the tetranucleotide signature among all the mostly repeated sequences. We performed an optimized contigsofadraftgenome.101genomespassedtheessential re-assembly procedure for each draft genome, in order to genes HMP criterion and the two additional criteria for Almeidaetal.BMCGenomics2014,15:1101 Page4of16 http://www.biomedcentral.com/1471-2164/15/1101 mis-assignment detection, indicating that 101 genomes are from the NCBI database (see Additional file 10: Table S5, almost complete with no mis-assignment evidence (see genome references for phylogeny sheet). The classifica- Additional file 4: Table S3, draft quality evaluation sheet). tions of the genomes sequenced in the present study are Amongthese,72passedallthequalitycriteriaandwerede- consistent with the NCBI reference genomes, further finedashighqualitydraftgenomes.Anexampleisthegen- confirmingthecorrectnessofourreconstruction.Insome ome of Jeotgalicoccus psychrophilus CRBM D2, which was cases, for example for Alkalibacterium kapii, Marinilacti- assembled in only 70 contigs and 20 scaffolds, and had a bacillus psychrotolerans and Luteococcus japonicus, only contig N50 size of 103 kb. Sixteen additional assemblies distantNCBIreferencegenomesareavailable,highlighting presented incomplete sets of the HMP essential genes cri- thecontributionofourstudy. terion but passed the two chimeric test criteria. For the 25 remaining draft genomes, 17 were too incomplete (<1 Mb Genomicsofcheesebacteria in contig cumulative size) and 8 did not pass one of the The high quality draft sequences can be used to perform two chimeric criteria. The genome of Arthrobacter arilai- comparative genomic studies aimed at understanding the tensisRe117 [20],whichwas usedin pool1as control for genetic underpinnings ofthe adaptation of bacteria tothe the procedure, passed all the HMP and chimeric criteria, food environment, e.g., through the characterization of and a comparison with the previously sequenced genome metabolic pathways. Here we compared the genomes of (Genbank project PRJNA53509) showed an average iden- strains from dairy and non-dairy environments for two tity of 99.98%, a completion level of 95.63% and the ab- genera. In a first example, we compared the genomic sence of improperly assigned contigs. The two plasmids sequences of the four Arthrobacter strains isolated from presentinthisbacteriumwerealsopartiallypresentinour cheese to that of 15 environmental isolates. Mostbacteria draft. The missing sequences corresponded mainly to of the genus Arthrobacter are isolated from environments transposase regions which were not assembled possibly such as soil, where they are considered to be ubiquitous duetotheassemblyandclusteringprocedurewhichoften [41]. Interestingly, the four cheese strains share several hasdifficultiesreconstructingrepeatedvariableregions.In properties that may be linked to adaptation to the cheese total 117 of the draft genome sequences (101 which habitat, such as a cluster of five genes involved in the ca- passed all quality controls and an Additional 16 that had tabolism of D-galactonate, as already described in Arthro- noevidence of chimeric contigs) were considered suitable bacter arilaitensis Re117 [20]. This gene cluster is absent for submission to public databases (see Additional file 1: from the genomes ofthe 15Arthrobacter strains of envir- TableS1).The25draftgenomeswithpoorqualityorpos- onmental origin for which a sequence is available (see sible contamination were not submitted to public data- Additionalfile11:TableS6).Ithasbeenhypothesizedthat bases, but were used in our project with caution for D-galactonate may be produced by yeasts from lactose phylogeneticanalyses. duringtheripeningofcheeses,andtheabilitytocatabolize From the 195 bacterial species or subspecies listed by this compound could thus be beneficial for Arthrobacter Bourdichon et al. [10] to occur infood products,only 80 strainsincheeses[20]. had at least one food isolate for which a genome se- Asasecondexample,wecomparedthegenomesoftwo quence was available in the NCBI database (NCBI May strains of Streptococcus infantarius isolated from Western 2014release,seeAdditionalfile5:TableS4).Thepresent African fermented milks, sequenced in this work (3AG study provides genome sequences for 78 additional dairy and 11FA), with those of the type strain isolated from in- isolates, which effectively doubles the number of avail- fant feces (ATCC BAA-102), and of strain CJ18, isolated able genome sequences of relevance to the study of from Eastern African fermented milk. The four strains fermenteddairyproducts. containeach1900–2000genesandshare1567genes.The Inordertobettercharacterizethediversityofthebacter- strainscouldbedividedintotwogroups,thetwoWestern ialstrainsstudiedhere,wereconstructedtheirphylogenetic Africanstrains, which displayed99.7% identityon average relationships.Forthatpurpose,thegenescorrespondingto within the shared genes (including 1206 fully identical the 40 phylogenetic protein markers proposed by Mende genes), and the infant feces and the Eastern African et al. [40] were extracted from the 117 high quality draft strains,whichdisplayed99.3%identity(including485fully genomes in order to build a phylogenetic tree (Figure 2). identical genes) (see Additional file 12: Table S7). The The tree shows that the selected bacterial isolates cover a strains of the two different groups displayed only 98.8% large biodiversity. Four other trees were constructed (see identityandfewerthan185fullyidenticalgenes, confirm- Additional file 6: Figure S1, Additional file 7: Figure S2, ing a clear separation between the Western African food Additionalfile8:FigureS3andAdditionalfile9:FigureS4) strainandthetwoothers.Furtherstudyofthegenecontent by inclusion of 328 genomic sequences from food- of the two Western African Streptococcus infantarius related bacteria or closely-related species (Bacteroidetes, strainsshowedthatthesestrainshadacquiredtheabilityto Firmicutes, Actinobacteria and Proteobacteria) extracted ferment lactose through the LacZS system, as previously Almeidaetal.BMCGenomics2014,15:1101 Page5of16 http://www.biomedcentral.com/1471-2164/15/1101 Figure2Globalphylogenyofthe117dairybacterialisolatessequencedinthepresentstudy.ThephylogenetictreeisanITOLcircular visualization[68]withthebranchlengthandthebootstrapvaluesdisplayed.Thetreeisbasedonaconcatenatedalignmentof40universal markerproteinfamilies[40].Onlygenomesequencesfromwhichaminimumof10markerscouldbeextractedandwhichhadnocontaminating sequencesevidencewereconsidered.ThegenomeofMethanobrevibactersmithiiATCC35061wasusedtorootthetree.Thecolorscorrespondto thedifferentphyla. described for Eastern African strain CJ18 [42]. However, niches occurred convergently and independently in these these genes are located within different regions of the strainsisolatedrespectivelyinEasternandWesternAfrica. chromosome and originate from a different donor. While lacZS may originate from S. thermophilus in the Eastern Applicationofthenewgenomiccatalogtothe African CJ18 strain [43], it has probably been acquired metagenomicanalysisofcheesemicrobiota from S. salivarius in the two Western Africa strains Metagenomic analyses based on sequence mapping on a (see Additional file 13: Figure S5). These data show that set of reference genomes can be used to identify and adaptation of S. infantarius to the dairy fermentation quantify genes and species [36]. To determine whether Almeidaetal.BMCGenomics2014,15:1101 Page6of16 http://www.biomedcentral.com/1471-2164/15/1101 the addition of the new genome sequences to the 5873 cheesesLandE,respectively).Adeeperinvestigationofthe publicly available genomes (bacteria, archaea, yeasts and cheese E good quality unmapped reads indicated that they moulds) could improve such analyses, we sequenced the maycorrespondto(i)strainspecificgenesbelongingtothe microbial communities from the rind of three cheeses pan-genome (including prophages and mobile elements) withprotected designation oforigin.Thesethree cheeses and/or missing regions of the draft genomes, (ii) microor- were made from cow’s milk and correspond to two soft ganisms for which genomes are still absent from the data- smear-ripened cheeses (E and L), and one blue-veined bases, (iii) distant genomic regions containing indels or cheese (G). Extracted DNA was sequenced by SOLiD more than 3 mismatches, which cannot be mapped with technology and reads were assigned to species by map- Bowtie.Lastly,~20%of“technicallygoodreads”onaverage ping them to the reference microbial genomes and to are inherently un-mappable due to the characteristics of thegenomeofBostaurus(seeMethodssection). the SOLiD technology (this number is estimated from an The sequencing of the three samples provided from 8.4 analysis of the unmapped good quality read percentage in to 15.5 million good quality reads (see Additional file 14: five different re-sequencing projects using the same Table S8). The percentage of good quality reads mapping sequencing and mapping techniques as in our paper, see to the microbial reference genomes varied from 46.1% Additional file 15: Table S9). The most prevalent micro- (cheeseL)to57.1%(cheeseE)(Figure3).Interestingly,the organisms detected in the three cheeses are presented reads that mapped only to the dairy genomes sequenced in Table 1 and a more detailed composition is shown in in the present study accounted for a large proportion of Additionalfile14:TableS8. the good quality reads (from 16.7% for cheese L to 23.7% Inthesmear-ripenedcheeseE,theArthrobacterarilai- for cheese G). In cheese G, 11.2% of the reads mapped to tensis GMPA29 reference genome was the most repre- the Bos taurus genome (compared to 0.5% and 0.1% for sented, as it corresponded to 19.8% of the total good Figure3MappingofthegoodqualityreadsfromthemetagenomicsequencingofDNAfromthesurfacesofthreecheeses.Thegood qualityreadscomingfrom3samplesofcheesesurfacewerealignedto5873genomescomingfromNCBIand117genomescomingfromour project.TherepartitionofthegoodqualityreadsthatmaponlyontheNCBIgenomes(blue),onthegenomesequencedinourproject(green), onbothNCBIandourgenome(lightgreen)andonBostaurusgenome(orange)ispresentedinpiecharts.Theunmappedgoodreadsare presentedindarkandlightgrey,respectivelythoselackingareferenceandthosepotentiallyunmappablefortechnicalreason. Table1Mostprevalentmicroorganismsdetectedbymetagenomicsequencingofthreecheesesurfacesamples hA ttplm Referencegenome N(eaw) Ccuolmtumreesrc(iba)l Nurmeabdesr(ocf) NumCDbeSr(odf) CCDuSmluelnagtetdh CDSCo(%ve)r(eed) sequenceColevnegretdh genMoemane readsM(a%p)p(egd) Sebqyuepnecrefescctomveartecdh ://weida w (%)(f) coverage reads(%)(h) w et CheeseE ArthrobacterarilaitensisGMPA29 1 1 2919327 3299 2875518 95.4 93.1 34.64 19.80 99.5 .biom al.BM PsychrobacterimmobilisPG1 1 0 1259254 2941 2824830 99.4 94.3 15.48 8.54 99.9 ed C c G VibriolitoralisB4 1 0 768119 3353 3158767 95.6 82.2 8.36 5.21 92.3 en en tra om Pseudoalteromonashaloplanktis 0 0 384444 3454 3327182 94.1 70.4 4.00 2.61 93.1 l.c ic TAC125 om2s GeotrichumcandidumCLIB918 0 1 378769 6925 10213537 98.3 38.6 1.30 2.57 95.5 /14014 7, 11 Halomonassp.1M45 1 0 221008 3526 3281784 95.6 77.1 2.31 1.50 93.5 -215:1 61 Lactococcuslactissubsp.lactis 0 1 105179 2088 1869362 95.3 54.9 1.80 0.71 96.5 40 Il1403 /151 /1 DebaryomyceshanseniiCBS767 0 1 77071 6295 9107395 91.7 16.6 0.30 1.27 85.6 1 0 1 CheeseL Pseudoalteromonashaloplanktis 0 0 1434355 3454 3327182 93.7 89.5 14.95 17.03 97.6 TAC125 Halomonassp.1M45 1 0 883652 3526 3281784 100.0 99.3 9.16 10.49 99.97 Psychrobacterceler91 1 0 146974 2584 2414775 95.6 76.5 2.08 1.74 91.5 Lactococcuslactissubsp.cremoris 0 1 143400 2470 1969568 92.5 35.4 0.82 1.70 98.1 A76 PenicilliumcamembertiFM013 0 1 83277 14611 22234696 85.1 10.8 0.13 0.99 97.7 VibriolitoralisB4 1 0 65937 3353 3158767 90.0 27.6 0.70 0.78 85.9 ProvidenciaheimbachaeGR4 1 0 53280 3824 3516737 83.6 26.7 0.51 0.63 55.3 GeotrichumcandidumCLIB918 0 1 45165 6925 10213537 95.0 12.7 0.15 0.54 95 CheeseG ArthrobacterbergereiCa106 1 1 2878361 3553 3110335 98.6 95.7 31.77 18.58 99.2 Lactobacillusdelbrueckiisubsp. 0 1 1179878 1508 1340406 98.7 95.4 29.44 7.62 99.8 bulgaricusATCC11842 PenicilliumcamembertiFM013 0 1 633123 14611 22234696 99.4 54.2 1.00 4.09 96.5 StreptococcusthermophilusLMG 0 1 597916 1827 1462709 98.3 86.1 13.62 3.86 99.4 18311 PenicilliumroquefortiFM164 0 1 254435 12630 23447373 98.5 26.3 0.38 1.64 98.4 Pseudoalteromonashaloplanktis 0 0 146457 3454 3327182 93.5 57.8 1.53 0.95 92.3 TAC125 P a g e 7 o f 1 6 Table1Mostprevalentmicroorganismsdetectedbymetagenomicsequencingofthreecheesesurfacesamples(Continued) hA ttplm DebaryomyceshanseniiCBS767 0 1 80179 6295 9107395 94.2 17.6 0.31 0.52 97.1 ://weida w PsychrobacteraquimarisER15174 1 0 74153 2830 2734881 93.7 41.7 0.87 0.48 81.5 w et (a)GenomesBseHqI7uencedinthepresentstudy(1)orfromtheNCBIdatabase(0). .biom al.BM (b)Speciesknowntobecomponentsofcheesemakingcommercialcultures. ed C c G (c)NumberofreadsmappedonCDSfromthereferencegenomewiththreeorlessmismatcheson35nucleotides. e e n n (d)NumberofCDSinthegenome.CDScorrespondingtoinsertionsequences,prophagesandpotentialrepeatedandtransferableelementswereremoved. tra om (e)PercentageofCDScoveredwithatleastoneread. l.c ic (f)Percentageofsequencecoveredbyatleastoneread(sequenceisrestrainedtotheselectedCDS). om2s (g)Numberofreadsalignedwiththisgenomedividedbythenumberofgoodqualityreads. /101 (h)Lengthofsequencecoveredbyperfectmatchreads(withnomismatchonthe35ntlengthalignment)dividedbythelengthofthesequencecoveredbyreads. 474, Foreachcheese,thedatapresentedcorrespondtotheeightreferencegenomeswiththehighestnumbersofmappedreads. 1-2115:1 61 40 /11 5 /1 1 0 1 P a g e 8 o f 1 6 Almeidaetal.BMCGenomics2014,15:1101 Page9of16 http://www.biomedcentral.com/1471-2164/15/1101 quality reads, followed by Psychrobacter immobilis PG1 aquimaris, Brachybacterium tyrofermentans, Corynebac- and Vibrio litoralis B4, with 8.5 and 5.2% of the reads, terium ammoniagenes, Brevibacterium antiquum, Micro- respectively. Furthermore, between 95.4 and 99.4% of bacterium gubbeenense, Brochothrix thermosphacta and their coding sequences were detected, with a high level Marinilactibacillus psychrotolerans, were also present in of coverage (from 8.4X to 34.6X). The genomes of the cheeses (>80% perfect match reads, see Additional Arthrobacter arilaitensis GMPA29 and Psychrobacter file 14: Table S8). Interestingly, among the eight most immobilis PG1 are almost entirely covered (from 99.5 to prevalent microorganisms detected in each cheese by 99.9% of the reference size), showing that the detected metagenomic analysis, two (cheese G), four (cheese E) or strains were closely related to the reference strains. The five(cheeseL)correspondedtospeciesorgeneraofgram- reference strain Psychrobacter immobilis PG1 was isolated negative bacteria which are not known components of from the dairy plant that produces the smear-ripened cheesemakingcommercialcultures(Table1). cheese E, but two years earlier. The high proportion of perfect matches with reference strain PG1 may thus be Discussion explained by the presence of an offspring of this strain in In the present work, we produced 137 draft genomes iso- cheeseE.Manyreadswerealsoassignedtothegenomesof latedfromdairyproducts,whichalmostdoubledthenum- the yeasts Geotrichum candidum CLIB 918 and Debaryo- ber of different species isolated from fermented dairy myceshanseniiCBS767. products. This genome catalog was realized using a low- Inthesecondsmear-ripenedcheese(cheeseL),Pseudoal- cost library sequencing strategy based on a combinatorial teromonas haloplanktis TAC125, Halomonas sp. 1 M45 pooling approach in which a reduced number of DNA and Psychrobacter celer 91 were the three dominant refer- pools weresequenced.Pooling strategieshavebeen previ- encebacteria,with17.0%,10.5%and1.7%ofthegoodqual- ouslybeen usedtoreduce costsfor BAC sequencing [44]. ity matches, respectively. In this cheese sample, the Here we rely on a co-abundance clustering approach that sequences of the reads mapped to Halomonas sp. 1 M45 we developed for reconstructing genomes directly from were essentially perfect matches (>99.9% of covered posi- metagenomic samples [35]. In the present pooled ap- tions with 99.9% perfect match reads, coverage of 9.2X). proach, only 11 libraries were required to produce 150 These data suggest that the Halomonas strains present in draftgenomes,leadingtoacostof~200USDpergenome the cheese sample are almost identical to the reference as opposed to ~500 USD if each genome were sequenced strain. However, even though the reference strain 1 M45 separately(usingbestcommercialoffersavailablein2011). has also been isolated from a smear-ripened cheese of the Today,thispricedifferentialmaybeevenhigheraslibrary sameprotecteddesignationoforigin,itoriginatedfroman- constructioncostshavenotdecreasedasmuchassequen- other manufacturing plant. More than fifty thousand reads cingcosts.Thiscostsavingscomeswithsomelimitations. were assigned to Providencia heimbachae GR4. However, First, we suggest that only distant genomes (i.e. from only 55.3% of covered positions of this reference strain are different genera at least) should be mixed and sequenced without mismatch, which indicates that the strain present together to optimize the assembly and clustering steps. in the cheese sample is not closely related to the reference Second,severalgenomeswerepoorlysequenced,however strain, and may even correspond to another species. Sur- mostofthemwerehighGC%draft genome, knowntobe prisingly, more than 80 thousand reads (~1% of the total difficulttosequenceusingIlluminasequencing[45].Also, good quality reads in cheese L) mapped to the Penicillium about 4% of the total sequence length could not be camemberti FM 013 genome, with 97.7% of perfect match assigned to an individual strain and we estimated that on reads, even though this species is not known to occur in average 2% of a genome’s sequence may be mis-assigned smear-ripened cheeses. One may hypothesize that this duetolimitationsoftheclusteringapproach.However,an couldresultfromcross-contaminationduetothemanufac- examination of unassigned fragments>2 kb showed that turingofmould-ripenedcheeseinthesameplant. theycorrespondmainlytomobileelements(plasmidsand The surface of the blue-veined cheese G was dominated phages) while genome data curation showed that poten- by a strain close to Arthrobacter bergerei Ca106 (18.6% of tiallymis-assignedfragmentsaregenerally<1kb,andpri- the reads, 99.2% perfect match reads). Like for the two marilyimpactgenomesoflowerquality.Thisfactprompts othercheeses,Psychrobacter speciesseem to bepresent in us to suggest that the use of these drafts for comparative this cheese. Cheese G was probably manufactured with a genomics should be restricted to high quality draft thermophiliclacticstarterculture,sinceStreptococcusther- genomesandtogenes presentin longcontigsor scaffolds mophilus and Lactobacillus delbrueckii species were the (i.e. bigger than 1 kb). Lastly, the bioinformatics analysis dominant lactic acid bacteria, in contrast to the two other pipeline is more complex to set up than simple assembly cheeses,inwhichLactococcuslactiswasthedominantlac- procedure in single genome sequencing. Despite these ticacidbacterium.Strainsrelatedtootherreferencestrains limitations,117ofthe142sequencedgenomesresultedin sequenced in the present study, such as Psychrobacter goodorhighqualitydraftgenomessuitableforsubmission Almeidaetal.BMCGenomics2014,15:1101 Page10of16 http://www.biomedcentral.com/1471-2164/15/1101 to public database. Some of the remaining 25 genomes which had been previously detected in such cheeses maystillbeusefulasreferencesformetagenomicanalyses. [1,2,9,12,13,28,49-53], and also in a recent large amplicon The microbial composition of the surfaces of three sequencing studyof themicrobial composition of 137dif- cheeseswasinvestigatedbyhighthroughputmetagenomic ferent cheese rinds [33]. They may originate from the en- profiling.Asall theDNApresentinthe cheesesamplesis vironment of cheese manufacturing (brine, tools, surfaces sequenced, the high throughput sequencing may detect ofshelves…),andtheirhighabundancesuggeststhatthey any type of DNA (bacteria, archaea, eukaryotes and mayhaveanimpactonthepropertiesofthefinalproduct. viruses), provided that adequate references are used. As the analysed cheeses were marketed and were of very Eukaryotes,suchasGeotrichumcandidum,Debaryomyces good quality, these microorganisms cannot be considered hansenii, and Penicillium roqueforti, were found, which hereasspoilers. was not surprising, as these fungi are frequently used in cheesemaking. Interestingly, many reads from cheese G Conclusion mapped on the Bos taurus genome. We hypothesize that Thegenomessequencedinthepresentstudyconsiderably thisisduetothepresenceofcowsomaticcellsinthemilk increased the numbers of mapped reads, although a used for the manufacturing of cheese G. The impact of significantproportionofthemetagenomicreadsremained milk somatic cells on the ripening of cheeses has been unassigned (~20% once taken into account unmapped shown in several studies as associated to the flock reads inherent to SOLiD technology). These data indicate health[46,47]. that even if more than 6000 genomes are currently avail- Shotgun sequencing allows a relative quantification of ableinpublicdatabases(includingthegenomeswegener- DNA molecules present in a sample, based on counting atedhere),additionalmicroorganismsfoundintraditional the number of reads mapped to each member of the cheeses are still missing from this reference collection. community. High throughput sequencing allows higher Furtherstudiesarenecessarytocompletethisreferencein resolution quantitation and we have shown that we can ordertoprovideacompleteviewofthecheeseecosystem. recover even fairly minor taxonomi groups, such as the Tooursurprise,collectingthepresentreferencestrainset Leuconostoc genus (see Additional file 14: Table S8), constituted a laborious work, since strains corresponding known to be part of the minority population in cheeses to non-starter species are frequently not conserved once [48]. However, additional experiments may be needed to described. We anticipate that the results of this work will validate the identification and quantitation of low abun- motivate isolation and conservation of new reference dance populations. strains,aswellasindependentisolatesofthesamespecies Presence or absence of complete set of genes or of to support safety assessment, establish biodiversity re- specific genes, and their level of sequence homology al- sourceandstrainspecificityinproducts.Directsequencing lows also confirming characteristics of particular strains. and assembly as performed for the human microbiome Forexample,cheesesEandLmetagenomicprofilesindi- could also provide new potential references [36], although cated the presence of strains closely related, respectively, thisprocedureissignificantlymoreexpensive.Insummary, to Psychrobacter immobilis PG1 and Halomonas sp. the present study considerably extended the effectiveness 1 M45 coming from our catalog. Since these reference of shotgun metagenomic analysis of cheese microbiota. strains were isolated in the same cheese factory several Even if such analyses require generating and computing years earlier for the former and in the same DOP from large amounts of sequencing data, the technologies are another factory for the later, our analysis would reflect evolving rapidly and one mayanticipate that in the future, the setting up of strains sharing common origins with they will become routine in the investigation of food the references in these cheeses. Metagenomic profiling microbiota. provides thus new perspectives to study cheese ecology by tracing genomes or genes, which should allow point- Methods ing out particular strains (e.g. starters, potential “terroir” Bacterialisolatesandgrowthconditions or regional strains, contaminants…), and following their The bacterial isolates working collection was composed disseminationanddevelopment duringcheeseprocesses. of 142 isolates originating from milk, fermented milks, The metagenomic profiling of the surfaces of the and cheeses, and five food isolates that were not of dairy three cheeses confirmed that microorganisms that are origin (seeAdditionalfile 1:TableS1). not deliberately inoculated constitute a large part of the microbiota, appearing among the few dominant species DNAextractionfromliquidculturesofbacterialisolates in cheese rind. For example, they are predominant in Aftercultivation,bacterialcellswereharvestedbycentri- cheese L. Several of the corresponding microorganisms, fugation for 10 min at 12,000 × g and approximately such as Pseudoalteromonas, Halomonas, Vibrio, Marini- 100 mg of cell pellet were suspended in 400 μl of buffer lactibacillusandPsychrobacterareGram-negativebacteria (0.4 M NaCl, 2 mM EDTA, 10 mM Tris–HCl, pH 8).
Description: