Liuetal.BMCGenomics2014,15:685 http://www.biomedcentral.com/1471-2164/15/685 RESEARCH ARTICLE Open Access Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing Yu Liu1*, Mehmet Koyutürk1,2,3, Sean Maxwell1, Min Xiang4,8, Martina Veigl3,6, Richard S Cooper9, Bamidele O Tayo9, Li Li3,5, Thomas LaFramboise3,4, Zhenghe Wang3,4, Xiaofeng Zhu7 and Mark R Chance1,3,4* Abstract Background: Sequences up to several megabases in length have been found to be present inindividual genomes butabsentin thehumanreference genome. These sequences may be common in populations,and their absence inthe reference genome may indicate rare variantsin thegenomes ofindividuals who served as donorsfor the human genome project. As the reference genome is used in probe design for microarray technology and mapping shortreadsin next generation sequencing (NGS), this missing sequencecould be a source of biasin functional genomic studies and variantanalysis. One End Anchor (OEA) and/or orphan readsfrom paired-end sequencing have been used to identify novelsequences thatare absent inreference genome. However, there is nostudy to investigate the distribution, evolution and functionality ofthosesequences in human populations. Results: Tosystematically identify and study the missing common sequences (micSeqs), we extended theprevious method by pooling OEA reads from large number ofindividuals and applying strict filtering methodsto remove false sequences. The pipeline was applied to data from phase 1 ofthe 1000 Genomes Project. We identified 309 micSeqs that are present inat least 1% of thehumanpopulation,but absent inthe reference genome. We confirmed 76% of these 309micSeqs by comparisonto other primate genomes, individualhumangenomes, and gene expression data. Furthermore, werandomly selected fifteenmicSeqs and confirmed theirpresence using PCR validation in38 additional individuals.Functional analysis using publishedRNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed inhuman brain and three micSeqs contain transcription factor(TF) binding regions, suggestingtheyare functional elements. Inaddition,theidentified micSeqs are absent innon-primates and show dynamic acquisition during primate evolution culminating with most micSeqs being present inAfricans, suggesting some micSeqs maybe important sources of human diversity. Conclusions: 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF bindingregions.Some micSeqs are primate-specific, conserved and may play a role intheevolution of primates. Keywords: Missing common sequence, De novo assembling, Next generationsequencing, Expression inbrain, Transcription factor binding, Genome evolution *Correspondence:[email protected];[email protected] 1CenterforProteomicsandBioinformatics,CaseWesternReserveUniversity, Cleveland,OH,USA 3CaseComprehensiveCancerCenter,CaseWesternReserveUniversity, Cleveland,OH,USA Fulllistofauthorinformationisavailableattheendofthearticle ©2014Liuetal.;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative CommonsAttributionLicense(http://creativecommons.org/licenses/by/4.0),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginalworkisproperlycredited.TheCreativeCommonsPublicDomain Dedicationwaiver(http://creativecommons.org/publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle, unlessotherwisestated. Liuetal.BMCGenomics2014,15:685 Page2of14 http://www.biomedcentral.com/1471-2164/15/685 Background requireeitherintensivelabororthecostisstillprohibitive The identification and characterization of all common fordeepsequencingofalargenumberofindividuals. sequence variants in the human genome has trans- We have developed a method to identify novel se- formed our understanding of segregating diversity, po- quences in individuals using One End Anchor (OEA) pulation genetics, and disease susceptibility. SNPs have reads from paired-end sequencing where one end of the traditionally been thought to be the dominant source of pair can be uniquely mapped to the reference genome sequence variation although other sequence variants, (termed“anchorread”)whiletheothercannotbemapped such as sequence duplication and deletion recognized at all (termed “orphan read”) [11]. A similar method was microscopically [1], were described decades ago. Signifi- developed for the detection of virus sequences [15]. We cant efforts have been made to document all common extend this strategy by pooling shallow sequencing data SNPsinthehumangenomeusingtraditionaltechnologies (e.g.,with4×coverage)fromalargenumberofindividuals [2,3]. In the past decade, the completion of the hu- to systematically identify sequences common in the hu- manreferencegenomeanddevelopmentof highthrough- man population that are absent in the reference genome. put technologies, such as microarray and next generation Forthispurpose,weappliedthismethodongenomicdata sequencing (NGS), have revolutionized methods for the from hundreds of individuals generated from Phase 1 of efficient assessment of genome composition in human the1000GenomesProject[7].Ourpipelineidentifiesmic- populations. Morethan 35 million SNPs have been docu- Seqs by pooling orphan reads from 363 individuals of mentedbasedontheanalysisofdatafromthelatesttech- African, Asian, and European origin followed by de novo nologies [4-8]. However, genome surveys also revealed an assemblyoftheseorphanreadstogeneratecandidatemic- unexpectedly large extent of other categories of sequence Seqs. We used various strategies to filter out false mic- variants, including copy-number variants (CNV), inver- Seqs, and systematically validated the remaining micSeqs sions, translocations, and small to large sequence inser- throughcomparativegenomicanalysis.Direct experimen- tionsanddeletions[7]. tal examination of randomly selected fifteen micSeqs on The human reference genome played a significant role 38 additional samples confirmed that all fifteen micSeqs in the detection of sequence variants, because it was ex- werepresentinatleastoneindividual.Additionalanalysis tensively used for probe design and array generation and suggestedmanymicSeqsarefunctionalgenomicelements mapping short reads in NGS. Given that we now know involvedinbiologicalfunctions. that sequence insertions and deletions are common, and considering that around 80% of the reference genome is Results derivedfrom a singleindividual [9,10], itis reasonableto IdentificationofmicSeqbypoolingsamplesfrom1K expect that many common sequences, i.e. those present GenomeProject in at least 1% of the population, may be absent in the Our pipeline to detect micSeqs, contains four steps reference data due to missing copies in the few individ- (summarized in Figure 1): 1) Data collection and qua- uals that were studied. Recently, multiple studies, using lity control; 2) Identification of One End Anchor reads both NGS and traditional capillary sequencing, have re- (Figure2)anddenovoassemblyofmicSeqsusingpooled ported novel sequences that are absent in the reference OEA reads from 363 individuals. Pooling permits the genome, but are present in at least a few individuals use of shallow sequence data and the identification of [11-14]. For example, using capillary sequencing tech- commonsequencesisdependentonastringentcoverage nology, a recent study identified more than 2,363 novel cutoff for de novo assembly, in this case at least 50× sequences (in total, more than 2 Mb) in the genomes of coverage; 3) Identification and filtering of false micSeqs nine individuals [11].Manyoftheseappearinmorethan based on the genomic location of anchor reads corre- one individual, implying that these sequences may be sponding to the orphan reads that are assembled to- common in human populations. Furthermore, de novo gether (Figure 2); 4) Validation of filtered micSeqs via assembly of several individual genomes that were deeply comparative genomic analysis on a database of vertebrate sequenced using NGS also revealed novel sequences genomes, comparison to the deeply sequenced genomes [12,13]. Based on these findings, the amount ofnovelse- of four individuals, and comparison to RNA-Seq data, as quences that are not present in the reference genome wellasdirectexperimentalexaminationofmicSeqs. was estimated to be around 20–40 Mb for each individ- The pipeline was applied to paired-end sequencing ual [12]. Comprehensive discovery of all common se- data for 363 individuals of African (YRI, 88 individuals), quences that are absent in the reference genome would Asian (CHB, 88, and JPT, 85) and European (CEU, 102) require sequencing of a large number of individual ge- origin obtained from the Phase 1 release of the 1 K Ge- nomes that are representative of the human population. nomes Project. We identified 309 micSeqs with a length Currently, neither capillary sequencing nor deep sequen- of at least 100 bp. In total, the 309 micSeqs contained cingusingNGSaresuitableforthesepurposes,sincethey 68,424 bp (micSeq sequences along with predicted Liuetal.BMCGenomics2014,15:685 Page3of14 http://www.biomedcentral.com/1471-2164/15/685 Figure1SummaryofpipelinefordiscoveryofmicSeqs.Thepipelinehasfoursteps:1)datacollectionandqualitycontrol;2)OEA identificationanddenovoassembly;3)filteringfalsemicSeqs;and4)validationbycomparativegenomics. locations, neighboring genes, supporting evidence and (Figure 2). The location information is also valuable for other information are available in http://proteomics.case. further annotation and validation using Sanger sequen- edu/micseqdb/).ThelongestmicSeqhad1,494bpandthe cing, for example, for primer design. Finally, our results meanlengthis221bp(Figure3a).Over70%ofindividual further contribute to the literature by validating and in- micSeqsarepresentinatleast5%ofthepopulationstud- vestigating the conservation and evolution of the identi- ied(≥18individuals,Figure3b),andonaverageeachindi- fied sequences using comparative genomics approach, as vidual has 50 micSeqs (Figure 3c) comprised of 5Kb or describedbelow. more sequences that are absent in reference genome. We investigated the location of micSeqs in chromosomes ValidationofmicSeqbycomparativegenomicsand basedontheiranchorposition,andfoundthatthesemic- Sangersequencingonadditionalsamples Seqs are relatively uniformly distributed across the gen- Comparison of micSeqs with genome sequences from ome(Figure3d). additional individuals provides validation for many mic- Previous methods have used One End Anchor (OEA) Seqs, while comparison with genomes from other verte- and orphan reads to identify novel sequences present in brate species can evaluate whether they are conserved individuals using deep sequenced genomes [11,14]. In during evolution. We compared the micSeqs with gen- contrast to these studies that focus on individual ge- ome sequences from four individuals (two European, nomes, we here focus on the population by deriving one Chinese, one African), particularly focusing on the strength from pooled reads, i.e., our objective is to iden- sequences that cannot be mapped to the reference tify sequences that are novel and relatively common [12,16,17]. Three micSeqs (with 99% sequence similarity) in human populations. Currently, the number of deep- arepresentinallfourindividuals,while118micSeqswere sequenced genomes is still limited. To take advantage of foundtobepresentinatleastoneofthefour.Moremic- availability of large number of shallow-sequenced ge- Seqs(80)weredetected in J.CraigVenter’sgenome com- nomes, we pooled OEA reads from different individuals pared to other individual genomes (32 micSeqs for the to identify micSeqs. The pooled nature of the reads we African individual, 22 for James Watson, and 25 for the utilized also requires methodological improvements as Chineseindividual).Venter’sgenomewassequencedusing compared to existing methods. Namely, since the num- capillary sequencing [16], previous studies have indicated ber of pooled reads is very large, our method utilizes the that novel sequences were often partially sequenced or position of the corresponding anchor reads to improve weremissingaltogetheringenomessequencedusingNGS computational efficiency by filtering out false sequences technologies[11].SevenofthemicSeqshadhighsequence Liuetal.BMCGenomics2014,15:685 Page4of14 http://www.biomedcentral.com/1471-2164/15/685 Figure2OneEndAnchorreads.Duetothelimitedinsertsize,orphanreadsareintheneighborhoodofanchorreadsinpairedendsequencing, thus,thelocationofanchorreadscanbeusedtofilteroutthefalsesequencesandestimatethelocationoftruesequences.Notethatfullorphan readscouldbeusedtohelpdetectlongerinsertionsequences.a)IllustrationofOEAreads.b)theanchorreadscorrespondingtoonemicSeq candidatearedistributedalongthechromosomewithverymodestcoherenceinonelocation,itisverylikelythisisafalsemicSeq.CandidatemicSqs ofthistype(>50%ofreadsdistributedacrossthechromosome)arefilteredout;c)theanchorreadsarehighlyclusteredinoneshortregion,theseare broughtforwardforvalidationanalysis. similaritywithnovelinsertionspreviouslydetectedinnine withgenomesfromothervertebrates,noneofthemicSeqs individuals [11]. This modest overlap suggests that the had homologs in genomes other than those of primates previous study identified many novel sequences that are (viz, chimpanzee, gorilla, orangutan, gibbon, rhesus, and more rare than those identified here. When compared marmoset). We were able to find potential homologs for Liuetal.BMCGenomics2014,15:685 Page5of14 http://www.biomedcentral.com/1471-2164/15/685 Figure3CharacteristicsofmicSeqs.a)thelengthdistributionof309micSeqs,theaveragelengthof309micSeqsis221bp;b)thedistribution ofmicSeqfrequencyinthehumansamplerepresentedinthe1KGenomeprojectdata,theindividualsareconsideredtocontainthemicSeqif theycontributeatleastthreeorphanreads;70%ofmicSeqshavehigherthan5%frequency,i.e.presentinatleast18individuals;c)thedistribution ofnumberofmicSeqsineachindividual,onaverageeachindividualhas50micSeqs;d)thedistributionofmicSeqsalonghumangenome;e)the distributionofmicSeqlocationwithrespecttoRefSeqgenes. 45% of micSeqs (139 micSeqs, Table 1) in at least one Among 139 micSeqs conserved in primates, fifteen primate with sequence similarity higher than 90%. 106 micSeqs were randomly selected for experimental valid- micSeqs had homologs in at least three primates, indicat- ation using PCR on an additional 38 samples from in- ingthattheyareconservedduringevolutionandmayhave dividuals with African (12 samples), Asian (12 samples) functional roles. Data for micSeq30, including the homo- and European (14 samples) ancestry. The primers used logs in primates, the population distribution, predicted can be found in (Additional file 1: Table S1) along with andvalidatedsequences(seebelow),areshowninFigure4. the targeted micSeqs. The PCR and sequences for The conservation of micSeqs in multiple primate species micSeq30 were shown in Figure 4, which clearly shows provides an opportunity to locate their exact position in the presence of micSeq30 in most of African individuals the human genome at the resolution of single nucleotide while it is absent in the other populations. Furthermore, (Figure 4). For micSeqs without homologs in other spe- theDNAsequencesfromSangersequencingconfirmsour cies, the location of corresponding anchor reads can pro- prediction for micSeq30 in terms of both its sequence vide reliable position information with 1Kb resolution and position. We also selected a few bands that don’t (Figure 2a). The estimated locations can be found at the contain micSeqs (the lower bands) for Sanger sequen- onlinedatabase:http://proteomics.case.edu/micseqdb. cing to verify the absence of micSeqs. One example Table1NumberofmicSeqhomologsinotherprimates Primates Chimp(panTro3) Gorilla(gorGor3) Orangutan(panAbe2) Gibbon(nomLeu) Rhesus(rheMac2) Marmoset(calJac3) Homologs 126 102 103 92 59 15 Noneof309micSeqshashomologotherthanprimates.Incontrast,139micSeqshaveatleastonehomologinotherprimates.Notethatthenumberofhomolog inoneprimateagreeswiththedistancetohuman,i.e.,morehomologpresentinspeciesclosetohuman. Liuetal.BMCGenomics2014,15:685 Page6of14 http://www.biomedcentral.com/1471-2164/15/685 Figure4PCRofindividualsandevolutionaryconservationofmicSeq30.a)PCRresultsformicSeq30forsamplesfromthreepopulations usedforvalidation.TheallelefrequencyclearlyshowspopulationstratificationforthismicSeq:withhigherfrequencyinAfrican,andmuchlower inothers,p-value<10−5forAfricanvsnon-African(Table1).b)SangersequencingresultsforPCRamplifiedDNA,predictedsequenceofmicSeq30, andhomologsfromotherprimates.Thegenomiccoordinatesforotherprimatesare:chimp:chrX:132,861,450-132,861,750;gorilla:chrX:129,774, 750-129,775,050;orangutan:chrX:131,684,602-131,684,902;andrhesus:chrX:130,465,777-130,466,071.Theredarrowsshowthecorresponding coordinatesinhumanreferencegenome(hg19).TheinsertedpositionofmicSeq30inthereferencegenomeis:chrX:131,393,592,and,the correspondinganchorreadsformicSeq30locateintheregionofchrX:131,393,154-131,393,947. was included in (Additional file 1: Figure S1 and S2). All in different populations estimated from 1 K genome data 15 micSeqs were detected in at least one individual, i.e., followsimilartrends. bands of appropriate size are observed in DNA blots In summary, 236 micSeqs out of 309 that were identi- (Table 2). Nine out of 15 micSeqs showed extensive po- fied (76%) were supported by independent evidence for pulation stratification with a significantly higher allele their presence in humans. The independent evidence frequency in African vs. non-African individuals, con- included detection in individual genomes, existence of sistentwithpreviousstudies[11].ThemicSeqfrequencies homologs in other primate genomes, or support from Table2FrequencyofmicSeqsfromvalidationsamples micSeqID Freqfromvalidation Freqfromvalidation Freqfromvalidation P-valueforAfrican dataofEuropesamples dataofAsiansamples dataofAfricasamples vsnon-African micSeq11220 0.46 0.54 0.63 0.33 micSeq13298 0.11 0.08 0.29 0.043 micSeq16601 0.07 0 0.13 0.31 micSeq18764 0.11 0.08 0.38 0.0021 micSeq11710 0 0.25 0.33 0.0030 micSeq143 0 0.92 0.33 0.61 micSeq1824 0.14 0.29 0.13 0.53 micSeq2373 0 0.08 0.33 0.0010 micSeq2382 0.32 0.42 0.50 0.31 micSeq30 0.25 0.04 0.79 0 micSeq71 0.29 0.29 0.58 0.022 micSeq634 0.14 0 0 0.30 micSeq9196 0.29 0.29 0.92 0 micSeq349 0 0.58 0.75 0.00013 micSeq28 0.54 0.13 0.67 0.013 Thesamplesizesare14fortheEuropeangroup,12fortheAfricangroup,and12fortheAsiangroup.ForeachmicSeq,frequencyinitalicfontindicatesthe maximumfrequency.12outof15micSeqsshowhigherfrequencyintheAfricanpopulationthantheotherpopulations;9ofthemarestatisticallysignificantat the5%levelasindicatedbyboldfont. Liuetal.BMCGenomics2014,15:685 Page7of14 http://www.biomedcentral.com/1471-2164/15/685 expression arrays in human brain tissue (see below), that they are relatively highly expressed. For example, in as summarized in Figure 5. All of 15 selected micSeqs eachofsix samples (SRR090440, SRR111935,SRR112600, have been confirmed using PCR experiments in 38 ad- ERR030890, SRR111936 and SRR309138), there are more ditionalindividualsfromAfrica,AsiaandEurope. than 100 reads mapped to micSeq3, three of which are shown in Figure 6a. Note that several positions show dif- SomemicSeqsarehighlyexpressedinhumanbrain ferences between assembled reads with micSeq3; the rea- Potential gene associations of micSeqs were established sonmightbeSNPorRNAeditingsinceeditingeventsare by comparing their genomic locations derived from the prevalent in the brain [25]. The read depths were up to Anchor Reads with those of genes obtained from RefSeq 125 in one sample (SRR090440). We further identified database. The results are summarized in Figure 1e; 214 oneESTsequencefrombrainindbEST(BF687531.1)that micSeqs locate in intergenic region (among them, 40 hassignificanthighsimilaritywithmicSeq3(thesequence micSeqs located in the 5 Kb region of either side of an- alignment can be found in Additional file 1: Figure S3), notated genes) while 95 are seen in the UTR or intron which further confirms the existence and expression regions of RefSeq genes. Only one micSeq locates in of micSeq3. MicSeq3 has a length of 265 nt, and is lo- the coding region (micSeq2153 in exon of MUC6). cated in the 3′UTR of gene PABPC1 (chr8:101,715,280- The RefSeq genes associated with micSeq can be found 101,715,513), which codes a poly (A) binding protein. in the online database (see: http://proteomics.case.edu/ The rate of expression of this micSeq in some individ- micseqdb/). uals suggests that it might have a regulatory function or The complete transcriptome of the human brain con- helptocodeanalternativeexonforPABPC1.Moreover, tributed to our understanding of the development and micSeq3hashomologoussequenceswithhighsequence function of different brain regions and functional cir- similarity identified in four other primates: chimpanzee cuits. We further investigated whether the identified (sequence similarity, 99.2%), gorilla (98.9%), orangutan micSeqs were expressed in the brain. For this purpose, (96.6%) and gibbon (94.3%), providing further evidence we collected and analyzed 59 RNA-seq data sets of sam- foritsimportance(Figure6b).However,furtherdetailed ples from various brain regions [18-24]. 11 micSeqs have analysis are needed to determinetheexactfunction: e.g. more than 100 reads in individual samples, suggesting if they are missing exons or noncoding RNAs, and if Figure5VenndiagramofmicSeqvalidationanalyses.TheoverlapofmicSeqsobservedbehomologoustoprimatesequences,andinhigh coveragedatafromfourindividualgenomes,andinmRNAexpressiondataofhumanbrainsamples.Intotal,236micSeqsor76%have independentevidencefortheirpresenceinhumangenomesequences. Liuetal.BMCGenomics2014,15:685 Page8of14 http://www.biomedcentral.com/1471-2164/15/685 Figure6ExampleofexpressedmicSeqsdetectedfromRNA-seqdata.a)ReadsalignedtomicSeq3.Notethatthereadsdepthcanbeupto morethan100,indictingitismoderatelyexpressed.ThecolorcolumnsindicatethatthereadshavedifferentnucleotideswithmicSeq3probably becauseofSNPsorRNAediting.b)HomologsofmicSeqswithhighsequencesimilaritycanbeidentifiedinotherprimates. thesemicSeqsareexpressedinothertissuesorarebrain binding regions we analyzed the whole ChIP-seq data specific. set from the ENCODE project (~1900 datasets). Our analysis provided strong evidence that three micSeqs ThreemicSeqscontainingTF-bindingregions (micSeq8738, micSeq292, and micSeq9698) are likely Recently,theENCODEprojectpublishedaseriesofcom- to bind to TFs, such as POL2, STAT3, FOXA1. The de- prehensive studies to uncover the functional elements in tailed results can befound inSupplemental Material in- the human genome, including analysis of ChIP-seq data cluding the number of reads mapped to micSeqs, the for201transcriptionfactors(TF)in147celltypes[26,27]. TFs and SRA ids. Figure 7 shows that more than 100 To investigate if the identified micSeqs contain TF- reads from three ChIP-seq datasets were mapped to Figure7TFbindingregioninmicSeq8738.ReadsfromChIP-seqcanbemappedtopartofmicSeq8738,indicatingmicSeq8738containsTF bindingregion.ReadsfromthreeChIP-seqexperimentsareshowedinthisfigure.NotethatmicSeq8738mighthaveonlypartofTFbinding regionduetothehighcutoffthreshold(50coverage). Liuetal.BMCGenomics2014,15:685 Page9of14 http://www.biomedcentral.com/1471-2164/15/685 micSeq8738, and the highest coverage was 61×. How- factors. Several large-scale projects, such as HapMap ever, as mentioned above, due to the stringent criteria and 1 K Genome Project [6,7], have made significant to assemble micSeqs (at least 50×) and lower coverage progress toward documentation of all common variants ofNGSdatafromthe1Kgenomeproject,mostofmic- in several human populations, and laid the basis for the Seqs are short (average of 221 nt), and are likely to be success of current generation of genome wide associa- only part of longer sequences in humans. The ChIP-seq tionstudies(GWAS)[28].Althoughthe‘commonvariant- reads cluster in the end of micSeq8738 (Figure 7), sug- common diseases’ hypothesis was not confirmed by gesting the possibility that micSeq8738 contains only GWAS [29,30], and the basis of the so-called ‘missing partoftheTF-bindingregion. heritability’ problem is still not well understood [31,32], Due to potential concerns of sequencing bias in ChIP- many common variants with small effects and many rare seq experiments, comparison with appropriate control variants with large effects need to be identified and data (ChIP-seq experiments using so-called “input investigated [30]. DNA”, non-ChIP genomic DNA) is critical to filter out MicSeq is a special form of an indel, in that these se- false binding sites. For each of micSeq8738, micSeq292, quences are absent in the reference genome. Detecting and micSeq9698, there were very limited reads from intermediate length to large indels (from 20 nt to several control experiments that could be mapped to them - kb) is challenging with the current high-throughput less than 10 reads in the vast majority of samples technologies, and sequences absent in the reference gen- (see Additional file 2). For example, in the case of ome are not targeted in virtually all genotyping projects micSeq8738, on average, 72 reads map to ChIP-seq [33]. De novo assembly of individual genomes followed experiments for 15 TFs on HeLa-S3 cells, compared with by comparison with the reference sequence was a key only 7 reads to the 14 control samples. Both micSeq8738 method used to discover such sequences. As noted, pre- and micSeq292 are located in the neighborhood of non- vious analyses of individual genomes deeply sequenced coding RNAs; micSeq8738 is 1.5 kb away from one using NGS have identified nucleotide patterns present in lincRNA, AC144831.1, while miRNA1268A is very close specific individuals but absent in the reference genome to micSeq292 (1 kb away). These micSeqs might regulate [12-14,16,17]. However, this approach is still prohibitive expressionofthesenoncodingRNAsthroughTF-binding. asamethodtoidentifycommonvariantsduetothehigh Moreover, these micSeqsoccur at a highfrequency in the costfordeepsequencingofmanyindividualsandassem- population; more than half of 363 individuals contribute bling a large number of short reads. Our approach takes atleastonereadtoassemblethem,andonethirdcontrib- advantage of One End Anchor reads from paired end ute at least two reads, implying that their deletion in the reads, and assembles micSeqs by pooling orphan reads referencegenomecouldbeconsideredararevariant. oflowercoverageNGSdata(4×onaverage)fromalarge number of individuals. Validation by comparative gen- Comparisonwiththelatestreferencegenome omics and PCR experiments on additional samples dem- In December 24, 2013, the Genome Reference Consor- onstrate that our method is accurate and effective in tium announced the public release of GRCh38, the latest detecting micSeqs. However, the high-coverage thresh- version of the human reference genome assembly, and old used in the de novo assembly of step 2 (minimum phase 2 of 1000 Genomes Project used a combination of 50×) limits the detected micSeqs to commonly observed reference GRCh37 (hg19) and unlocalized and unplaced sequences instead of rare ones present in only a few in- contigs as reference genome. We searched our sequen- dividuals. This cutoff could be adjusted based on the ces against these two latest versions of the reference number of individuals and the depth of sequencing to genome for micSeqs. Sequences with more than 95% identifyfewerhighly common variantsoragreaternum- similarity were detected for 43 micSeqs for GRCh38 and ber of rare variants. It should also be noted that the 80 for the reference used by the 1000 Genomes Project, resulting micSeqs assembled by our method could be which further validates our approach. These results also part of the true consensus sequence, and a large number suggest that the remaining micSeqs are good candidates of micSeqs in the population remain to be discovered, for inclusion in future releases of the reference human due to the relatively low coverage of NGS data and only genome. 363individualsfromthreepopulationsusedinthisstudy. Theorigin andevolution of micSeqs inhumans can be Discussion estimatedusingaparsimonyapproachalongwiththetime Acomprehensive catalogueofgenetic variants,including pointofdeletionintheancestorsofasub-populationand/ SNPs, indels, CNVs, and common and rare variants is or identification of periods that new sequences emerged. an essential resource for researchers attempting to iden- None of the 309 micSeqs reported here have homologs tify variants that affect phenotypic traits, complex gen- beyond primates, indicating that they are relatively young etic diseases, and responses to drugs and environmental inthehumangenomeandcommonancestorsofprimates Liuetal.BMCGenomics2014,15:685 Page10of14 http://www.biomedcentral.com/1471-2164/15/685 first acquired those sequences in their genomes. It is also which the sequence is identified depends on the num- interesting to note that the number of micSeqs in each ber of individuals pooled and the depth of sequencing primate agrees precisely with the evolutionary distance for each individual; and examination of the concord- with humans (Table 1), suggesting that micSeqs were ance statistics of the mapping locations provided by acquired gradually over long time intervals. Possible me- the anchor end of the OEA reads to filter out false chanisms to acquire new sequences include duplica- micSeqs across the population of individuals and/or tion followed by dramatic changes, admixture with other reads (Figure 2). Clearly this approach could be an ef- closely related species, and horizontal sequence transfer fective means of identifying insertions of novel sequen- from other species. Duplication is a major evolutionary ces related to human disease by examining pooled NGS eventoccurringacrossthegenomesequence,andcontrib- data from affected individuals, including tumors or utes significantly to shape and refactor the functionalities other tissue-specific genomes. However, due to limita- of the organism [34]. However, duplication may not be a tionsintheassemblyofshortreads,wemayhaveidenti- mechanism relevant to micSeqs since no paralogous se- fied only parts of longer insertions. Assembling the full quences can be detected in human and other primates. orphan reads and linking ends to the existing micSeqs Horizontal DNA sequence transfer, particularly through could identify additional sequences. Thus both the or- retroviralinfection[35],istheprocessbywhichanorgan- phanreadsfromOEAand/orfullorphanreadscouldbe ismincorporatesDNAsequencefromevolutionarilyunre- used to identify novel insertions where techniques such lated organism. This phenomenon is well documented in as mapping to primate sequences, examination of indi- evolution of bacterial and parasitic unicellular eukaryotes vidual genomes, comparison to RNA-seq data, or PCR [36]. Recent studies show strong evidence of transfer of and sequencing are used to validate predicted novel in- DNAtohighereukaryotes,suchasDrosophila,wheelani- sertions. The identification of over 300 sequences - at mals and even rodents [37-40]. However, it is still not least 50 per individual - will be a new focus for future clear if horizontal transfer can occur in primates. Recent functionalanalyses. analyses of Neandertal and archaic hominid genomes showthatupto6%percentoftheirgenomescancontrib- Methods utetogenomesofonehumanpopulation[41-43],suggest- Datacollection ing at least two admixtureevents in the course of human The paired-end sequencing data were downloaded from evolution. Thus, the micSeqs possibly resulted from the phase one of the 1 K Genome Project [7]. We collected admixtureofothercloselyrelatedextinctspecies. data for 363 individuals whose origins are European In our data 253 micSeqs (82%) have either homologs (CEU, 102 samples), African (YRI, 88 samples) and in other primates or are present in African populations. Asian (CHB and JPT, 88 and 85 samples respectively). Thus, absence of the majority of micSeqs in part of the Intotal,12,736fastqfiles(5,118filesforEuropeans,3,526 global human population are due to deletions in recent for Africans, and 4,092 for Asians) were collected and human evolution, since strong evidence supports the processed. origin of modern humans occurred in a relatively re- stricted geographic distribution in East Africa. Thirty Trimminglowqualitypositions three of 253 have homologs in other primates, but are Software trimmomatic (version 0.22) was used to trim absent in Africans, suggesting in a complementary fash- adapter sequences and nucleotides with low quality from ion that they were deleted in African populations. Fifty the original data [45]. Trimmomatic performed a sliding six micSeqs have nohomolog inprimates and areabsent window trimming with window size of four nucleotides; in African populations; one possible mechanism to ex- regions where the average quality falls below 25 were re- plain this finding is the acquisition of new sequences by moved. The whole reads were removed if the remaining ancestors ofcontemporary populations after their migra- length was shorter than 20 nucleotides. The correspond- tion fromAfrica. ing mate read was also removed if one whole read was filtered. Conclusions This work presented here builds on previous several Aligningtoreferencegenome technological and analytic advances, beginning with Bowtie2 was applied to map paired-end reads to human NGS data or Sanger sequencing from cloned DNA, to reference genome (hg19, including chr1-22, X, Y, M and identify novel sequence variants in individual genomes other 44 contigs) [46]. A pair of reads is said to align [11,12,14,17,44]. The unique features of this study that “discordantly”, if the alignment of the two reads to the haverelevanceforotherdatasetsinclude:poolingofor- reference does not match paired-end expectations (i.e. phan reads from many individuals’ NGS data to identify the mates aren’t in the expected relative orientation, novel common sequences, where the frequency with aren’t within the expected distance range, or one read
Description: