POPSEQ
Anchoring and ordering contig assemblies from next generation sequencing data by population sequencing

Dissertation submitted for the degree of Doctor of Natural Sciences
at the Faculty of Technology (Technische Fakultät) of Universität Bielefeld
by Martin Mascher
28 October 2013

Printed on ageing-resistant paper (ISO 9706).

Contents

1 Introduction
  1.1 Overview
  1.2 Publications
  1.3 Acknowledgements
2 Background
  2.1 Genetic mapping
    2.1.1 Meiotic recombination and genetic linkage
    2.1.2 Genetic markers
    2.1.3 Plant mapping populations
    2.1.4 Genetic map construction
  2.2 Next generation sequencing technologies
    2.2.1 Roche 454
    2.2.2 Illumina
  2.3 Genome assembly strategies
    2.3.1 DNA fragment assembly
    2.3.2 Hierarchical shotgun sequencing
    2.3.3 Whole genome shotgun sequencing
  2.4 Methods for anchoring sequence assemblies
    2.4.1 Integrating physical and genetic maps
    2.4.2 The sequence-enriched physical and genetic map of barley
    2.4.3 Genome zippers: anchoring by collinearity
    2.4.4 Direct anchoring of sequence contigs
3 The POPSEQ method
  3.1 Software used in the POPSEQ pipeline
    3.1.1 BWA
    3.1.2 SAMtools
    3.1.3 MSTMAP
  3.2 Barley populations and sequence data
  3.3 From FASTQ to marker-by-genotype matrix
  3.4 Framework maps
    3.4.1 Morex × Barke iSelect map
    3.4.2 Morex × Barke GBS map
    3.4.3 OWB GBS map
  3.5 Mapping SNPs and WGS contigs to the framework map
4 Proof-of-principle of POPSEQ
  4.1 Comparison to sequenced bacterial artificial chromosomes (BACs)
  4.2 Comparison to the integrated physical and genetic map of barley
  4.3 Using different framework maps for one population
  4.4 Using different populations
5 Applications of POPSEQ in genome-assisted research
  5.1 Reference-based genetic mapping
  5.2 Gene isolation
    5.2.1 Mapping the Vrs1 gene
    5.2.2 Mapping-by-sequencing with exome capture
  5.3 Anchoring physical maps
    5.3.1 Genetic anchoring of BAC contigs
    5.3.2 Genetic anchoring of single BAC clones
  5.4 Comparative genomics
    5.4.1 Collinearity
    5.4.2 Evolutionary and population genomics
6 Discussion and outlook
  6.1 Impact of assembly quality and sequencing depth
  6.2 Limitations of genetic anchoring in the Triticeae
  6.3 POPSEQ for polyploid and outbred species
    6.3.1 Polyploids
    6.3.2 Outbred species
  6.4 Validation and improvement of the POPSEQ algorithm
  6.5 Conclusion
Bibliography

List of Figures

2.1 Crossover
2.2 Segregation in a dihybrid cross
2.3 Mapping populations
2.4 Historic development of sequencing costs
2.5 Schematic overview of hierarchical shotgun sequencing
2.6 The sequence-enriched physical and genetic framework of barley
3.1 Overview of POPSEQ
3.2 Distribution of coverage in the Morex × Barke GBS data
3.3 Missing data
3.4 Collinearity between IBSC and GBS maps
3.5 Placing SNPs into a framework
3.6 POPSEQ parameters
4.1 Comparison between MxB iSelect POPSEQ and IBSC anchoring
4.2 Comparison between MxB iSelect and MxB GBS POPSEQ
4.3 Comparison of MxB and OWB POPSEQ
5.1 POPSEQ and consensus map
5.2 OWB graphical genotypes
5.3 Mapping-by-sequencing Vrs1 in OWB
5.4 Mapping-by-sequencing
5.5 POPSEQ anchoring of the barley physical map
5.6 Physical vs. genetic distance in barley
5.7 Syntenic blocks between H. vulgare and B. distachyon
5.8 Barley whole exome capture performance
6.1 Observed and expected coverage
6.2 Suppressed recombination in the genetic centromere of barley

List of Tables

2.1 Sequencing platforms
2.2 Progress in genome sequencing
2.3 Features of the barley physical map
2.4 Features of the whole genome shotgun assembly
3.1 Sequence data generated in this study
3.2 Number of anchored WGS SNPs
3.3 Anchoring statistics
4.1 Consistency of POPSEQ positions of physically close WGS contigs
4.2 Evaluation of different parameter sets for POPSEQ
5.1 POPSEQ anchoring statistics of FP contigs and BAC clones

List of abbreviations

A. thaliana     Arabidopsis thaliana
BAC             Bacterial artificial chromosome
BAM             Compressed binary SAM format (see below)
B. distachyon   Brachypodium distachyon
BLAST           Basic local alignment search tool
bp              Basepair
BWA             Burrows-Wheeler aligner
cDNA            Complementary DNA of messenger RNA
cM              CentiMorgan
cv.             Cultivar
DH              Doubled haploid
DNA             Deoxyribonucleic acid
E. coli         Escherichia coli
EST             Expressed sequence tag
F1, F2, ...     First, second, ... filial generation
FP contig       Fingerprint contig
GATK            Genome Analysis Toolkit
Gb              Giga base pair
GBS             Genotyping-by-sequencing
HSP             High scoring pair
H. vulgare      Hordeum vulgare
IBSC            International Barley Genome Sequencing Consortium
Indel           Short insertion and deletion polymorphism
kb              Kilo base pair
LD              Linkage disequilibrium
MAD             Median absolute deviation
MTP             Minimum tiling path
MxB             Morex × Barke
Mb              Mega base pair
N50             Weighted average contig size; half of the assembly is contained in contigs larger than the N50
NGS             Next generation sequencing
nt              Nucleotide
OWB             Oregon Wolfe Barleys
PCR             Polymerase chain reaction
RIL             Recombinant inbred line
RNA             Ribonucleic acid
RNA-seq         RNA sequencing using NGS technology
SAM             Sequence Alignment/Map format
VCF             Variant call format
SNP             Single nucleotide polymorphism
Vrs1            Six-rowed spike gene
WGS             Whole genome shotgun