Next Generation Crop Improvement
Peter L. Morrell
Agronomy & Plant Genetics - Minnesota
Monday, February 27, 12

REVIEWS
Crop genomics: advances and applications
Peter L. Morrell (1), Edward S. Buckler (2) and Jeffrey Ross-Ibarra (3)
Morrell et al. 2012

Abstract | The completion of reference genome sequences for many important crops and the ability to perform high-throughput resequencing are providing opportunities for improving our understanding of the history of plant domestication and to accelerate crop improvement. Crop plant comparative genomics is being transformed by these data and a new generation of experimental and computational approaches. The future of crop improvement will be centred on comparisons of individual plant genomes, and some of the best opportunities may lie in using combinations of new genetic mapping strategies and evolutionary analyses to direct and optimize the discovery and use of genetic variation. Here we review such strategies and insights that are emerging.

The completion of reference genome sequences for many important crops and model plants has the potential to aid in the realization of the long-standing promise of plant genomics to dramatically accelerate crop improvement [1]. Since the late 1960s, it has been possible to survey molecular markers across a plant genome [2], but for decades the number of markers that could be readily assayed placed limits on the genetic resolution that could be achieved using either experimental or comparative genetic approaches. Only a few years ago, the highest-density genetic maps required the laborious assay of several thousand markers (for example, REF. 3). Experimental populations were generally limited to simple crosses between two parents; more elaborate study designs that might provide an assessment of the genomic distribution of agronomically important mutations and their frequency in the relevant germplasm were proscribed by limits on marker technologies and the analytical approaches that could be used to distinguish the contribution of multiple parents. Comparative approaches for the identification of functionally important mutations based on analysis of marker frequency among populations had also been proposed [4], but the high variance in expected allele frequency between populations [5] made the discovery of functionally important variants among the high number of loci surveyed highly improbable.

A reference genome is now available for a number of crops (FIG. 1), and progress is being made towards references for crops with large genomes [6] (for example, see links in Further information). In addition, reference genomes have been published for a number of other model plant systems, including Arabidopsis thaliana and Brachypodium distachyon [7,8]. Comparative genomics, which is traditionally thought of as the analysis of synteny (gene order) and sequence comparisons among related species, is now being redefined by the rapid publication of increasing numbers of reference genomes, by estimation of sequence diversity from high-throughput resequencing, by the examination of the genomic distribution of large insertions and deletions (indels) and copy number variants (CNVs) and by the emergence of a new generation of experimental and computational approaches. From genetic mapping to evolutionary analysis, the future of crop improvement will revolve around the comparisons of individual plant genomes. Maximizing the use of this genomic data for crop improvement is of fundamental importance if we are to continue increasing crop production in the face of growing human populations and changing climates while minimizing the environmental impact of agricultural activity.

In this Review, we begin by addressing the challenges for comparative crop genomics that are posed by the complex organization of plant genomes and the high levels of nucleotide and structural diversity that are found in many crop species. We then discuss the importance of understanding domestication, as the origin and demography of a crop affect the genetic basis of agronomic traits and influence patterns of nucleotide diversity genome-wide. We examine the ways in which our understanding of the genetics of agronomic traits is being fundamentally reshaped by genomic data. High-density genetic markers are being used in genome-wide association studies (GWASs) and can also be exploited for genomic selection.

Margin glossary:
Genome-wide association studies (GWASs). Studies that search for a statistical association between a phenotype and a particular allele by screening loci (most commonly by genotyping SNPs) across the entire genome.

(1) Department of Agronomy and Plant Genetics, University of Minnesota, St Paul, Minnesota, 55108.
(2) US Department of Agriculture–Agriculture Research Service (USDA–ARS) and Institute for Genomic Diversity and the Department of Plant Breeding and Genetics, Cornell University, Ithaca, New York 14853, USA.
(3) Department of Plant Sciences and the Genome Center, University of California Davis, California, 95616, USA.
Correspondence to P.L.M. and J.R.-I.
e-mail: [email protected]; [email protected]
doi:10.1038/nrg3097
Published online 29 December 2011

NATURE REVIEWS | GENETICS VOLUME 13 | FEBRUARY 2012 | 85
© 2012 Macmillan Publishers Limited. All rights reserved

Soybean Genomics Strategic Plan
• Goal 1: Improve Utility of Genome Sequence
• Improve bioinformatics resources - practical applications
• Goal 2: Translational Genomics - Optimize Breeding Efficiency
Boerma et al.
2011

Topics
• Next generation populations
• Selection against deleterious mutations
• Applications of genome-wide SNP data

Observational Astronomy
• Information about places we won’t ever visit
• The visible portion of the electromagnetic spectrum is only a fraction of what exists
• DNA resequencing data latent with information about the past

DNA Resequencing
• Most direct measure of genetic diversity
• Can assay all heritable variation
• Can now be collected very rapidly

Genome Size

[Figure 1: bar chart. y-axis: Genome size (Mb), 0 to 18,000. Legend: No published sequence; Published sequence; Average angiosperm. Crops on the x-axis: Cucumber, Peach, Strawberry, Orange, Papaya, Medicago, Foxtail millet, Cacao, Rice, Grape, Cassava, Sorghum, Pigeonpea, Potato, Tomato, Soybean, Sugar beet, Maize, Sugarcane, Barley, Bread wheat.]

Figure 1 | Crop genome size. Genome size of all published crop genomes (shown in green) and the five most important production crops with unpublished genome sequences (shown in blue). The average angiosperm genome size of ~6 Gb is shown by the dotted line for comparison.

Understanding of agronomic traits is also being improved by a new generation of multiparent genetic mapping populations (or next-generation populations). As we discuss, higher-throughput resequencing and marker genotyping will also enable new approaches towards crop improvement, such as the identification and selective elimination of deleterious mutations.

Challenges of plant genomes

The genomic tools that are applied to plants are often developed for and tested against data from humans or other model systems, such as fruitflies or mice [9,10], but the size and dynamic nature of plant genomes adds to or exacerbates challenges that are faced in other systems (FIG. 1). Plants tend to have a larger number of multigene families [11] and a higher frequency of polyploidy than occurs in mammals. This makes paralogy a more substantive issue because the short sequence reads that are typical of high-throughput sequencing may not map uniquely to a reference genome, and allelic variation cannot then be distinguished from differences among closely related gene family members (FIG. 2). Paralogy remains a problem even in plant species that have a high-quality reference genome owing to the prevalence of extensive copy number variation [12,13]. For instance, estimates suggest that the maize reference genome accounts for only ~70% of the low-copy-number sequences that are present in the parents of a diverse set of maize inbreds and that this copy number variation leads to a high percentage of false-positive variants [14]. It seems likely that continued improvement in sequence read length, along with methodological approaches that assess allelic segregation among lines [14] and that make use of local patterns of linkage disequilibrium (LD), will be useful for identifying paralogous reads in complex crop genomes. Although there may be no simple solution to the complexity of polyploid genomes, sequencing diploid relatives [15,16] or double haploid lines [17] can provide a baseline for future genome-level research in polyploid crops.

The high levels of nucleotide diversity in some crop genomes pose a challenge for comparative analyses, as higher numbers of mismatches between a sample and a reference will result in reduced sequence read mapping (FIG. 2) or reduced hybridization to oligonucleotide arrays. For example, the maize and human genomes are similar in size, but an average pair of maize individuals differs at tenfold more sites than any two humans do [18]. Although many crops do not have high levels of diversity, the difficulties of a diverse genome are not unique to maize as an outcrossing species: diversity is also high in the clonally propagated grape [19] and even in self-fertilizing (‘selfing’) species, such as barley [20].

Another challenge in plant comparative genomics is genome size (FIG. 1). Plant genome size varies by more than three orders of magnitude in currently characterized species [21], largely owing to the prevalence of transposable elements [22]. Size alone makes genomic analysis more difficult: shotgun sequencing reads that are sufficient to provide deep (25×) coverage of four Drosophila melanogaster genomes, enabling the identification of heterozygous sites and structural variation, would provide a meagre ~1× coverage of the wheat genome. The density of transposable elements in plant genomes also means that a large fraction of shotgun sequencing data is of limited use for reference-based genomic analysis, as reads map with equal probability to multiple positions in the reference (FIG. 2). It is not surprising that the crop genomes that have been sequenced to date have all been relatively small: the largest crop genome sequenced, maize, is less than half the size of the average angiosperm genome (FIG. 1; TABLE 1).

Margin glossary:
Paralogy. Unlike orthologous genes, which trace their common origin to a locus in an ancestral species, paralogous loci consist of gene copies that trace their common origin to a duplication event within a genome.
Linkage disequilibrium (LD). Nonrandom association of alleles at two or more loci. The pattern and extent of LD in a genomic region is affected by mutation, recombination, genetic drift, natural selection and demographic history.

Short Read Mapping
Morrell et al. 2012

[Figure 2: schematic with read-mapping scenarios in rows a to e (row e marked with a question mark). Legend: Transposable elements; Genes; Gene with SNP.]

Figure 2 | Challenges of read mapping in plant genomes. The mapping of short sequence reads to a reference plant genome is shown with the genome at the bottom and with sequencing reads above. Coloured shapes represent transposable elements or genes; the two orange ovals represent a pair of paralogous genes. Short sequence reads are shown directly above where they would map to the reference. Different scenarios are shown in lines a–e.
a | Uniquely mapping reads, including junctions between sequence repeats.
b | A sequence from a diverse genome that would fail to map to the reference owing to an excess of SNP differences.
c | A read from one paralogue that maps incorrectly owing to a sequence error or a SNP. The correct mapping is shown with a grey read.
d | Reads that would map multiply and are usually filtered from further analysis.
e | A read from a third copy of the orange gene that is incorrectly mapped to one of the reference copies, leading to a false SNP. This is likely to be the result of a copy number variant that was not included in the reference genome (as indicated by the question mark).

Although plant genomes pose a number of challenges for genomic analysis, they do offer some advantages. Unlike most animals, crops can be propagated clonally or maintained as inbred lines, and the seeds of many species can be stored indefinitely, which effectively immortalizes genotypes of interest. This makes it possible to sequence a line once but to phenotype the line many times, and it allows replication across environments [23]. Inbred lines or specially created double haploids also avoid the difficulties of sequencing highly heterozygous genomes. Sequencing of the grape genome has provided a useful comparison of the advantages and difficulties of sequencing a diploid outcrossing accession [13] or an inbred line [24].

Origin and evolution of crops

Understanding the origins and domestication of crop plants is of substantial evolutionary interest, as domesticated plants provide a model system for studying adaptation [25,26]. An understanding of crop origins has long been held as central to the identification of useful genetic resources for crop improvement [27]. Domestication shapes the genetic variation that is available to modern breeders as it influences levels of nucleotide diversity and patterns of LD genome-wide. The demographic history of domestication also informs our expectations of the genetic architecture of traits and thus our ability to identify causal genetic variants for crop improvement.

Demographic history and geographic origins. Genome-wide polymorphisms make it possible to examine the demographic history and geographic origins of crops. Domestication is an evolutionarily recent phenomenon, and most of the genealogical history at any locus will be shared between a domesticate and its wild progenitor [28]. Comparisons of alleles within and between domesticated and wild taxa will reveal divergence times that greatly predate the origin of the cultivated form [29,30], reflecting the time to most recent common ancestor of the species rather than the time of divergence of the domesticate. A detailed understanding of domestication history requires a large number of loci in conjunction with modelling of population demography. Some of the earliest work on demographic modelling in plants used mean patterns of genetic diversity to fit a bottleneck model of domestication [31], an approach that was later extended to include an explicit likelihood framework [32,33]. More recently, investigators have used methods that incorporate more detailed information, such as the site frequency spectrum [34,35], to distinguish among different evolutionary models.

One of the most fundamental issues that influences the genetic architecture of agronomic traits and the levels of genetic diversity in crop genomes is the number of times that a species has been domesticated. There are compelling examples for both single domestications (such as maize and soybeans) [36,37] and multiple domestications (such as avocados, common beans and barley) [38–40], but the number and location of domestication events for most crops remain unresolved. Simple statistical methods that cluster individuals or populations based on genetic diversity within the domesticate can be misleading, as the number of genetic groupings is not necessarily reflective of domestication history [41,42]. For example, although genetic evidence suggests two domestications of the common bean [39], genetic drift in cultivated populations leads to the identification of multiple genetic groups [43].

The details of even the simplest of domestication scenarios are likely to be complex. For example, geographical spread of the domesticate followed by admixture with wild relatives can obscure geographic origins [44,45]. Extensive admixture may be one explanation for the continued controversy regarding the origins of the domesticated indica and japonica subspecies of rice. Analyses from recent genome-wide resequencing have failed to reach a consensus on the number of domestications of rice: modelling of genetic differentiation supports separate domestications followed by introgression at agronomically important loci [46], whereas the site frequency spectrum and phylogenetic analysis of multiple data sets argue for a single origin [35]. As whole-genome data become available for more crops and their wild relatives, application of methods that make better use of additional information from detailed haplotype structure and patterns of admixture across the genome (for example, REFS 47,48) will improve insight into the complex demographic histories of many crops.

Margin glossary:
Bottleneck. A temporary marked reduction in population size.
Site frequency spectrum. The distribution of allele frequencies in a population: essentially a count of the number of alleles in a population at a given frequency.
Genetic drift. Fluctuations in allele frequencies that are due to the effects of random sampling.
Admixture. The mixing of two or more genetically differentiated populations.
Introgression. The incorporation of genetic material from one population or species into another by hybridization and backcrossing.
Haplotype. The combination of alleles or genetic markers found on a single chromosome of an individual.

Molecular Population Genetics

[Timeline figure, axis 1960 to 2010, with data and technology milestones above the axis and theory milestones below it.]

Data and technology:
• allozyme diversity in Drosophila (Lewontin & Hubby)
• DNA sequencing (Maxam & Gilbert, Sanger et al.)
• GenBank begins
• resequencing in Drosophila (Kreitman)
• resequencing in maize (Shattuck-Eidens et al.)
• human genome sequenced
• human Hap Map (1M SNPs)
• Solexa DNA sequencing
• 2nd generation human Hap Map (3.1M SNPs)
• maize genome sequenced

Theory and methods:
• infinite alleles model (Kimura & Crow)
• neutral theory (Kimura)
• neutral theory (King & Jukes)
• infinite sites model (Kimura)
• estimator of θ = 4Neµ (Watterson)
• coalescent theory (Kingman)
• coalescent with recombination (Hudson)
• 4 gamete test (Hudson & Kaplan)
• estimator of ρ = 4Ner (Hudson)
• Approximate Bayesian Computation (Pritchard et al.)
• estimators of ρ (Hudson) (Kuhner, Yamato, Felsenstein) (Wall) (Fearnhead & Donnelly)
• forward time simulation (Nordborg, Thornton, et al.)

Derived Site Frequency Spectrum (SFS)

GGGATGGCC......GGCACGGGC
.......................G
..C...A.............C..G
........................
..C...A..........G......
..C...A................G
........G...............
..C...A.............C..G

Derived allele counts at the six variable sites, left to right: 4 4 1 1 2 4