
Handbook on Analyzing Human Genetic Data PDF

339 Pages·2009·2.41 MB·English
by Shili Lin and Hongyu Zhao (Editors)

Preview Handbook on Analyzing Human Genetic Data

Handbook on Analyzing Human Genetic Data
Computational Approaches and Software

Shili Lin · Hongyu Zhao
Editors

Professor Dr. Shili Lin
Department of Statistics
The Ohio State University
Columbus, Ohio 43210
USA
[email protected]

Professor Dr. Hongyu Zhao
Department of Epidemiology and Public Health
Yale University School of Medicine
60 College St.
New Haven, CT 06520-8034
USA
[email protected]

ISBN 978-3-540-69263-8
e-ISBN 978-3-540-69624-5
DOI 10.1007/978-3-540-69264-5
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2009931713
© Springer-Verlag Berlin Heidelberg 2010

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: WMXDesign GmbH, Heidelberg
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface and Introduction

The discipline of statistical genetics is highly computational. Be it exact computational methods, simulation based, or a hybrid of the two, computational packages are indispensable tools and constant companions of researchers in the field. This handbook is intended to provide human geneticists and other biomedical researchers with guidance on selections of appropriate computational methods and software packages for their specific genetic problems. It may also be used by students and other learners as a reference in conjunction with a more theoretical and/or methodologically oriented textbook. This book tries to strike a balance between methodological expositions and practical guidelines for software selections. Wherever possible, comparisons among the competing methods and software are made to highlight the relative advantages and disadvantages of the approaches, so that the reader can make informed choices to best match their specific needs.

Human genetics has been undergoing an evolution in the past several years as new knowledge and technologies are transforming the field, leading to numerous new discoveries of genes associated with complex traits such as cancer, obesity, and diabetes. Many recent genome-wide association studies employ the case–control design, where the study subjects consist of unrelated affected individuals and normal controls. For each individual, a large number of genetic markers are queried.
A genetic marker refers to a location in the human genome where people may differ in the genetic material they carry. Genetic markers can come in different forms, with single nucleotide polymorphisms (SNPs) most commonly used because of their high abundance in the genome and the availability of reliable and affordable technologies to genotype them. For a SNP, two different forms (called alleles) generally exist at a single nucleotide position. Because each person carries two copies of each chromosome, for a given SNP with two alleles A and a, there are three possible genotypes a person can have: AA, Aa, and aa. In this setting, a genetic association study amounts to identifying markers that are associated with disease status. This can be accomplished by examining whether there is a statistical association between the marker genotype and the disease status. Although this analysis resembles a standard epidemiological study where each marker can be treated as a potential risk factor, there are many issues unique to genetic studies that need to be addressed. For example, one major concern in these studies is heterogeneity in the genetic background of the sample, and ignoring this issue may result in many false positive findings that have nothing to do with disease etiology. On the other hand, much research has been done to empirically characterize and theoretically model the distributions and dependencies of genetic markers, and such knowledge is very beneficial for association analysis. In fact, a thorough genetic association analysis is not possible without a good understanding of the basic principles of population genetics, a field devoted to the study of allele frequency distributions and how they change under various factors, including mutation, random sampling, migration, and natural selection. The chapter by Dr. Weir provides an overview of the basic concepts of population genetics and serves as the starting point of the analysis of human genetics data.

Although current genotyping platforms can genotype up to one million markers, there are many more markers in the genome that are not queried on these platforms.
The reason that these typed markers can provide good coverage of the genome is the dependence among physically close markers, and such dependence is called linkage disequilibrium. For example, suppose one SNP has alleles A and a, each with allele frequency 50%, and another marker has alleles B and b, each with frequency 50%. If the two markers are independent of each other, we would expect that 25% of chromosomes in the population carry both A and B, and similarly for the other three possible combinations: Ab, aB, and ab. However, it is often the case that if these two markers are very close to each other on the same chromosome, the two alleles carried on the same chromosome are not independent. In the most extreme case, there are only two types of chromosomes, those carrying AB and those carrying ab, a phenomenon called perfect linkage disequilibrium. Haplotypes refer to the combination of alleles on the same chromosome, and the presence of such marker dependency is the key underlying recent successes of genetic association studies that collect genotypes from only a small fraction of all known markers. There are many statistical challenges presented in the analysis of haplotypes, both for population genetics studies and for more effective genetic association studies. These topics are discussed in the chapter by Drs. Zhang and Niu, focusing on population genetics, and in the chapter by Drs. Epstein and Kwee in the context of disease association analysis.

Genetic association studies can be performed on unrelated individuals using traditional epidemiological designs, for example, case–control design and cohort design, or designs unique to genetic studies, for example, family-based association design. Because sample heterogeneity in genetic background is one major concern in the validity of a genetic association study based on unrelated individuals, various statistical methods have been proposed to utilize genetic information in the collected marker genotypes to make appropriate adjustments in association analysis.
For example, with enough marker information, it is possible to infer the genetic background of each individual, and such inferred background information can be incorporated in association analysis to make the results less susceptible to sample heterogeneity. This issue is thoroughly studied and addressed in the chapter by Drs. Zhu and Zhang.

With data from related individuals, genetic association tests may be conducted in a manner that is valid (i.e., not subject to bias due to sample heterogeneity) even without utilizing genetic markers to infer genetic background. The basic principle is to detect whether there is a departure from random marker segregation at a candidate locus. For example, suppose a study population consists of affected children and their parents, and a marker with two alleles A and a is studied for its potential involvement in the disease. If the marker has nothing to do with disease phenotype, we expect that a parent who is heterozygous Aa would have an equal chance of transmitting allele A or a to his/her affected offspring. On the other hand, if allele A increases disease risk, we would expect to observe a preferential transmission of allele A to the affected offspring. This testing procedure is robust to sample heterogeneity because the inference is conditional on each parent's genotype and the only genetic principle tested is random marker allele transmission from parents to offspring, Mendel's first law. Many statistical developments along this research route are discussed in the chapter by Drs. Zhang and Zhao.

Both population-based and family-based association studies examine statistical associations between a phenotype and the genotypes at a marker.
One implicit assumption is that the same marker genotype would exert the same or similar effects on a phenotype. While this is expected to be the case for most genetic markers that have direct functional impact, this assumption may well be violated for many markers. For example, consider a marker with two alleles A and a that is not functional but rather is in linkage disequilibrium with a truly functional one with two alleles D and d. It is possible that A is positively associated with D in one population, that is, someone carrying A on one chromosome is also more likely to carry D on the same chromosome, but A is negatively associated with D in another population. In this case, an analysis using samples from these two populations together may not even be able to detect a genetic association. More importantly, when the markers are sparse and not expected to provide good coverage of the genome, the association analysis paradigm discussed above will not be effective, as a large proportion of the genome that likely harbors disease genes may be missed due to poor coverage. This was in fact the case only a few years ago, when far fewer markers could be used for genetic analysis. In this scenario, although the markers were not dense enough to cover the genome for association analysis, they were more than adequate to allow geneticists to infer whether two relatives in an ascertained pedigree share a segment in the genome from the same ancestor. For example, if two siblings have the same marker genotypes across a set of closely linked markers on the same chromosome, then they likely have inherited the same genetic materials from both their parents. A genetic linkage analysis statistically assesses whether there is a co-segregation of genetic materials within a candidate region and the phenotype within a family. For example, this can be done by studying whether there is a correlation between trait similarities and inheritance similarities at a candidate region among a set of individuals from the same family. Consider a study enrolling affected sib pairs.
If the majority of them share the same genetic materials from their parents in a region, then this region is likely involved in disease etiology. Note that in contrast to association analysis, which is performed across all study subjects, linkage analysis is conducted within families and evidence is then summed across individual families. Statistical methods for linkage analysis can be conducted for either qualitative traits (the chapter by Drs. Li and Abecasis) or quantitative traits (the chapter by Drs. Amos, Peng, Xu, and Ma).

Exact inference of inheritance patterns within a pedigree is tractable either for a small pedigree or for a few markers, but such inference becomes computationally prohibitive for large pedigrees with many genetic markers. In this case, the exact probabilities may be estimated by Monte Carlo simulations. In the chapter by Drs. Igo, Luo, and Lin, the principles and implementations underlying the simulation methods for linkage analysis in large complex pedigrees are discussed.

One central topic in statistical inference is the control of false positive results so as to minimize any consequences resulting from false leads. This issue has been well addressed when only one or a small number of statistical hypotheses are tested. However, hundreds of thousands of markers are tested for their associations with disease in a genome-wide association study, and false positive control at the individual marker level will not be adequate. For example, if a study considers 500,000 markers and the statistical significance level is set at 0.01, we would expect to see 5,000 false positive results even when there is no association between disease status and any of the markers. A similar issue exists in the linkage analysis context, although not to the same extent as in association analysis. The chapter by Drs. Zhang and Ott presents some recent developments on appropriately controlling overall false positive results in genetic studies at the genome level.

The identification of disease genes can lead to biological insights on pathways involved in disease etiology, and these findings can also be used to predict an individual's disease risk.
In their chapter, Drs. Gail and Chatterjee discuss statistical methods that use findings from genetic studies to identify individuals at higher risk for disease.

The book concludes with the last chapter by Drs. Molony, Sieberts, and Schadt, where they discuss integrating genetics and genomics data to better delineate biological pathways underlying complex traits. In addition to disease status and possibly other clinical outcomes, they consider gene expression data that can now be routinely gathered to measure the expression levels of tens of thousands of genes simultaneously for each study subject. These gene expression data add a whole new dimension of statistical analysis and are very information rich. In principle, the expression level of each gene can be thought of as a quantitative trait, and linkage/association analysis can be conducted to identify genes regulating a gene's expression level. Therefore, based on this perspective, we would be in a position to conduct genetic analysis for tens of thousands of traits. Some of these expression levels may be associated with disease outcome, and so it is natural to investigate how a genetic variation affects the expression levels as well as disease outcomes. Many biological questions on the underlying genetic networks relating genetic variations, expression variations, and phenotype variations can be posed and answered with these data. This chapter discusses topics falling into the domain of systems biology, where the whole biological system is the focus of a study and genome-level data of different types are needed to dissect the networks.

We hope that this book will provide an overview of the most important areas in genetic data analysis methods. We focus on fundamental principles and, when possible, demonstrate these principles with real data examples. Despite our efforts, this is not an encyclopedia of statistical methods in human genetics, and some topics are not included, such as the experimental design of a genetic study, data preprocessing from high-throughput genotyping platforms, and copy number variations.
Most importantly, this is a very rapidly developing field, and new technologies are constantly introduced that demand novel statistical approaches to make the most use of the data collected. For example, the statistical methods discussed in this book may not be the most effective for inferring inheritance patterns in a pedigree using high density SNP data. On the other hand, the availability of re-sequencing data from a large number of study subjects leads to a new set of informatics and statistical challenges, such as the incorporation of SNP annotation information and the handling of rare genetic variations. We hope the basic principles and statistical methods discussed in this book will motivate the readers to develop their own approaches if necessary to accelerate our progress in mapping disease genes.

Contents

Population Genetics
    Bruce Weir
Haplotype Structure
    Yu Zhang and Tianhua Niu
Linkage Analysis of Qualitative Traits
    Mingyao Li and Gonçalo R. Abecasis
Linkage Analysis of Quantitative Traits
    Christopher I. Amos, Bo Peng, Yaji Xu, and Jianzhong Ma
Markov Chain Monte Carlo Linkage Analysis Methods
    Robert P. Igo, Yuqun Luo, and Shili Lin
Population-Based Association Studies
    Xiaofeng Zhu and Shuanglin Zhang
Family-Based Association Studies
    Kui Zhang and Hongyu Zhao
Haplotype Association Analysis
    Michael P. Epstein and Lydia C. Kwee
Multiple Comparisons/Testing Issues
    Qingrun Zhang and Jurg Ott
Estimating the Absolute Risk of Disease Associated with Identified Mutations
    Mitchell H. Gail and Nilanjan Chatterjee
Processing Large-Scale, High-Dimension Genetic and Gene Expression Data
    Cliona Molony, Solveig K. Sieberts, and Eric E. Schadt
Index
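
The preface gives three small numerical illustrations: the haplotype frequencies expected when two markers are independent versus in perfect linkage disequilibrium, the equal-transmission expectation for heterozygous parents, and the 5,000 false positives expected from 500,000 tests at a significance level of 0.01. The Python sketch below works through them as a rough illustration only. The function names, the use of D and r² as linkage disequilibrium measures, the McNemar-type transmission statistic, the Bonferroni threshold, and the transmission counts (60 versus 40) are all assumptions introduced here; the preface names none of them, and the individual chapters may well use different methods.

```python
import math


def ld_coefficient(p_ab, p_a, p_b):
    """D = P(AB) - P(A) * P(B); zero when the two markers are independent."""
    return p_ab - p_a * p_b


def ld_r_squared(p_ab, p_a, p_b):
    """Squared correlation between alleles at two biallelic markers."""
    d = ld_coefficient(p_ab, p_a, p_b)
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


def transmission_statistic(n_a_upper, n_a_lower):
    """McNemar-type statistic for transmissions from heterozygous Aa parents.

    n_a_upper: times allele A was transmitted to an affected child
    n_a_lower: times allele a was transmitted to an affected child
    Under random (Mendelian) transmission this is approximately
    chi-square distributed with 1 degree of freedom.
    """
    total = n_a_upper + n_a_lower
    if total == 0:
        raise ValueError("need at least one heterozygous-parent transmission")
    return (n_a_upper - n_a_lower) ** 2 / total


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square(1) variable: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test level that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


if __name__ == "__main__":
    # Independent markers: 25% of chromosomes carry AB, so D is 0.
    print("D, independent:", ld_coefficient(0.25, 0.5, 0.5))
    # Perfect linkage disequilibrium: only AB and ab chromosomes exist.
    print("r^2, perfect LD:", ld_r_squared(0.5, 0.5, 0.5))
    # Hypothetical counts: A transmitted 60 times, a transmitted 40 times.
    stat = transmission_statistic(60, 40)
    print("transmission statistic:", stat, "p-value:", chi2_1df_pvalue(stat))
    # The preface's multiple-testing example.
    print("expected false positives:", expected_false_positives(500_000, 0.01))
    print("Bonferroni per-test level:", bonferroni_threshold(500_000))
```

Using only the standard library keeps the sketch self-contained; for real data, an established statistical genetics package of the kind the handbook surveys would be the more appropriate choice.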

Description:
book is intended to provide human geneticists and other biomedical researchers with guidance on ... Among the 19 SNPs in the “dumped region IL10 CEU.txt” dataset, 7 SNPs had minor allele ... Riordan JR, Rommens JM, Kerem B, Alon N, Rozmahel R, Grzelczak, Zielenski J, Lok S, Plavsic N