Int J Legal Med (2013) 127:559–572
DOI 10.1007/s00414-012-0788-1

ORIGINAL ARTICLE

First all-in-one diagnostic tool for DNA intelligence: genome-wide inference of biogeographic ancestry, appearance, relatedness, and sex with the Identitas v1 Forensic Chip

Brendan Keating & Aruna T. Bansal & Susan Walsh & Jonathan Millman & Jonathan Newman & Kenneth Kidd & Bruce Budowle & Arthur Eisenberg & Joseph Donfack & Paolo Gasparini & Zoran Budimlija & Anjali K. Henders & Hareesh Chandrupatla & David L. Duffy & Scott D. Gordon & Pirro Hysi & Fan Liu & Sarah E. Medland & Laurence Rubin & Nicholas G. Martin & Timothy D. Spector & Manfred Kayser & on behalf of the International Visible Trait Genetics (VisiGen) Consortium

Received: 30 August 2012 / Accepted: 17 October 2012 / Published online: 13 November 2012
© The Author(s) 2012. This article is published with open access at Springerlink.com

Abstract When a forensic DNA sample cannot be associated directly with a previously genotyped reference sample by standard short tandem repeat profiling, the investigation required for identifying perpetrators, victims, or missing persons can be both costly and time consuming. Here, we describe the outcome of a collaborative study using the Identitas Version 1 (v1) Forensic Chip, the first commercially available all-in-one tool dedicated to the concept of developing intelligence leads based on DNA. The chip allows parallel interrogation of 201,173 genome-wide autosomal, X-chromosomal, Y-chromosomal, and mitochondrial single nucleotide polymorphisms for inference of biogeographic ancestry, appearance, relatedness, and sex. The first assessment of the chip's performance was carried out on 3,196 blinded DNA samples of varying quantities and qualities, covering a wide range of biogeographic origin and eye/hair coloration as well as variation in relatedness and sex. Overall, 95 % of the samples (N = 3,034) passed quality checks with an overall genotype call rate >90 % on variable numbers of available recorded trait information. Predictions of sex, direct match, and first to third degree relatedness were highly accurate. Chip-based predictions of biparental continental ancestry were on average ~94 % correct (further support provided by separately inferred patrilineal and matrilineal ancestry). Predictions of eye color were 85 % correct for brown and 70 % correct for blue eyes, and predictions of hair color were 72 % for brown, 63 % for blond, 58 % for black, and 48 % for red hair. From the 5 % of samples (N = 162) with <90 % call rate, 56 % yielded correct continental ancestry predictions while 7 % yielded sufficient genotypes to allow hair and eye color prediction. Our results demonstrate that the Identitas v1 Forensic Chip holds great promise for a wide range of applications including criminal investigations, missing person investigations, and for national security purposes.

Keywords DNA intelligence · Forensic DNA phenotyping · SNP · Prediction · Relatedness · Kinship · Ancestry · Eye color · Hair color · Sex

Electronic supplementary material The online version of this article (doi:10.1007/s00414-012-0788-1) contains supplementary material, which is available to authorized users.

Brendan Keating and Aruna T. Bansal as well as Timothy D. Spector and Manfred Kayser contributed equally to this work, respectively.

B. Keating (*)
The University of Pennsylvania, Office 1016, Abramson Building, 3615 Civic Center Blvd., Philadelphia, PA 19104-4399, USA
e-mail: [email protected]

A. T. Bansal · L. Rubin
Identitas Inc., 1115 Broadway, 12th Floor, New York, NY 10010, USA

S. Walsh · F. Liu · M. Kayser (*)
Department of Forensic Molecular Biology, Erasmus MC University Medical Center Rotterdam, PO Box 2040, 3000 CA Rotterdam, The Netherlands
e-mail: [email protected]

J. Millman · J. Newman
Centre of Forensic Sciences, 25 Grosvenor Street, Toronto, ON M7A 2G8, Canada

K. Kidd
Yale University School of Medicine, PO Box 208005, New Haven, CT 06520-8005, USA

B. Budowle · A. Eisenberg
Institute of Applied Genetics, Department of Forensic and Investigative Genetics, University North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, TX 76107, USA

J. Donfack
Laboratory Division, Federal Bureau of Investigation, 2501 Investigation Parkway, Quantico, VA 22135, USA

P. Gasparini
Institute for Maternal and Child Health, IRCCS Burlo Garofolo, University of Trieste, Piazzale Europa 1, 34127 Trieste, Italy

Z. Budimlija
New York City Office of Chief Medical Examiner, 421 East 26th Street, New York, NY 10016, USA

A. K. Henders · D. L. Duffy · S. D. Gordon · S. E. Medland · N. G. Martin
Queensland Institute of Medical Research, Royal Brisbane Hospital, Locked Bag 2000, Herston, Brisbane, Queensland 4029, Australia

H. Chandrupatla
Anjin Solutions, 34 Downing Lane, Voorhees, NJ 08043, USA

P. Hysi · T. D. Spector
Department of Twin Research, King's College London, St. Thomas' Hospital, Westminster Bridge Road, London SE1 7EH, UK

Introduction

There have been, and likely will continue to be, forensic cases where the evidentiary DNA profile does not directly match that of a known individual or any reference sample profile contained within a national DNA database. In addition, current forensic DNA profiling has provided, and likely will continue to provide, little or no information in a number of missing person cases, including mass disaster identification, where scant information is available on the putative identity of the remains found. Traditional policing places heavy reliance on human eyewitnesses to enable investigators to identify suspects. While eyewitness reports have been shown to be helpful, they are highly error prone [1, 2], and consequently a number of people convicted on the basis of eyewitness identification evidence have been exonerated through forensic DNA testing [2]. The emerging field of DNA intelligence allows novel investigative leads to be developed directly from the DNA of a forensic sample that can help in identifying persons previously unknown to the authorities. This has the benefit of reducing reliance on human eyewitness accounts and can provide leads in the many cases without known human eyewitnesses [3, 4]. Valuable information in this respect includes biogeographic ancestry and externally visible characteristics (EVC) of the unknown sample donor such as sex, eye color and hair color, and others, via the discipline of Forensic DNA Phenotyping (FDP), as well as the relatedness of the unknown donor with alleged family members.

Worldwide scientific initiatives such as the Human Genome Project [5, 6] and subsequently the International HapMap Project [7–10] together have laid the foundations for the discovery and large-scale population diversity catalogues of several millions of single nucleotide polymorphisms (SNPs). Highly effective massively parallel SNP genotyping platforms using microarray technology were developed from these resources, allowing genome-wide analysis of currently over a million SNPs in a single test [11, 12]. High-resolution SNP microarrays have been used in various human population studies, such as in the worldwide Human Genome Diversity Panel (HGDP-CEPH) [13, 14], and in population studies within continents such as Europe [15], Asia [16], Africa [17], India [18], and Oceania [19]. Together with the HapMap data resources [10] and candidate marker studies including those on Y-chromosomal and mitochondrial DNA diversity [20], substantial knowledge on DNA-based inference of biogeographic ancestry has begun to be realized. From these datasets, so-called ancestry-informative DNA markers (AIMs) have been developed. Autosomal AIM sets usually provide ancestry resolution at the level of broad geographic regions such as continents [21–25], whereas with some particular Y-chromosomal and mitochondrial AIMs, within-continental resolution can be achieved [4, 20]. In some efforts, multiplex AIM panels suitable for forensic applications have been developed for autosomal, Y-chromosomal, and mitochondrial SNPs [26–32]. However, especially when it comes to recombining autosomal markers, such reduced AIM panels tend to provide less ancestry resolution than high-resolution SNP microarrays [4].

Moreover, high-resolution SNP microarrays have been used in genome-wide association studies (GWASs) to discover SNPs involved in human EVCs, most notably eye and hair color [33–37]. From these studies, and from candidate gene studies [38–42], DNA markers predictive for human eye and hair color categories have been identified [37, 43–48]. The first multiplex tools for DNA-based eye and/or hair color prediction have been made available recently [49–53], of which at least one, the IrisPlex system for blue and brown eye color prediction [49], has already been successfully validated for forensic applications [54].

All conventional efforts to develop diagnostic tools for DNA intelligence, however, were restricted to specific elements and were analyzed separately. This limited approach is partly due not only to the incremental nature of the research but also to technological restrictions on the number of SNPs that could be genotyped reliably in single multiplex assays suitable for forensic samples. For instance, the SNaPshot chemistry, which is the most widespread SNP typing technology in the forensic genetic field, only allows for the analysis of up to a few dozen SNPs in a single multiplex assay. Thus, several separate DNA tests needed to be performed to enable screening of sufficient markers for comprehensive DNA intelligence efforts. In many forensic cases, however, input DNA amounts are substantially limited, restricting the number of independent DNA tests that can be performed. A single multiplex SNP typing tool including a large number of SNPs is therefore needed that allows various elements of DNA intelligence to be inferred, in parallel, from a single forensic DNA aliquot.

In an international, industry–academic collaboration, the International Visible Trait Genetics (VisiGen) Consortium, the Identitas Version 1 (v1) Forensic Chip was developed. The chip, based on well-established Illumina Infinium technology, allows simultaneous genotyping of 192,658 autosomal SNPs of genome-wide distribution, 3,012 Y-chromosomal, 5,075 X-/XY-chromosomal, and 428 mitochondrial SNPs. The genome-wide SNPs were selected primarily for kinship and biogeographic ancestry inference. The panel, though, was enriched with SNPs that were previously established to have predictive value for biogeographic ancestry and several appearance traits, most notably eye and hair color. Herein, the first performance study of the Identitas v1 Forensic Chip is reported based both on data established by consortium members and data from governmental forensic labs in the USA and Canada. A total of 3,196 DNA samples collected from around the world were analyzed. Many of them have recorded sex, continental ancestry, and eye and hair color information, and, in some cases, details of relatedness. The DNA samples were of varying quality and quantity as a result of titration and degradation experiments, and the establishment of mock case-work samples. Genotype quality was assessed, and predictions of sex, biogeographic ancestry, hair color, eye color, and kinship were derived and compared with study-recorded trait data, where available. This study provides the first insights into the performance and feasibility of the Identitas v1 Forensic Chip, the first all-in-one diagnostic tool dedicated to DNA intelligence.

Material and methods

DNA samples and available individual information

A total of 3,196 DNA samples were studied. In part, samples were deliberately drawn from a highly biogeographically diverse set of individuals, in order to investigate the quality of prediction of biogeographic ancestry. For a subset of 2,780 individuals, self-reported or site-reported ancestry information was available. For the purposes of obtaining accuracy estimates, individuals were categorized into five major biogeographic groups, plus another category that included, for example, West Asians and individuals from Oceania, for whom HapMap v3 reference samples were not available. The breakdown of site-reported ancestry, where available, and without masking the inevitable overlap in categorizations, was as follows (count in parentheses). A total of 1,880 individuals were categorized as European descent including the following self-declared or site-reported groups: Adygei (25), Austrian (2), Azerbaijani (38), British (682), Caucasian (46), Chuvash (25), European (380), Georgian (117), German (2), Hungarian (25), Irish (303), Italian (2), Komi (25), Poland (4), and White (204). A total of 240 individuals were categorized as East Asian descent including the following self-declared or site-reported groups: Ami (25), Asian (20), Atayal (25), Cambodian (20), Chinese (1), Hakka (25), Japanese (25), Korean (24), Laotian (25), Micronesian (25), and Yakut (25). Although Micronesia is in the Western Pacific, Micronesians genetically are known to be largely of East Asian ancestry as a result of their migration history (although a Near Oceanian ancestry component exists as well) [55], which justifies their grouping here. A total of 176 individuals were categorized as African descent including the following self-declared or site-reported groups: African (2), African American (3), Barbadian (2), Black (108), Ethiopian (25), Hausa (25), and Jamaican (11). A total of 123 individuals were categorized as South American descent including the following self-declared or site-reported groups: Karitiana (25), Mayan (25), Pima (25), Quechua (23), and Ticuna (25). A total of 31 individuals were categorized as South Asian descent including the following self-declared or site-reported groups: Kerala (25) and South Asian (6). Finally, 330 were categorized outside of these major biogeographic groups: Druze (29), Gimi (15), Inuit (19), Iraq (2), Kazakhstani (57), Khanty (25), Lebanese (8), Mixed Race (34), Nasioi (17), Other (21), Uzbek (78), and Yemeni (25).

The majority of the aforementioned DNA samples came from TwinsUK and the QIMR Twin Registry studies, as well as from the Yale collection as described in detail elsewhere [56–58]. For those samples, DNA was derived from whole blood either directly or from blood-derived lymphoblastoid cell lines. Additionally, a subset of 171 DNA samples was derived from more forensically relevant sources which included (counts in parentheses): hair (2), buccal swab (97), blood swab (9), semen (2), vaginal swab (3), saliva (3), mucus (1), gum (3), drink container swab (8), cigarette butt (10), chap-stick swab (1), swab of tape ends (1), saliva/blood mixture (1), and vaginal swab/semen mixtures (30). Two groups contributed sexual-assault type samples. Twenty-nine samples, consisting of either vaginal or buccal samples from female donors (N = 8), were spiked with varying amounts of semen from male donors (N = 3) and subjected to a standard differential extraction procedure. Results derived using Plexor HY (Promega), a dual autosomal and male-specific quantification assay, indicated that the percentage of male DNA ranged from 100 % to none. One group submitted an additional mixture made up of vaginal swab DNA plus semen.

Sensitivity samples were contributed by two groups. From the first group, four samples were run in a dilution series of total DNA input at 200, 100, 50, 10, 3.3, and 1 ng, leading to a total of 24 samples. The DNA concentration per sample was measured twice using the Quantifiler Human DNA quantification kit (Applied Biosystems) to determine total concentrations of 40 ng/μl down to 200 pg/μl, using a 5-μl sample volume. The second group applied serial dilution to five reference sample extracts of 500 ng to 50 pg total DNA, yielding total DNA concentrations for each sample of 25, 2.5, 0.25, 0.025, and 0.0025 ng/μl. Measurements were taken using Plexor HY (Promega).

Degraded samples were experimentally derived from four pre-extracted DNA samples using (a) titrated DNase treatments and (b) by subjecting samples to ultraviolet (UV) light time courses. Quantified dilutions at concentrations of 20 and 2 ng/μl, in a total sample volume of 110 μl (Quantifiler Human DNA quantification kit, Life Technologies), underwent DNase (Sigma) treatment (0.1 U on each 110-μl volume) for 0, 1, 5, and 10 min with approximately 10 μl of each sample extracted. Thus, three time points and two different concentrations were assessed, for each of the four DNA samples (24 samples altogether). Samples were exposed to UV light for time intervals of 0, 5, 10, and 30 min, using the Bio-Link (Vilber Lourmat) at a strength of 50 J/cm². Three time points for two different concentrations (100 and 10 ng total DNA per 5 μl) were derived for each of four samples (24 samples altogether).

Prior to chip genotyping, DNA samples collated from the different collaborator sites were re-quantified using a PicoGreen-based assay (Life Technologies). Genotyping followed the standard Illumina Infinium iSelect protocol (www.illumina.com). Fluorescence intensities were detected by the Illumina iScan and analyzed using Illumina's BeadStudio software. The reaction volumes were 2 μl for quality checking and 5 μl for genotyping; additional volume was required to allow for pipetting. In the data received from Illumina, samples were categorized as either "passed" or "failed" using the standard control metrics from Illumina, with failure assigned to samples with more than 10 % missing genotype calls.

Statistical analyses

Female-derived DNA was inferred on the basis of X-chromosome heterozygosity; the presence of Y-chromosome genotypes confirmed the presence of male-derived DNA. Biparental biogeographic ancestry predictions were conducted using a marker subset of 81,031 autosomal SNPs exhibiting low correlation (low linkage disequilibrium). Patrilineal ancestry was derived from a subset of 484 Y-chromosomal SNPs, and matrilineal ancestry was derived using a subset of 280 mitochondrial SNPs, for which phylogenetic and geographic origin information was available [59–61]. Principal components analysis (PCA) was conducted using, as a reference, data from HapMap version 3 [10]. Individuals were assigned to continental groups using a simple distance-based clustering algorithm. Model-based clustering, assuming five populations with distinct allele frequencies, was conducted using the same reference set [62, 63]. Hair color and eye color were predicted using multinomial logistic regression models of predictive SNPs as described elsewhere [49, 52], with the difference that the following four hair color-predicting DNA variants from the MC1R gene could not be implemented on the chip: N29insA (INDEL), Y152OCH, rs1805007, and rs1805009; the hair color prediction model used was adjusted accordingly. Degrees of relatedness were inferred for each pair by calculation of the proportion of the genome shared identical by state (IBS) based upon 192,576 autosomal markers with minor allele frequencies greater than 1 %, and less than 5 % missing genotypes [64].

Results

Technical chip performance with relevance for phenotype inference

Genotype reproducibility

Concordance testing using different SNP microarray platforms was performed on 102 QIMR samples genotyped in this study on the Identitas v1 Forensic Chip and previously, at a different laboratory, on the Infinium 610-Q (Illumina) GWAS arrays [65]. Up to 107,262 SNPs directly overlap between both arrays. The observed discordance was 92 out of the total of 10,831,289 genotype calls (rate 0.00085 %, data available on request). Although these data do not show which of the two microarrays produced the bona fide genotype, the high concordance rate of >99.999 % indicates that the reproducibility of genotypes from the Identitas v1 Forensic Chip, at least for samples of similar quality and quantity as the 102 tested here, is very high.

Sensitivity testing

Serial dilutions were conducted by one group for five DNA samples extracted from buccal samples of five individuals of European biogeographic origin. Each was diluted to contain 175, 17.5, 1.75, 0.175, or 0.0175 ng in the 7-μl reaction volume for chip genotyping. For the lowest concentration of 0.0175 ng reaction DNA, all five samples failed platform QC with overall genotype call rate <90 %, none had the full complement of markers for hair and eye color prediction, and only one provided an accurate prediction of biogeographic ancestry (quantitative method, see later). At the next level of DNA concentration, 0.175 ng reaction DNA, one sample passed platform QC, one had the full complement of markers for hair and eye color prediction, and three provided accurate predictions of biogeographic ancestry. At the next level, 1.75 ng reaction DNA, three passed platform QC, three had the full complement of markers for hair and eye color prediction, and all five provided accurate predictions of biogeographic ancestry. At higher concentrations, amounting to 17.5 and 175 ng total DNA, 9/10 passed platform QC, 8/10 had the full complement of markers for hair and eye color prediction, and all 10 provided accurate predictions of biogeographic ancestry.

Serial dilutions were conducted by a second group for four DNA samples extracted from blood of four individuals of European biogeographic origin. Each was diluted to 200, 100, 50, 10, 3.3, and 1 ng in a 5-μl volume. It is noted that for this set, further dilution was required to increase volume prior to genotyping, so the amount of DNA used may be lower. For samples at the lowest concentration (1 ng reaction DNA), all failed platform QC, none had the full complement of markers for hair and eye color prediction, but all four provided an accurate prediction of biogeographic ancestry. At the next level of 3.3 ng reaction DNA, one passed QC, one had the full complement of markers for hair and eye color prediction, and all four provided accurate predictions of biogeographic ancestry. At the higher concentrations of 10, 50, and 100 ng total DNA, all samples passed QC, all had the full complement of markers for hair and eye color, and all provided accurate predictions of biogeographic ancestry.

Despite the preliminary character of the sensitivity testing performed here with small sample sizes, these results demonstrate that biogeographic ancestry may be accurately predicted from as little as 1.75 ng DNA or even, in some cases, as little as 0.175 ng DNA. For other traits, predictive success is dependent on generating sufficient relevant genotypes.

Degradation testing

Twenty-four samples derived from four initial DNA samples were subjected to severe ultraviolet degradation. Upon genotyping, only three passed platform quality checks. They corresponded to three of the 100 ng samples at the first time point, and all had accurate predictions of biogeographic ancestry and sufficient genotypes to allow hair and eye color prediction. The remaining 21 samples had between 18 and 50 % missing genotypes, none had sufficient genotypes to allow hair and eye color prediction, but four (all 10 ng at first time point) provided accurate predictions of biogeographic ancestry.

The same 24 samples were subjected to less severe, enzymatic degradation. Median concentration for the set was <1 ng/μl. Upon genotyping, five samples failed platform quality checks: one at a degradation time of 1 min, two at 5 min, and three at the final time point of 10 min. The five samples had between 11 and 21 % missing genotypes, compared to less than 10 % missing genotypes in the 19 samples which passed platform quality checks. The elevated failure rate in this small degradation subset confirms the damaging effect of nucleases on DNA genotyping performance. A total of 20/24 samples, however, had sufficient genotypes to allow prediction of hair color and eye color, and all 24 enzymatically degraded samples led to accurate estimates of biogeographic ancestry.

Analysis of forensic-type samples

A total of 141 single-source DNA samples extracted from hair, buccal swab, blood swab, semen, vaginal swab, saliva, mucus, gum, drink container swab, cigarette butt, chap-stick swab, and swab of tape ends were examined, to assess performance in more typical case-work samples. DNA concentration was measured by PicoGreen and found to be generally low (median = 2.7 ng/μl). Of these 141 samples, 102 passed QC and had median PicoGreen-based concentration of 3.1 ng/μl (range 0.7–56.6 ng/μl). A total of 102/141 (72 %) samples had sufficient genotypes to allow prediction of hair color and eye color. Biogeographic ancestry was available for 125 of the samples, the majority (85 %) of whom were of European ancestry. A total of 110 out of 125 (88 %) of the predictions were correct on the level of continental ancestry. There were no clear differences in performance among sample types; correct predictions were obtained from blood swab, buccal swab, semen, vaginal swab, mucus, gum, drink container swab, and hair.

In addition, a total of 30 multisource DNA samples from simulated sexual assault material extracted after differential lysis were examined. Again, DNA concentration tended to be low (median = 1.6 ng/μl, range 0.8–61.4 ng/μl). A total of 19/30 samples passed quality checks; however, upon unblinding at source, it emerged that only 13 of them contained male DNA. For all 13 samples, the Y-chromosome haplogroup was obtained and their assumed geographic region of origin was found to be consistent with site-reported biogeographic ancestry.

Chip-based inference of sex, ancestry, appearance, and relatedness

Across the whole study, a total of 3,034 (95 %) samples passed platform quality checks with overall genotype call rates >90 %, while 162 samples (5 %) failed this threshold. In the following sections, results are presented of the analysis of the 3,034 DNA samples that passed quality control checks.

Inference of sex

A two-pronged approach was taken for chip-based prediction of whether the DNA used was derived from a man or a woman. Y-chromosome haplogroups, derived from known non-recombining male-specific SNPs, were obtained for 1,114 DNA samples, indicating that they were derived from males. In addition, X-chromosome heterozygosity was determined for all samples on the basis of 5,066 X-chromosome-specific markers (without a homologue on the Y chromosome), with an estimated mean heterozygosity of <0.2 implying male-derived DNA and an estimated mean heterozygosity of >0.8 implying female-derived DNA [64]. Intermediate values were considered inconclusive. Twelve conflicts between chip-predicted and site-reported sex information were obtained in the 1,588 samples with site-reported sex information available. Four samples were predicted to be derived from males on the basis of both identified Y-haplogroup and low X-chromosome heterozygosity but were unblinded as being derived from females from the records. Seven samples were predicted to be derived from females on the basis of no inferable Y-haplogroup and high X-chromosome heterozygosity but were unblinded as being derived from males from records. Further nongenetic sex data could not be obtained for these 11 individuals. A plausible explanation, however, is that DNA mix-ups at some stage may have occurred for these samples, which would correspond to a male–female sample mix-up rate of 0.69 % in our study. One of the 1,588 samples, unblinded as female, was wrongly predicted to be male on the basis of low X-chromosome heterozygosity alone but had no Y-chromosome haplogroup inferable. This sample would have represented a sex misclassification based solely on the X-chromosome heterozygosity approach, emphasizing the importance of using data from both X- and Y-chromosome SNPs for inferring sex from these chip data.

Inference of continental biogeographic ancestry

A three-dimensional PCA plot of the HapMap 3 reference data [10] led to the clustering of individual samples into five, somewhat separated, main descent groups: European descent (CEU, TSI), African descent (ASW, LWK, MKK, YRI), East Asian descent (CHB, CHD, JPT), South Asian descent (GIH), and South American descent (MEX) (Fig. 1). This reference dataset was then used to classify the study samples according to their biparental continental biogeographic ancestry via a distance-based clustering algorithm. For example, the black cross in Fig. 1 represents an individual sample, unblinded as "White" classified by self-reported ancestry information, which appears very close to the European-descent HapMap reference samples (CEU and TSI), highlighting the most likely European biparental genetic ancestry of this individual. Self- or site-reported biogeographic ancestry on the continental level was available for 2,688 out of 3,034 samples. Overall, PCA clustering led to good prediction accuracy for certain categories of biparental genetic ancestry. Predictions of European ancestry were 97 % consistent with self-declared or site-reported ancestry, predictions of African ancestry were 88 % consistent with self/site report, and predictions of East Asian ancestry were 97 % consistent with self/site report. Predictions of South Asian ancestry, however, were lower (69 % consistent with self/site report), while predictions of South American ancestry were poor. Specifically, 39 % (52/134) of the predictions of South American ancestry were self-/site-categorized outside of the main five major biogeographic ancestry groups (they were 19 Khanty, 9 mixed race, 9 Other, 1 Lebanese, 1 Kazakhstan, 5 Uzbekistan, and 8 Inuit Aborigine).

A quantitative approach was therefore pursued, which led to the assignment of a probability for each of the five reference groups, allowing more accurate inferences to be drawn. By this quantitative approach, samples were assigned to a single continental ancestry group whenever the probability for that group was greater than 0.70. When the maximum probability for any single continental group was ≤0.70, the sample was assigned to "multiple groups." In this way, 89 % of samples were assigned to a single continental or subcontinental group; the remainder were assigned to multiple defined groups. Details of the accuracy of these predictions are given in the following paragraphs.

Fig. 1 Three-dimensional principal component analysis plot for a DNA sample from a European individual denoted by a black cross, together with reference data from HapMap v3, i.e., from individuals of European descent (CEU Utah residents with Northern and Western European ancestry from the CEPH collection; TSI Tuscans in Italy), African descent (ASW African ancestry in Southwest USA; LWK Luhya in Webuye, Kenya; MKK Maasai in Kinyawa, Kenya; YRI Yoruba in Ibadan, Nigeria), East Asian descent (CHB Han Chinese in Beijing, China; CHD Chinese in Metropolitan Denver, CO; JPT Japanese in Tokyo, Japan), South Asian descent (GIH Gujarati Indians in Houston, TX), and South American descent (MEX Mexican ancestry in Los Angeles, CA). Axes: First, Second, and Third Principal Component

Out of a total of 1,877 predictions of European ancestry, 93 % were correct as given by self-report/site report of Adygei, Austrian, Azerbaijani, British, Caucasian, Chuvash, European, German, mixed German/Polish, Georgian, Hungarian, Irish, Komi, Polish, or White (meaning likely European). Of the 128 individuals whose predicted European ancestry was not clearly consistent with self-report/site report, 26 were Druze, 24 were Yemenite, 15 were mixed Italian/Greek/Syrian, 15 were Uzbeks, 13 were Kazakhstani, 11 were Inuit, 8 were Lebanese, 4 were mixed race, 4 were "Other," 3 site-reported as African-American, 2 were Iraqi, 1 was Japanese, 1 was Quechua, and 1 was South Asian. Y-chromosome haplogroups were obtained for 52 out of these 128 samples that showed inconsistency between ancestry group inferred from the genome-wide SNPs data and self-/site-recorded ancestry information. Y haplogroups for 49 of these samples are found in people of European paternal origin. Furthermore, the majority of these 128 samples (N = 102), including three site-reported African-Americans, had mitochondrial haplogroups which indicated Western Eurasian maternal origin. These Y and mtDNA results, together with the genome-wide results, suggested that despite self-/site-reported ancestry information, many of these 128 individuals share a considerable proportion of their genetic ancestry with Europeans, likely due to recent European admixture. Some may represent cases where continental ancestry is difficult to infer from genetic data because the individuals descended from a geographic area located between major geographic regions (such as from the Middle East or western parts of Asia).

Out of a total of 233 predictions of East Asian ancestry, 94 % were correct as given by self-/site-reported ancestry of Ami, Asian, Atayal, Cambodian, Chinese, Hakka, Japanese, Korean, Laotian, Micronesian, or Yakut. Of the 14 individuals whose predicted East Asian ancestry was not clearly consistent with self-report/site report, one was site-reported as British, two were Gimi, one was Inuit, one was mixed (Italian/Greek/Syrian), one was Kazakh, one was Uzbek, three were Nasioi, and four were Other. The mitochondrial haplogroups for these individuals, except the Gimi and the Nasioi, indicated Eastern Eurasian maternal ancestry suggesting, together with the genome-wide results, considerable East Asian admixture. Six Y-chromosome haplogroups were obtained and all indicated East Asian, Southeast Asian, or Oceanic origin. The finding that mitochondrial and Y-chromosome haplogroups correctly assign Oceanic origin in the Gimi and Nasioi may illustrate the limitations in the resolution level of the genome-wide data in terms of separating Oceanians from East Asians using our approach. This is likely because Oceanic samples are missing in the HapMap 3 reference data used as reference for this analysis.

Of the total of 145 predictions of African ancestry, 88 % were correct as given by self-/site-reported ancestry of African, African-American, Barbadian, Ethiopian, Hausa, or Jamaican. Of the 17 individuals whose predicted African ancestry was not clearly consistent with self-report/site report, 15 were site-reported as "British," 1 site-reported as Other, and 1 site-reported as White. Without exception, all these 17 samples carried African mitochondrial haplogroups indicating, together with the genome-wide results, considerable African admixture in these samples. Furthermore, all of the four males had African Y haplogroups.

Of the total of 107 predictions of South American ancestry, 98 % were correct as given by self-/site-reported ancestry of Karitiana, Mayan, Pima, Quechua, or Ticuna. Both of the individuals whose genome-wide predicted South American ancestry was not clearly consistent with self-/site-reported ancestry and who classified themselves as Other had mitochondrial haplogroups of Eastern Eurasian origins. Both were female, so Y-chromosome haplogroups were not available.

Of the 24 predictions of South Asian ancestry, 96 % were correct as given by self-/site-reported ancestry of Keralite, mixed Indian/Pakistani, or South Asian. The individual whose genome-wide predicted South Asian ancestry was not consistent with self-report/site report was site-reported as British and carried a Western Eurasian mitochondrial haplogroup. This individual was female, so a Y-chromosome haplogroup was not available.

Our approach was insightful also in terms of genetically classifying individuals of mixed continental ancestry. Unfortunately, only a handful of such individuals were unblinded with details of their parental origin. Taking a single example, Fig. 2 shows the quantitative assessment of an individual whose father was of African descent and whose mother was of European descent according to the record information. The almost equal proportions of African and European DNA were accurately captured by the method. Furthermore, the determination of the Y-chromosome haplogroup as E-U290, which is observed in Africa, Western Asia, and Europe, and the mitochondrial haplogroup HV, which is observed in Western Eurasia, indicated that the paternal line is African while the maternal line is European, in agreement with record-based ancestry information.

Further useful insights were obtained. Figure 3 shows the box plot for 24 individuals of Ethiopian descent. There is a substantial European component compared with the mainly West African reference samples, indicating the genetically admixed situation of North Africans. Similarly, Fig. 4 shows the box plot for 57 individuals from Kazakhstan, which lies on the silk route between China and Europe. It can be seen that individuals who come from a location that is intermediate between the origins of the reference groups may be indistinguishable from an individual of mixed-race origin, using these first-pass techniques. The ongoing development of methods to estimate, for example, LD block size in individuals of mixed ancestry may elucidate the matter further.

Inference of eye and hair color categories

Table 1 shows the breakdown of chip-predicted versus site-reported eye color for samples which passed platform quality checks and had both site-reported eye color and a complete genotype profile for the six SNPs required (N = 1,136). It can be seen that 70 % of predictions of blue eyes and 85 % of predictions of brown eyes agreed with site-reported eye color using the p > 0.7 threshold recommended previously [50].

Table 2 shows the breakdown of predicted versus site-reported hair color for samples which passed platform quality checks and had both site-reported hair color and the complete genotype profile of the 18 SNPs required (N = 1,137). It can be seen that using the previously developed prediction guide [52], 58 % of predictions of black/dark brown hair corresponded to site report of black, dark brown, or brown hair; 72 % of predictions of brown/light brown/dark

Fig. 2 Quantitative assessment of biogeographic ancestry from an individual whose father was of African origin and whose mother was of European origin. Vertical axis: posterior probability; horizontal axis: Africa, Europe, E Asia, S Asia, S America

Fig. 3 Box plot for quantitative assessments of biogeographic ancestry for 24 Ethiopian individuals. Vertical axis: posterior probability; horizontal axis: Africa, Europe, E Asia, S Asia, S America
highly outbred populations. For populations which are, or have been in the past, genetically isolated, there will be an 8 0. overprediction ofdistantrelatives. y bilit a 6 ob 0. Pr erior 0.4 Discussion st o P The Identitas v1 Forensic Chip is the first all-in-one 2 0. diagnostic tool targeted for DNA intelligence purposes, allowing for massively parallel genome-wide inference of 0 0. ancestry, appearance, relatedness, and sex. This DNA Africa Europe E Asia S Asia S America chip, manufactured by Illumina using their well-established Fig.4 Boxplotforquantitativeassessmentsofbiogeographicancestry Infinium technology, is able to deliver highly reproducible for57individualsfromKazakhstan genotypes as indicated by the >99.999 % genotyping agree- mentachievedinanindependentcomparisonwiththeInfin- blondecorrespondedtositereportofdarkbrown,brown,light ium 610-Q (Illumina) GWAS array. v1 of the Identitas brown, or dark blonde hair; 63 % of predictions of blonde/ ForensicChipexhibited,fromsamplesthatpassedthequality darkblondecorrespondedtosite report oflight brown,dark control threshold of >90 % overall genotype call rate, high blonde, or blonde hair; and 48 % of predictions of red hair predictive power for inference of sex, continental biogeo- correspondedtositereportofredhair. graphic ancestry, individual relatedness up to third-degree level, and somewhat less power also for eye and hair color. Inference ofrelatedness Evenwith<90%callrate,highpredictiveassignmentsuccess wasachievedforcontinentalancestry,althoughthenumberof TheproportionofthegenomesharedIBSwasestimatedfor incorrectinferencesincreaseswithdecreasingoverallcallrate. all pair-wise combinations of the 3,034 samples, using The power of prediction of continental biogeographic 192,576 markers. 
For themajority ofthe samples however, ancestry may be explained by the large number of array thetruerelationshipswerenotknown/site-reported.Thetrue SNPs used for inference, and the partial redundancy of relatedness of samples was available from one source, for ancestry information across such SNPs. This also was true which3,240pair-wisecomparisonshadbeenperformedfor forthechip-basedinferenceofrelatedness,whichwasbased 81 samples. Inthis set, all 27 first-degree relativepairs,ten on an even larger number of SNPs, and for the sex- second-degree relative pairs (four uncles/aunts, five grand- inference, based on >5,000 X-chromosomal SNPs. The parents, and one half-sibling), three third-degree relative situation is very different for chip-based eye and hair color pairs (two first cousin pairs and one great aunt), and 3,199 prediction, where only a small number of particular nonre- unrelatedpairswerecorrectlyidentified.Oneadditionalpair dundantSNPswithhighpredictivevaluewereused,i.e.,six of individuals was observed to share 9 % of the genome foreyecolorpredictionand18forhaircolorprediction.As IBS, in line with a fourth-degree relationship (e.g., first long as these particular SNPs are genotyped correctly, eye cousinonceremoved).Sitereportforthetwowasunrelated, and hair color prediction can be achieved. The overall call but both were of Caribbean origin. 
It is acknowledged that rate therefore provides only a first indication of the chip’s thepredictionoffourth-degreerelationshipsisonlyvalidin practicalperformanceonancestry,relatedness,sex,andeye/ Table1 Chip-predictedversussite-reportedeyecolorfor1,136individualswithchipandrecordinformationavailable Predictedeyecolora Site-reportedeyecolor Total Accuracyoverall Accuracyp>0.7threshold Blue(%)b Intermediate(%)c Brown(%)d Blue 428(63) 205(30) 50(7) 683 63% 70% Intermediate 4(29) 7(50) 3(21) 14 50% 0 Brown 21(5) 105(24) 313(71) 439 71% 85% aUsingthepredictionmodeldescribedelsewhere[49,50] bIncludesblue,blue-gray,andgray, cIncludesheterochromia,blue-green,green,green-hazel,gray-green,yellow,andintermediate dIncludeshazelandbrown 568 IntJLegalMed(2013)127:559–572 Table2 Chip-predictedversus site-reportedhaircolorfor1,137 Predictedhaircolora Site-reportedHairColor Total individualswithchipandrecord informationavailable Black(%) Darkbrown/ Lightbrown/dark Blonde(%) Red(%) brown(%) blonde(%) Black/darkbrown 70(10) 351(48) 138(19) 41(6) 133(18) 733 Brown/lightbrown 3(3) 44(39) 37(33) 13(12) 15(13) 112 aUsingthepredictionmodeland Blonde/darkblonde 6(3) 50(23) 76(35) 61(28) 26(12) 219 thepredictionguidedescribed Red 1(1) 17(23) 14(19) 6(8) 35(48) 73 elsewhere[52] haircolorpredictionbutshouldnotbeviewedascategorical The estimates of prediction accuracy for eye and hair go/no gocriteria. color obtained here were lower than those previously Developing a detailed knowledge about the direct and reported with the IrisPlex system for eye color [49, 50] ideal relationship between DNA quantity and quality, chip and the HIrisPlex system for hair color prediction [52]. 
performance,andtheaccuracy ofancestry,relationshipand Whereas all six IrisPlex SNPs together with the IrisPlex eye/haircolorinferencerequiresadditionaltailoreddatasets eyecolorpredictionmodel[49,50]wereappliedherewith- tobegeneratedinthefuture.Onetechnicalcomplicationin out modification, only 18 of the previously reported 22 the current study was the limited availability of PCR-based HIrisPlex DNA variants for hair color prediction [48, 52] sample quantification prior to genotyping. For the majority couldbeimplementedintheIdentitasv1ForensicChip.An of samples, PicoGreen provided the only measure of DNA adjustedhaircolorpredictionmodelwasthereforeused.The concentrationavailable.AlthoughPicoGreenisknowntobe four DNA variants missing on the chip lay in the MC1R reliable in measuring higher DNA concentrations, i.e., tens gene and are known to be predictive of both red hair and of nanograms per microliter and beyond, it tends to be less dark hair [48, 52]. Indeed, nearly 60 % of individuals with accurate in the single nanogram and sub-nanogram per site-reported red hair were missed. This strongly contrasts microliterrangeencounteredinmanyforensicapplications. with only 14 % instances of red hair missed using the Similarly,thetoleranceoftheIdentitasv1ForensicChipin HIrisPlex system containing all 22 DNAvariants [52]. The using degraded DNA for successful inference of ancestry, mis-categorization of those individuals with red hair had a appearance, and relatedness needs to be followed up with knock-oneffectontheaccuracyofotherpredictedhaircolor more extensive testing. 
Based on our preliminary data, the categories.Secondly,incontrasttothepreviouseyeandhair chip genotyping of partially degraded DNA still allowed color prediction studies that solely or mainly used single high predictive value for biogeographic ancestry, but more grader color classifications [48–50, 52], the current study dataareneededtodevelopcleardegradationthresholds.As usedself-reportedeyeandhaircolor.Theinevitablesubjec- with DNA quantity, the large number of SNPs used for tivity of self-report means that the estimates achieved here ancestry, relatedness, and (less so) inference of sex worked are likely to be conservative. For instance, in two different infavor of thechip when dealing with degraded DNA. European studies where single-grader eye color phenotyp- Onechallengeinconductingouranalyseswastheconsol- ing was applied, the intermediate eye color category was idation of data from multiple sources, each with different observedatfrequenciesof9.6%[43]and14%[50].These conventions for recording biogeographic ancestry informa- valuesareconsiderablylowerthanthe35%ofself-reported tion.Thelargegeographiccategoriesdescribedherearenec- eye colors that fall into the intermediate category from the essarily simplifications and the accuracy estimatesare likely present study. Previous studies demonstrated that the inter- to be conservative. For instance, a total of 682 individuals mediateeye color category with thesix IrisPlexSNPs used were British by self-report/site report and it was clear, upon here cannot be predicted with as high accuracy as blue and examination of haplogroups, that this term was, in some brown eyes [43, 49, 50]. The inflation of the intermediate instances, intended in the sense of “nationality” rather than eye color category is caused by self-reporting errors and biogeographic origin. Our genome-wide analysis however also has an impact on the estimated prediction accuracies groupedthemwithindividualsoftrueEuropeanoriginwhich for blue and brown. 
Notably, if we exclude the self- consequentlyloweredtheaccuracyrateobtainedforEuropean reported intermediate eye color individuals from the pre- ancestryassignment.Othersimplificationsincludedthecate- diction analysis, we receive much higher accuracies at gorization of Mayans with South Americans, as well as the 90 % for blue and 94 % for brown eye color. These categorizationofMicronesianswithEastAsians.Themotiva- values are similar to the 94 % accuracy for blue and tion in providing such wide groupings was to maximize the brown achieved in the previous study using single-grader useoftheavailabledata. eye color phenotypes [50].
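
The effect of excluding self-reported intermediate eye color can be reproduced from the counts in Table 1. The following is a minimal Python sketch and not the analysis code used in the study. The names `TABLE1` and `prediction_accuracy` are illustrative assumptions, and "accuracy" is assumed here to mean the share of predictions in a category that agree with site report.

```python
# Counts from Table 1: outer keys are chip-predicted eye color,
# inner keys are site-reported eye color.
TABLE1 = {
    "blue":         {"blue": 428, "intermediate": 205, "brown": 50},
    "intermediate": {"blue": 4,   "intermediate": 7,   "brown": 3},
    "brown":        {"blue": 21,  "intermediate": 105, "brown": 313},
}


def prediction_accuracy(table, predicted, exclude_reported=()):
    """Percentage of `predicted` calls whose site report agrees.

    Site-reported categories listed in `exclude_reported` are dropped
    from the denominator (hypothetical helper, for illustration only).
    """
    row = {k: v for k, v in table[predicted].items()
           if k not in exclude_reported}
    return 100.0 * row[predicted] / sum(row.values())


# Overall accuracy per predicted category (the "Accuracy overall" column).
overall = {c: prediction_accuracy(TABLE1, c) for c in TABLE1}

# Accuracy for blue and brown after removing individuals whose
# site-reported eye color was intermediate.
no_intermediate = {
    c: prediction_accuracy(TABLE1, c, exclude_reported=("intermediate",))
    for c in ("blue", "brown")
}

for colour, pct in overall.items():
    print(f"{colour}: {pct:.1f} % overall")
for colour, pct in no_intermediate.items():
    print(f"{colour}: {pct:.1f} % excluding self-reported intermediate")
```

Rounded to whole percentages, the sketch gives 63 %, 50 %, and 71 % for the overall column and 90 % (blue) and 94 % (brown) once self-reported intermediate eye color is excluded, in line with the values quoted in the text.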