Okut et al. Genetics Selection Evolution 2013, 45:34
http://www.gsejournal.org/content/45/1/34

RESEARCH  Open Access

Predicting expected progeny difference for marbling score in Angus cattle using artificial neural networks and Bayesian regression models

Hayrettin Okut1,2*, Xiao-Lin Wu1,3, Guilherme JM Rosa1,4, Stewart Bauck5, Brent W Woodward6, Robert D Schnabel7, Jeremy F Taylor7 and Daniel Gianola1,3,4

Abstract

Background: Artificial neural networks (ANN) mimic the function of the human brain and are capable of performing massively parallel computations for data processing and knowledge representation. ANN can capture nonlinear relationships between predictors and responses and can adaptively learn complex functional forms, in particular, for situations where conventional regression models are ineffective. In a previous study, ANN with Bayesian regularization outperformed a benchmark linear model when predicting milk yield in dairy cattle or grain yield of wheat. Although breeding values rely on the assumption of additive inheritance, the predictive capabilities of ANN are of interest from the perspective of their potential to increase the accuracy of prediction of molecular breeding values used for genomic selection. This motivated the present study, in which the aim was to investigate the accuracy of ANN when predicting the expected progeny difference (EPD) of marbling score in Angus cattle. Various ANN architectures were explored, which involved two training algorithms, two types of activation functions, and from 1 to 4 neurons in hidden layers. For comparison, BayesCπ models were used to select a subset of optimal markers (referred to as feature selection), under the assumption of additive inheritance, and then the marker effects were estimated using BayesCπ with π set equal to zero. This procedure is referred to as BayesCpC and was implemented on a high-throughput computing cluster.

Results: The ANN with Bayesian regularization method performed equally well for prediction of EPD as BayesCpC, based on prediction accuracy and sum of squared errors. With the 3K-SNP panel, for example, prediction accuracy was 0.776 using BayesCpC, and ranged from 0.776 to 0.807 using BRANN. With the selected 700-SNP panel, prediction accuracy was 0.863 for BayesCpC and ranged from 0.842 to 0.858 for BRANN. However, prediction accuracy for the ANN with scaled conjugate gradient back-propagation was lower, ranging from 0.653 to 0.689 with the 3K-SNP panel, and from 0.743 to 0.793 with the selected 700-SNP panel.

Conclusions: ANN with Bayesian regularization performed as well as linear Bayesian regression models in predicting additive genetic values, supporting the idea that ANN are useful as universal approximators of functions of interest in breeding contexts.

*Correspondence: [email protected]
1 Department of Animal Sciences, University of Wisconsin, Madison, WI 53706, USA
2 Department of Animal Science, Biometry and Genetics Branch, University of Yuzuncu Yil, Van 65080, Turkey
Full list of author information is available at the end of the article
Background

The availability of genome-wide dense marker panels for many species of plants and animals has opened doors for incorporating genomic information into practical breeding programs, an approach known as genomic selection [1]. It is now easy to generate genome-wide scans with more than one million SNPs (single nucleotide polymorphisms), but these huge databases pose challenges in computational capacity, data analysis and interpretation of results for genomic selection [2]. For example, even with an initial screening to reduce the number of markers to less than ten thousand SNPs, it is still not feasible for most computational platforms to evaluate all combinations of SNP associations, even when low-dimension interactions are explored [3]. Hence, reduction of dimensionality and feature extraction arguably play pivotal roles in current genomic studies [4]. The intensive computation inherent in these problems has altered the course of methodological developments, and the same is true for genomic selection [5].

In the genome-enabled prediction of genetic merit for breeding purposes, parametric statistical methods tend to make strong assumptions about functional forms and the statistical distribution of marker effects. On the one hand, ridge regression best linear unbiased prediction assumes that all markers have an effect on the trait of interest, and that these effects share a common variance in their distribution. This simple assumption is obviously not true in reality. On the other hand, hierarchical linear Bayesian regression models, such as BayesA and BayesB [1], allow marker effects to be estimated with differential shrinkage. However, posterior inference, in particular for the variances of marker effects, depends heavily on the prior assumptions used in these models [6]. Hence, BayesCπ [7] was proposed to overcome some of the above mentioned drawbacks. A BayesCπ model postulates an unknown probability π that a SNP has no effect at all on the trait, while all non-zero SNP effects are assumed to be random samples from a normal distribution with null mean and a common variance. Recently, there has been interest in the use of non-parametric methods for the prediction of quantitative traits, such as reproducing kernel Hilbert space regressions [8,9], radial basis function models [10] and non-parametric Bayesian models with Dirichlet process priors [11]. These nonparametric methods make weaker assumptions and can be more flexible for describing complex relationships [12].
Artificial neural networks (ANN), also known as neuro-computational models, provide an appealing alternative for genome-enabled prediction of quantitative traits [13,14]. These machine learning methods can act as universal approximators of complex functions [15] because they are capable of capturing nonlinear relationships between predictors and responses and can adaptively learn complex functional forms, through a series of transformations (i.e., activation functions) driven by parameters. Multilayer feed-forward is the most common architecture used in ANN, which consists of neurons assembled into layers. The first layer is called the input layer (the left-most layer of the ANN) that accepts data (e.g., SNP genotypes) from sources external to the network, and the last layer (the right-most layer) is called the output layer that contains the output units of the network. In between these two layers are so-called hidden layers, because their values are not observed in the training set. Hence, an ANN architecture is specified by the number of layers, the number of neurons in each layer, and the type of activation functions that are used. Neurons in each layer are connected to the neurons from the previous and the subsequent layer through adaptable synaptic weights.

The network weights are evaluated in two steps: the feed-forward and the back-propagation steps. In the feed-forward step, information comes from the input layer and each unit evaluates its activation function, which is transmitted to the units connected to the output layer. The back-propagation step consists of running the whole network backwards to minimize the error function in the space of weights using the method of gradient descent. A set of weights that minimizes the error function is considered to be a solution of the learning problem for the ANN.

Determination of the number of neurons to be placed in the hidden layer is a critical task in the design of ANN. A network with an insufficient number of neurons may not be capable of capturing complex patterns. In contrast, a network with too many neurons will suffer from over-parameterization, leading to over-fitting and poor predictive ability for yet to be observed data. Two popular techniques to overcome the over-fitting problem in ANN models are Bayesian regularization and cross-validated early stopping (CVES). Bayesian regularization (BR) constrains (shrinks) the magnitude of the network weights, improves generalization and allows bias in parameter estimates towards values that are considered to be plausible, while reducing their variance; thus, there is a bias-variance trade-off [13]. Unlike other back-propagation training algorithms that use a single set of parameters, BR considers all possible values of the network parameters weighted by the probability of each set of parameters. In practice, Bayesian regularized ANN can be computationally more robust and cost-effective than standard back-propagation nets because they can reduce or eliminate the need for lengthy cross-validation.

With cross-validated early stopping (CVES) methods, the training data set is split into a new training set and a validation set, and a gradient descent algorithm is applied to the new training data set. The ANN performs an iterative process, first using the training data set to calculate the weight and bias estimates, and then applying these parameter estimates to the validation data set to calculate the prediction errors. The process iterates repeatedly, substituting parameter estimates from the training data set into the validation data set to find the smallest possible average prediction errors for the validation data set. Training ceases when the error in the validation data set increases over a certain number of consecutive epochs (iterations), in order to avoid the problem of over-fitting (the number of consecutive epochs is 6 by default in MATLAB). The ANN parameter estimates with the best performance in the validation set are then used on the testing data to evaluate the predictive ability of the network.
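The early-stopping rule just described can be illustrated with a short Python sketch (ours, not the MATLAB implementation used in this study); train_one_epoch, validation_sse, get_weights and set_weights are hypothetical stand-ins for one gradient-descent sweep, the validation-set error and access to the network parameters.

import numpy as np

def train_with_early_stopping(net, train_set, valid_set, max_epochs=1000, patience=6):
    """Cross-validated early stopping: keep the weights with the lowest validation
    error and stop after `patience` consecutive epochs without improvement."""
    best_error, best_weights, bad_epochs = np.inf, net.get_weights(), 0
    for epoch in range(max_epochs):
        net.train_one_epoch(train_set)          # one gradient-descent sweep (hypothetical API)
        error = net.validation_sse(valid_set)   # sum of squared errors on the validation set
        if error < best_error:
            best_error, best_weights, bad_epochs = error, net.get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:          # 6 consecutive epochs by default in MATLAB
                break
    net.set_weights(best_weights)               # parameters with the best validation performance
    return net

The weights retained are those with the best validation performance, which are then carried to the testing data, as described above.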
In a previous study, ANN with BR outperformed a benchmark linear model when predicting milk yield in dairy cattle or grain yield of wheat [14]. However, because breeding values are defined in terms of linear functions based upon additive inheritance, the predictive performance of ANN relative to linear additive systems is of some interest. This motivated the present study, in which the aim was to investigate the accuracy of ANN for predicting expected progeny differences (EPD) for marbling score in Angus cattle relative to hierarchical linear Bayesian regression models. Various ANN architectures were explored, which involved two training algorithms, two types of activation functions, and from 1 to 4 neurons in hidden layers.

Methods

Datasets

The data contained 3079 registered Angus bulls, genotyped with the Illumina BovineSNP50 BeadChip, from which the SNP content for the Illumina Bovine3K BeadChip [16] was extracted. After data quality control and screening, a total of 2421 polymorphic SNPs were retained for analysis. The target variable to be predicted was the EPD for marbling score, which had been computed by the American Angus Association using BLUP (best linear unbiased prediction) for each of these animals, based upon their pedigree data and progeny carcass and ultrasound data [17]. In animal breeding, an EPD is defined as the predicted performance of a future offspring of an animal for a particular trait (here, marbling score), calculated from measurement(s) of the animal's own performance and/or the performance of its relatives under the assumption of additive inheritance. Hence, EPD represent a typical linear system, since the EPD of an individual can be represented as a linear function of the EPD of relatives and a residual term that reflects the fact that an individual inherits a random sample of the genes present within its parents. The distribution of marbling score EPD in the Angus sample was symmetric and suggested a normal distribution (Figure 1), with mean and standard deviation estimated at 0.0265 and 0.254, respectively [18].

[Figure 1: Histogram and density plot of deregressed expected progeny differences for marbling score for 3079 Angus cattle. Summary statistics shown in the figure: mean 0.0265, standard deviation 0.254, skewness 0.468, kurtosis 0.356, range 1.920, interquartile range 0.330.]

Statistical methods

Artificial neural networks

A schematic presentation of a multilayer perceptron (MLP) feed-forward ANN is given in Figure 2. This is a multi-layer feed-forward neural network with a single middle (hidden) layer and four neurons. Training an ANN involves the minimization of an error function that depends on the network's synaptic weights (the w's in Figure 2), and these weights are iteratively updated by a learning algorithm, to approximate the target variable.

[Figure 2: Schematic representation of a three-layer feed-forward neural network: an input layer (SNP1, …, SNP2421), a hidden layer with 4 neurons and hyperbolic tangent activation, and an output layer with 1 neuron and linear activation. Genotypes for 2421 (or 700) SNPs were used as inputs x_j = {x_ij | i = 1, 2, …, n}, where n is the number of individuals with genotypes; each SNP was connected to up to 4 neurons via coefficients w_kj, where k denotes neuron and j denotes SNP; here, w_k is a weight from a hidden layer unit to the output unit, f_k is an activation function applied to hidden layer units (e.g., the hyperbolic tangent), g is an activation function applied to the output layer unit (e.g., linear), b^(1) and b^(2) are biases of the hidden and output layer units, and ŷ is a predicted value.]
The updating is usually accomplished by back-propagating the error, which is essentially a non-linear least-squares problem [19]. Back-propagation is a supervised learning algorithm based on a suitable error function, the values of which are determined by the target (i.e., marbling EPD) and the mapped (predicted) outputs of the network (i.e., fitted values of marbling EPD). Weights in an MLP architecture are determined by a back-propagation algorithm to minimize the sum of squares of errors using gradient-descent methods [13]. During training, weights and biases in the ANN are successively adjusted based on the input and the desired output. Each iteration of feed-forward in an MLP constitutes two sweeps: forward activation to produce a desired output, and backward propagation of the computed error to update the values of the weights and biases. The forward and backward sweeps are repeatedly performed until the ANN solution agrees with the desired value to within a pre-specified tolerance [20,21].

Let x'_j = (x_{1j}, x_{2j}, …, x_{nj}) be a row vector that contains the SNP genotypes of all i = 1, …, n individuals for the jth SNP, where j = 1, …, p, and p = 700 (referred to as the 700-SNP panel) or p = 2421 (referred to as the 3K-SNP panel). In an ANN, SNP genotypes are connected to each neuron in a single hidden layer via weights w_{kj}, for k = 1, …, K neurons, with each weight defining a specific SNP-neuron connection, with appropriate biases (intercepts) b_1^{(1)}, b_2^{(1)}, …, b_K^{(1)}, each pertaining to a specific neuron. The input into neuron k, prior to activation, is expressed linearly as b_k^{(1)} + \sum_{j=1}^{p} w_{kj} x_j, where p = 700 or 2421, and b_k^{(1)} is the bias parameter defined in the hidden layer; this quantity is then transformed using some linear or nonlinear activation function (f_k) as:

f_k\left( b_k^{(1)} + \sum_{j=1}^{p} w_{kj} x_j \right).    (1)

The above is the output of the hidden layer, which in turn is delivered to the output layer (e.g., each neuron k in the hidden layer sums the ANN inputs x_j after multiplying them by the strengths of the respective connection weights, w_{kj}, and adds the bias b_k, and then computes its output as a function of the sum), and is collected as:

b^{(2)} + \sum_{k=1}^{K} w_k f_k\left( b_k^{(1)} + \sum_{j=1}^{p} w_{kj} x_j \right),    (2)

where w_k is the weight specific to the kth neuron (k = 1, 2, …, K), and b^{(2)} is the bias parameter defined in the output layer. Next, quantity (2) is activated with the following function:

g(\cdot) = g\left[ \sum_{k=1}^{K} w_k f_k(\cdot) + b^{(2)} \right].    (3)

The above yields the fitted marbling EPD, ŷ, for each individual in the training set. Finally, the predicted value of marbling score EPD can be computed in the testing set using a formula similar to (3). We used the hyperbolic tangent sigmoid and linear (identity) activation functions in the hidden and output layers, respectively.
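Equations (1) to (3) reduce to two matrix operations. The following numpy sketch (an illustration under the notation above, not the authors' MATLAB code) computes fitted EPD for a genotype matrix with K hidden neurons, hyperbolic tangent activation in the hidden layer and a linear output unit; the toy dimensions and random weights are only for demonstration.

import numpy as np

def mlp_forward(X, W1, b1, w2, b2):
    """Three-layer feed-forward network of equations (1)-(3).
    X  : n x p matrix of SNP genotypes (0/1/2)
    W1 : K x p weights w_kj from SNPs to the K hidden neurons
    b1 : K hidden-layer biases b_k^(1)
    w2 : K weights w_k from hidden neurons to the output unit
    b2 : scalar output-layer bias b^(2)
    """
    hidden = np.tanh(X @ W1.T + b1)   # equation (1): f_k(b_k^(1) + sum_j w_kj x_j), tanh activation
    y_hat = hidden @ w2 + b2          # equations (2)-(3): linear output g(.)
    return y_hat

# toy usage with random genotypes and small random weights
rng = np.random.default_rng(0)
n, p, K = 5, 2421, 4
X = rng.integers(0, 3, size=(n, p)).astype(float)
W1, b1 = 0.01 * rng.standard_normal((K, p)), np.zeros(K)
w2, b2 = rng.standard_normal(K), 0.0
print(mlp_forward(X, W1, b1, w2, b2))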
We used BR and scaled conjugate gradient (SCG) back-propagation as training algorithms. The basic idea of back-propagation algorithms is to adjust the weights in the steepest descent direction (negative of the gradient), such that the objective function decreases most rapidly [22]. However, in practice, this does not necessarily produce the fastest convergence, although the function may decrease most rapidly along the negative of the gradient. Several conjugate gradient algorithms have been proposed, in which a search is performed along conjugate directions, generally leading to faster convergence to a local function minimum than steepest descent directions [22,23]. On the one hand, with the exception of SCG, the conjugate gradient algorithms use linear searches at each iteration and thus are computationally expensive, as they require that the network response to all training inputs be computed several times for each search. On the other hand, the SCG algorithm combines a model-trust region approach with the conjugate gradient approach and uses a step size scaling mechanism to avoid time-consuming linear searches [23]. Hence, SCG can significantly reduce the number of computations performed in each iteration, but may require more iterations to converge than do the other conjugate gradient algorithms [22]. SCG uses the CVES technique to prevent over-fitting, which is commonly used in neural networks and can improve their predictive ability (generalization).

Training proceeds as follows. Let the data set be D = {y, {x_i}, i = 1, …, n}, where x_i is a vector of inputs (SNP genotypes) for individual i and y is a vector of target variables (EPD). Once a set of weights, w, is assigned to the connections in the network, this defines a mapping from the input x_i to the output ŷ_i. Let M denote a specific network architecture (in terms of numbers of neurons and choice of activation functions); then the typical objective function used for training a neural network with CVES is the sum of squared prediction errors (E_D):

E_D(D \mid w, M) = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,    (4)

for the n input-target pairs defining D.

Regularization produces shrinkage of parameter estimates towards some plausible values, while at the same time reducing their variance. In Bayesian models, regularization techniques involve imposing certain prior distributions on the model parameters. In a Bayesian ANN (e.g., BRANN), the objective function that was specified in (4) has an additional term that penalizes large weights, in the hope of achieving a smoother mapping. Gradient-based optimization is then used to minimize the following function, equal to a penalized log-likelihood:

F = \beta E_D(D \mid w, M) + \alpha E_W(w \mid M),    (5)

where E_W(w | M) is the sum of squares of the network weights, M is the ANN architecture, and α and β are positive regularization parameters that must be estimated. The second term on the right hand side of (5), known as weight decay, favors small values of w and decreases the tendency of a model to over-fit the data. Hence, training involves a tradeoff between model complexity and goodness of fit. Large values of α lead to posterior densities of weights that are highly concentrated around zero, so that the weights effectively disappear, the model discounts connections in the network [24,25], and complex models are automatically self-penalized. From equation (5), if α << β, the fitting or training algorithm places the most weight on goodness of fit. If α >> β, emphasis is placed on reducing the magnitude of the weights at the expense of goodness of fit, while producing a smoother network response [26].
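As a hedged illustration of equations (4) and (5) (again, not code from the study), the two objectives differ only by the weight-decay term, with β weighting the data fit and α the penalty on the weights:

import numpy as np

def objective_cves(y, y_hat):
    """Equation (4): sum of squared prediction errors E_D."""
    return np.sum((y_hat - y) ** 2)

def objective_brann(y, y_hat, weights, alpha, beta):
    """Equation (5): F = beta*E_D + alpha*E_W, where E_W is the sum of squared weights.
    alpha << beta -> emphasis on goodness of fit;
    alpha >> beta -> emphasis on small weights (smoother network response).
    `weights` is any iterable of weight/bias arrays of the network."""
    E_D = np.sum((y_hat - y) ** 2)
    E_W = sum(np.sum(w ** 2) for w in weights)
    return beta * E_D + alpha * E_W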
In an empirical Bayesian framework, the "optimal" weights are those that maximize the conditional posterior density P(w | D, α, β, M), which is equivalent to minimizing the regularized objective function F in equation (5). Minimization of F is identical to finding the (locally) maximum a posteriori estimates of w, denoted w^{MP}, which minimize E_D using the back-propagation training algorithms [24]. However, this is possible only if n > m, where m is the number of parameters to be estimated.

Bayes theorem yields the posterior density of α and β as:

P(\alpha, \beta \mid D, M) = \frac{P(D \mid \alpha, \beta, M)\, P(\alpha, \beta \mid M)}{P(D \mid M)}.    (6)

If the prior density P(α, β | M) is uniform, maximization of P(α, β | D, M) with respect to α and β is equivalent to maximization of P(D | α, β, M). Bayesian optimization of the regularization parameters requires computation of the Hessian matrix of the objective function F evaluated at the optimum point w^{MP} [27], but directly computing the Hessian matrix is not always necessary. As proposed by MacKay [28], the Gauss-Newton approximation to the Hessian matrix can be used if the Levenberg-Marquardt optimization algorithm is employed to locate the minimum of F [13,29,30]. The Levenberg-Marquardt training algorithm [31] achieves second-order training speed without computing the Hessian matrix. Briefly, the Hessian matrix is approximated as:

H = J'J,    (7)

where J is the Jacobian matrix that contains the first derivatives of the network errors with respect to the network parameters (the weights and biases). The gradient is computed as:

g = J'e,    (8)

and the network parameters are updated as:

w_{k+1} = w_k - (J'J + \mu I)^{-1} J'e.    (9)

Here, μ is Levenberg's damping factor, which is adjusted at each iteration and guides the optimization process. If the reduction of the error sum of squares is rapid, a smaller value of μ is used to bring the algorithm closer to the Gauss-Newton algorithm. Alternatively, the damping factor is increased to give a step in the gradient descent direction if an iteration provides insufficient reduction in the error sum of squares [32]. Optimal values of the regularization parameters, α and β, in BRANN can be calculated as:

\alpha^{MP} = \frac{\gamma}{2 E_W(w^{MP})}    (10)

and

\beta^{MP} = \frac{n - \gamma}{2 E_D(w^{MP})},    (11)

where 0 ≤ γ = m − 2α^{MP} tr(H^{MP})^{-1} ≤ m is the number of effective parameters in the neural network, and m is the total number of parameters. Here, H^{MP} is the maximum a posteriori estimate of H described in (7).
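The updates in equations (7) to (11) can be sketched as follows. This is a simplified illustration, not MATLAB's trainbr: the Jacobian J and the error vector e are assumed to be supplied by the caller, there is no step-size control of μ, and the curvature matrix used for γ is the usual Gauss-Newton approximation to the Hessian of F (2βJ'J + 2αI), which goes slightly beyond the formula quoted in the text.

import numpy as np

def levenberg_marquardt_step(w, J, e, mu):
    """Equations (7)-(9): H ~ J'J, g = J'e, w_new = w - (J'J + mu*I)^(-1) J'e.
    J is the n x m Jacobian of the network errors e with respect to the m parameters w."""
    H = J.T @ J                                                 # Gauss-Newton approximation (7)
    g = J.T @ e                                                 # gradient (8)
    return w - np.linalg.solve(H + mu * np.eye(len(w)), g)     # damped update (9)

def update_alpha_beta(w, e, J, alpha, beta):
    """Equations (10)-(11): re-estimate the regularization parameters at the current optimum,
    with gamma = m - 2*alpha*tr(H^-1) the effective number of parameters."""
    m, n = len(w), len(e)
    H = 2.0 * beta * (J.T @ J) + 2.0 * alpha * np.eye(m)  # approximate Hessian of F (an assumption here)
    gamma = m - 2.0 * alpha * np.trace(np.linalg.inv(H))
    alpha_new = gamma / (2.0 * np.sum(w ** 2))            # equation (10), E_W = sum of squared weights
    beta_new = (n - gamma) / (2.0 * np.sum(e ** 2))       # equation (11), E_D = sum of squared errors
    return alpha_new, beta_new, gamma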
BayesCπ and BayesC with π = 0

Consider the following linear model:

y_i = \mu + \sum_{j=1}^{p} x_{ij} \alpha_j + e_i,    (12)

where y_i is the marbling score EPD for the ith animal; μ is the overall mean; α_j is the substitution effect associated with the jth SNP (j = 1, …, p); x_{ij} is an indicator variable corresponding to the genotype (0, 1, 2) at the jth SNP for the ith individual; and e_i ~ N(0, σ²_e) is a residual term, where σ²_e is the residual variance.

The BayesCπ model [7] assumes that each SNP effect is null with probability π, or that it follows a normal distribution, N(0, σ²_α), with probability 1 − π, i.e.:

\alpha_j \mid \pi, \sigma^2_{\alpha} = \begin{cases} 0 & \text{with probability } \pi \\ \sim N(0, \sigma^2_{\alpha}) & \text{with probability } (1 - \pi). \end{cases}    (13)

Here, σ²_α is a variance common to all non-zero SNP effects, which is assigned a scaled inverse chi-square distribution, \chi^{-2}(\nu_{\alpha}, s^2_{\alpha}). Furthermore, the value of π is unknown and needs to be inferred, with the prior distribution of π taken to be uniform between 0 and 1:

\pi \sim \mathrm{Uniform}(0, 1).    (14)

A Bernoulli indicator variable, δ_j, is introduced to facilitate sampling from the mixture of SNP effects, as follows:

p(\delta_j \mid \pi) = \pi^{1 - \delta_j} (1 - \pi)^{\delta_j}.

Hence, unconditionally, the variable α_j follows a univariate-t distribution, t(0, S^2_{\alpha}, \nu_{\alpha}), if δ_j = 1, or is equal to zero otherwise [6]. Posterior inference of the unknown parameters in the Bayesian model via Markov chain Monte Carlo (MCMC) implementation is described in [7]. With a subset of, say, k ≤ p selected markers, the statistical model takes the same form as (12), replacing p with k for the number of markers.

By assuming that all k of the selected SNPs (based on the posterior model probability and including the frequency of marker k during MCMC) have non-null effects on the quantitative trait, we define a BayesCπ model with π = 0, which was used for the statistical inference and model cross-validation subsequent to the selection of markers (referred to as post-selection hereafter). So, posterior inference in BayesCπ with π = 0 was as for BayesCπ, except that π was fixed at zero and hence sampling of the indicator vector δ was no longer relevant.
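The prior structure in equations (12) to (14) can be made concrete by simulating from it; the sketch below draws SNP effects under assumed values of σ²_α and σ²_e (these variances and the random seed are illustrative, and this is a forward simulation, not the MCMC sampler of [7]).

import numpy as np

rng = np.random.default_rng(1)
n, p = 3079, 2421                          # animals and SNPs, as in the 3K panel
pi = rng.uniform(0.0, 1.0)                 # equation (14): pi ~ Uniform(0, 1)
sigma2_alpha, sigma2_e = 0.001, 0.05       # assumed variances, for illustration only

X = rng.integers(0, 3, size=(n, p)).astype(float)                 # genotype codes 0/1/2
delta = rng.binomial(1, 1.0 - pi, size=p)                         # delta_j = 1: SNP has a non-zero effect
alpha = delta * rng.normal(0.0, np.sqrt(sigma2_alpha), size=p)    # equation (13): spike-and-slab effects
mu = 0.0
y = mu + X @ alpha + rng.normal(0.0, np.sqrt(sigma2_e), size=n)   # equation (12)
print(f"pi = {pi:.3f}, SNPs with non-zero effects = {delta.sum()}")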
Computational implementation

MATLAB [31] was used to fit the ANN. Each neural network consisted of three layers (i.e., input, hidden and output layers). The number of neurons in the single hidden layer varied from 1 to 4. Each ANN had 2421 (or 700) inputs (SNPs). Before processing, MATLAB automatically rescaled all input and output variables using the "mapminmax" function such that they resided in the range [−1, +1], to enhance numerical stability. Two combinations of activation functions were used: (i) a set of hyperbolic tangent sigmoidal activation functions from the input layer to the hidden layer, plus a linear activation function from the hidden layer to the output layer, and (ii) a set of linear activation functions from the input layer to the hidden layer and from the hidden layer to the output layer. Training was stopped if any of the following criteria were met: (i) a maximum number (1000) of epochs was reached, (ii) performance had met a pre-specified suitable target (the performance function for feed-forward networks is the mean squared error), (iii) the gradient was below a suitable target level, or (iv) the Levenberg-Marquardt parameter exceeded 10^10.

For each ANN architecture, eight replicates were run. Each replicate was independently initialized, in order to eliminate spurious effects caused by the starting values, and to improve predictive ability. The results were presented as averages across the eight replicates per ANN architecture.

BayesCπ with π set equal to zero (referred to as the BayesCpC procedure) was implemented via a high-throughput computing pipeline to select SNPs for post-selection statistical inference and cross-validation [5]. This pipeline ran multiple chains for both feature selection and cross-validation. A three-fold cross-validation approach was employed, in which the whole data set was divided into three approximately equal portions, with two-thirds used for training and one-third used for testing, and the portions used for training and testing were rotated three times. Each cross-validation experiment was randomly replicated eight times. Three parallel MCMC chains were run for each feature-selection analysis, and each consisted of 50 000 iterations after a burn-in of 5000 iterations, thinned every tenth iterate. MCMC sampling for each cross-validation consisted of 100 000 iterations, with a burn-in of 10 000 iterations, thinned every tenth iterate.
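Two of the devices described above, rescaling of each variable to [−1, +1] (the default behaviour of MATLAB's mapminmax) and the rotated three-fold cross-validation, are simple to sketch; the Python below is our illustration rather than the pipeline of [5].

import numpy as np

def minmax_scale(x, lo=-1.0, hi=1.0):
    """Rescale each column of x to the range [lo, hi], as mapminmax does by default."""
    xmin, xmax = x.min(axis=0), x.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)   # guard against constant columns
    return lo + (hi - lo) * (x - xmin) / span

def three_fold_splits(n, rng):
    """Rotate three approximately equal portions: two-thirds for training, one-third for testing."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, 3)
    for k in range(3):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(3) if j != k])
        yield train, test

# toy usage
rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(9, 4)).astype(float)
Xs = minmax_scale(X)                                  # each column now lies in [-1, 1]
for train_idx, test_idx in three_fold_splits(len(Xs), rng):
    print(len(train_idx), len(test_idx))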
Results

Determination of an optimal SNP panel size

The predictive performance of each ANN was examined using either the 3K-SNP panel or an optimal subset of 700 selected SNPs. The latter were derived from the 3K panel, selected using the BayesCpC procedure with three-fold cross-validation. This was accomplished by examining the prediction performance of panel sizes varying from 100 to 2400 SNPs in 100-SNP increments, and the optimal set that gave the best prediction in cross-validation was identified. The reason for not choosing the optimal subset based on ANN models was that the selection tasks with a grid of 24 candidate SNP panels of varying sizes were too computationally intensive for BRANN. Nevertheless, the parallel BayesCpC pipeline handled this task easily, because all jobs were submitted to run in parallel on a cluster of dedicated computers.

As shown in Figure 3a, the correlation between marbling score EPD and their fitted values in the training set (referred to hereafter as fitting accuracy) increased almost monotonically with panel size, until reaching a plateau at a panel size of 1400 SNPs. However, the correlation between marbling score EPD and their predicted values in the testing set (referred to hereafter as predictive accuracy) reached its peak (0.863) with a panel size of 700 SNPs, and decreased thereafter. The decrease in prediction accuracy with more than 700 selected SNPs possibly reflects over-fitting in the training set, which, in this case, happened well before the panel size exceeded the training set size (i.e., approximately 2000 animals). Hence, with Bayesian regression models, prediction using more SNPs may not necessarily yield better results than prediction using a smaller panel, yet the optimal panel size may depend on many factors. In this study, we empirically chose the 700-SNP subset as the optimal panel. The fitting accuracy in the training set and predictive accuracy in the testing set using the optimal 700-SNP subset are illustrated in Figure 3b.

[Figure 3: Selection of an optimal SNP panel size for predicting marbling score expected progeny difference (EPD) using the BayesCpC approach. (a) Correlations between marbling score EPD and their fitted (predicted) values in the training and testing sets, where the SNP panel consisted of 100, 200, …, 2400 markers, respectively; (b) correlation between EPD (input values) and fitted values in the training set (upper) and correlation between EPD (input values) and predicted values in the testing set for the optimal set of 700 SNPs (lower).]
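The grid search over candidate panel sizes described in this subsection amounts to the following loop; rank_snps and fit_and_predict are hypothetical placeholders for the BayesCπ feature selection and the post-selection BayesC (π = 0) fit, so this is a sketch of the idea, not the BayesCpC pipeline itself.

import numpy as np

def choose_panel_size(X_train, y_train, X_test, y_test, rank_snps, fit_and_predict):
    """Evaluate SNP panels of size 100, 200, ..., 2400 and return the size with the
    highest correlation between observed and predicted EPD in the testing set."""
    ranked = rank_snps(X_train, y_train)            # SNPs ordered by posterior support (hypothetical)
    best_size, best_corr = None, -np.inf
    for size in range(100, 2401, 100):
        panel = ranked[:size]
        y_hat = fit_and_predict(X_train[:, panel], y_train, X_test[:, panel])
        corr = np.corrcoef(y_test, y_hat)[0, 1]     # predictive accuracy for this panel size
        if corr > best_corr:
            best_size, best_corr = size, corr
    return best_size, best_corr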
Determination of an optimal ANN architecture

The performance of the ANN architectures was examined based on the sum of squared errors (SSE) of prediction with the 3K panel, averaged over eight independent replicates of each ANN, for both BRANN and SCGANN. Each ANN had a distinct combination of training algorithm, transformation method, and number of neurons, but both BRANN and SCGANN had an input of 3K SNPs. The average SSE ranged from 13.936 to 16.882 for BRANN, and from 36.531 to 39.571 for SCGANN. Smaller SSE were produced by BRANN with nonlinear activation functions and from 2 to 4 neurons, and by SCGANN with nonlinear activation functions and from 1 to 4 neurons. There was no evidence that more complex networks (e.g., with more neurons or a non-linear transformation function) produced better predictions than the linear model, as the ANN were similar in terms of their SSE. Possibly, this was because marbling score EPD is estimated under an additive model of inheritance in which additive genetic merit has a linear relationship with SNP effects. Nevertheless, BRANN performed as well as the linear models when predicting this linear system. Also, BRANN consistently produced more accurate predictions of marbling score EPD than did SCGANN. On average, SSE obtained from BRANN were about 40% to 50% of those obtained from SCGANN (Figure 4). This is attributed to the use of Bayesian regularization, which handles over-fitting better than does the SCG back-propagation.

[Figure 4: Sum of squared errors (y-axis) in the testing sets, computed as averages from eight replicates for Bayesian neural networks (BRANN) and ANN coupled with the SCG algorithm (SCGANN), for ANN with different numbers of neurons.]

Predictive performance using the 3K-SNP panel

BRANN and BayesCpC performed very similarly with the 3K-SNP panel and both methods yielded higher prediction accuracies than did SCGANN (Figure 5). On average, the correlation in the training set was 0.941 with BayesCpC. This correlation ranged from 0.942 to 0.967 with BRANN, and from 0.796 to 0.897 with SCGANN. The average correlation in the testing set was 0.776 with BayesCpC, and ranged from 0.776 to 0.807 with BRANN, and from 0.653 to 0.689 with SCGANN. In general, these correlations increased slightly with the number of neurons, but no consistent pattern was observed (Figure 5).

[Figure 5: Correlations between marbling score expected progeny differences in the training (testing) sets and their fitted (predicted) values using BayesCpC, BRANN and SCGANN with different numbers of neurons in the hidden layer and using the 3K-SNP panel. Training = correlations in the training sets; testing = correlations in the testing sets. BayesCpC = Bayesian regression model, where the BayesCπ model is used for feature selection and BayesCπ with π = 0 is used for post-selection statistical inference and cross-validation; BRANN = artificial neural network with Bayesian regularization; SCGANN = artificial neural network with scaled conjugate gradient back-propagation.]

With the 3K-SNP panel, the number of SNPs (i.e., 2421) exceeded the number of animals (~2000) in the training set. This means that there were more parameters to be estimated than data points, even when all SNPs entered the model linearly. This explains the better predictive performance of BRANN over SCGANN: Bayesian regularization imposes a penalty for model complexity and prevents over-fitting, while SCGANN captures both signal and noise in the training set, leading to poor prediction. The results illustrate that ANN with Bayesian regularization can behave better than ANN with SCG back-propagation. BRANN penalize the estimation of large weights to achieve a smoother mapping, which leads to effectively more parsimonious models.

In the BayesCpC procedure, the BayesCπ model postulates that a portion, π, of all SNPs have no effect on marbling EPD. In a high-density SNP panel, π is typically expected to be large, meaning that the portion of "signal" SNPs, 1 − π, is small, and the chance of over-fitting is reduced. Using the Illumina Bovine3K SNP genotypes, the posterior mean (standard deviation) of π was 0.621 (0.0830), based on 2421 polymorphic SNPs; this means that, on average, 918 SNPs were fitted in the model at each MCMC iteration. Hence, over-fitting was not a concern. Posterior densities for π are shown in Figure 6 and were computed using posterior samples obtained from each of the three parallel chains and from all pooled samples of the three chains. The density plots for the three parallel chains highly resembled each other and had similar means, indicating that convergence and length of the MCMC chains were appropriate for π. We noticed that the estimate of π obtained from the 3K panel was smaller than that based on BovineSNP50 genotypes, for which it has typically been greater than 0.95 [5]. Thus, it would seem that, with a low-density SNP panel, the interpretation of SNPs having putatively non-zero effects on the target traits should be taken with caution, because many could be distant from the functional genes or quantitative trait loci. We also observed that the number of SNPs selected as having non-zero effects based on BayesCπ in the training set did not correspond to the number of SNPs in the optimal SNP panel for prediction (918 vs 700 SNPs). We suspect that the parameter π in a BayesCπ model does not fully inform on the size of an optimal SNP panel for prediction; a model that describes variation in a training set well does not necessarily yield the best predictions when generalized beyond the training set. This phenomenon is referred to as poor generalization in machine learning methods [12].

[Figure 6: Posterior density for parameter π in the BayesCπ model obtained from three parallel chains (chains #1 to #3) and from pooled samples of the three chains.]
Predictive performance using the 700-SNP panel

Unlike the case with the 3K SNPs, prediction using the selected 700 SNPs was not challenged by over-fitting. Hence, we observed a smaller difference in prediction performance between BRANN and SCGANN, because regularization was not a decisive issue in the prediction. With the 700-SNP panel, the predictive performance of BayesCpC was slightly better than that of BRANN, possibly because the subset of SNPs was optimal, at least as selected by BayesCpC. As expected, BayesCπ and BRANN outperformed SCGANN in the prediction of marbling score EPD, but differences in performance were smaller with the 700-SNP panel than with the 3K-SNP panel. Average prediction accuracy was 0.863 with BayesCpC and ranged from 0.843 to 0.858 with BRANN, whereas the average prediction accuracy with SCGANN ranged from 0.743 to 0.793. For both BRANN and SCGANN, the difference in predictive accuracy between the linear and non-linear activation functions was negligible (Figure 7).

Discussion

We investigated the predictive performance of ANN in comparison to a hierarchical Bayesian regression model, using 3K-SNP and 700-SNP panels. The 700 SNPs were preselected from the former based upon their power to predict marbling EPD. Various ANN architectures to predict marbling score EPD were examined, in conjunction with two training algorithms, two types of activation functions, and from 1 to 4 neurons in the hidden layer. We did not observe significant differences in predictive accuracy between linear and non-linear models, probably because the relationship between marbling score EPD and SNP effects is theoretically linear. An EPD produces a smoothed data point based on additive inheritance, and this smoothing may mask non-linear variation in the response variable. A better way to analyze this trait would be to remove variation due to contemporary groups from the field data and then analyze individual marbling phenotypes, but this was not possible here because we did not have access to the raw data, which were, in general, collected on progeny of the genotyped bulls. The accuracy of EPD varies between individuals, which suggests that the residuals may be heteroscedastic due to unequal prediction error variances. An alternative is to use deregressed EPD as the target variable, for which the parent averages are removed and appropriate weights can be applied to account for heteroscedastic residuals [33]. Some reports suggest that training on deregressed EBV could generate higher accuracy of genome-enhanced breeding values than training on EBV [34]. However, the main purpose of the present research was to investigate the predictive performance of ANN in comparison with Bayesian regression models in a linear system. We used EPD instead of deregressed proofs because correct deregression involves matrix operations (the data available were incomplete for correct deregression), or approximations. Both EPD and deregressed EPD are heteroscedastic because of an unequal amount of information, but the neural network software used did not allow for incorporation of heteroscedasticity in a straightforward manner.

In genomic selection, the joint estimation of a genome-wide SNP set is challenging due to the so-called "large p,
