Statistically invalid classification of high throughput gene expression data

Shahar Barbash & Hermona Soreq

The Edmond & Lily Safra Center for Brain Sciences and the Department of Biological Chemistry at the Hebrew University of Jerusalem.

Subject areas: Computational neuroscience; High-throughput screening; Functional clustering; Microarrays
Received: 29 November 2012
Accepted: 27 December 2012
Published: 22 January 2013
Correspondence and requests for materials should be addressed to H.S. (hermona.soreq@mail.huji.ac.il)

Classification analysis based on high throughput data is a common feature in neuroscience and other fields of science, with a rapidly increasing impact on both basic biology and disease-related studies. The outcome of such classifications often serves to delineate novel biochemical mechanisms in health and disease states, identify new targets for therapeutic interference, and develop innovative diagnostic approaches. Given the importance of this type of study, we screened 111 recently published high-impact manuscripts involving classification analysis of gene expression, and found that 58 of them (53%) based their conclusions on a statistically invalid method which can lead to bias in a statistical sense (lower true classification accuracy than the reported classification accuracy). In this report we characterize the potential methodological error and its scope, investigate how it is influenced by different experimental parameters, and describe statistically valid methods for avoiding such classification mistakes.

Rapidly increasing numbers of high throughput studies in the life sciences in general, and in neuroscience [1] in particular, use classification analysis for comparing two or more biological states and identifying differences between them. The growing impact of these analyses is evident from their use in the fields of neurobiology, cell biology, immunology and oncology, to name a few [1–3]. Perhaps the most common comparison is between samples from patients and those from matched healthy individuals. Such analyses often serve for identifying a 'molecular signature' or 'fingerprint genes': biomarkers that elucidate the underlying mechanism of the studied disease [4] and lead to diagnostic and drug discovery applications [5]. The continuously reduced costs, improved technological accuracy, and shortened experimental time of high throughput gene expression studies predict a future increase in similar uses of high throughput platforms. However, many studies, including classification analysis, apply statistically invalid methods that can lead to erroneous conclusions.

Results

Defining the classification inaccuracy. The most common invalid protocol starts with the selection of a subset of transcripts, usually a couple of hundred out of the (roughly) 25,000 human genes studied in different high throughput platforms, which are altered between the two test groups. It then uses this subset of genes for differentiating between the two groups. This approach usually yields a clear distinction between the two groups; however, this is not surprising given the fact that the genes on which the classification was based were initially selected BECAUSE of their differential expression between the two groups.
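The circularity of this protocol can be illustrated with a small simulation on pure noise. The sketch below is not the authors' code: the function names, the use of a Welch-type t statistic for ranking genes, the reduced number of genes (5,000 rather than 25,000) and the number of repetitions are all assumptions made for illustration. It draws two groups of 6 samples with no real difference, keeps either the 100 most altered genes or 100 randomly chosen genes, and counts how many samples a nearest-group-mean rule on the first two principal components puts in the wrong group.

```python
import numpy as np

def welch_t(x, labels):
    """Per-gene two-sample t statistic; x has shape (samples, genes)."""
    a, b = x[labels == 0], x[labels == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def misclassified_on_pc_plane(x, labels):
    """Project samples on PC1/PC2, then assign each one to the nearest group mean."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :2] * s[:2]  # PC1 and PC2 coordinates of every sample
    means = np.stack([scores[labels == g].mean(axis=0) for g in (0, 1)])
    dists = np.linalg.norm(scores[:, None, :] - means[None, :, :], axis=2)
    return int((dists.argmin(axis=1) != labels).sum())

def simulate(n_runs=20, n_per_group=6, n_genes=5000, n_selected=100, seed=0):
    """Average misclassification counts for selected vs. random gene subsets."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_per_group)
    selected, random_subset = [], []
    for _ in range(n_runs):
        # Pure noise: the two "groups" share the same distribution.
        noise = rng.standard_normal((2 * n_per_group, n_genes))
        top = np.argsort(-np.abs(welch_t(noise, labels)))[:n_selected]
        rand = rng.choice(n_genes, size=n_selected, replace=False)
        selected.append(misclassified_on_pc_plane(noise[:, top], labels))
        random_subset.append(misclassified_on_pc_plane(noise[:, rand], labels))
    return float(np.mean(selected)), float(np.mean(random_subset))

mean_errors_selected, mean_errors_random = simulate()
print(f"mean misclassified (of 12), 100 most altered genes: {mean_errors_selected:.2f}")
print(f"mean misclassified (of 12), 100 random genes:       {mean_errors_random:.2f}")
```

Under these assumptions the selected subset should separate the two noise groups almost perfectly, while the random subset should leave several samples on the wrong side, even though neither dataset contains any biological signal.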
The classification of such groups is often presented as one of the main findings of these studies; furthermore, some studies argue explicitly that, based on the clear distinction between groups, these genes may serve as a 'signature' or 'fingerprint' for diagnosis of clinical conditions in the global population. Others make a more subtle argument, albeit also erroneous, that the 'successful' classification highlights the biological difference between groups. For instance, one such study stated that "Distinctions in the pattern of expression of this group of genes (the subset of significantly altered genes) between the cells of group A and the cells of group B are clearly evident", even though such a distinction would also occur for randomly selected matrices of numbers, as we demonstrate in Fig. 1.

The reason such seemingly successful classifications are erroneous is that they are based on a circular argument: classifying two groups based on a subset of genes that were selected to be different between the groups unavoidably yields a 'successful' classification. The correct alternative, trying to classify based on the full matrices or on a subset of randomly selected genes, will often fail regardless of the classifier being used. Figure 2 presents an example of such 'failed' classifications and the statistical tools that were employed. First, we present the chances for correct classification as dendrogram plots (see Methods for details) of real sets of experimental data and comparable random datasets. These aim to classify between gene sets of groups of 6 patients and 6 healthy matched controls (Fig. 2a–c) or between two normal random datasets of a similar size (Fig. 2d–f). Classifications were based on the 100 most drastically altered genes (Fig. 2a and d), which in both cases emerged as 'successful', presenting statistically significant (but nevertheless invalid) differences between the two tested groups. However, comparisons which tested 100 randomly selected genes (Fig. 2b and e) and the full expression matrices (Fig. 2c and f) both emerged as 'failed' for the experimental as well as the random datasets.

Of note, there are cases where comparison of full datasets yields significant, and hence valid, classification. Several methods are readily available which can correctly differentiate between groups based on high throughput datasets. 47% of the 111 studies we screened, all of which performed high-throughput classification analyses of gene expression datasets, used one of these methods and avoided this statistical bias in classification experiments. Below we describe such valid classification methods, which are based on unbiased classification of the expression levels of all the tested genes, on a selection rationale for identifying groups of candidate genes, or on splitting the cohort into training and test datasets.

The outcome of selecting the most drastically altered genes based on a statistical test performed per gene is that the variance between the members of a group becomes much smaller than the variance between members of different groups. This is shown in Figure 2g, where the decision regions for classification are based on the minimal distance between the coordinate and the mean of the groups in the plane composed of components one and two of the Principal Component Analysis (PCA) (see Methods for more information). The first principal component, which captures the maximal variance and is presented on the X axis in Figure 2g, easily discriminates between the two groups. Although the results presented in Figure 2g look 'real' or 'significant', the distribution of the total number of misclassifications (Figure 2j–l) demonstrates that the outcome is always zero when using the 100 'most discriminative genes'. The chance of perfectly correct, albeit arbitrary, classification for this scenario is actually 100%. In other words, the P-value that should be assigned is 1.

In statistical terminology, the issue is one of selection (ascertainment) and errors in estimating effect sizes. Selecting a set of genes out of all genes based on treatment or group differences, when individual effects on gene expression are estimated with error, leads to a correlation between treatment difference and random error. Creating a predictor out of the 'best' genes and fitting these back in the same data then amplifies these errors and may lead to serious bias.

Scope of the invalid analysis in representative journals. To define the frequency of misclassification errors we reviewed 250 articles encompassing different life science disciplines that employed various high throughput assays. These papers were published between 2002 and 2011 in high-impact journals (e.g. Nature, Nature Neuroscience, Cell and J. of Neuroscience). In about 50% of the articles the researchers performed some sort of classification; however, 53% of these used statistically invalid classification. Figure 3 shows the prevalence of this statistically invalid procedure and demonstrates consistent numbers of studies that perform classification and a consistent frequency of the statistically invalid classification.

Statistically biased classification depends on sample size and P value. The data of any biological experiment will include a stochastic element, due to either biological or technical reasons. Therefore, every biological experiment includes in its background a random matrix of this stochastic element. Most of the time this random matrix is harmless in the sense that the chances of it producing false results are very small. This is exactly why P-values are calculated: to estimate the probability of the observed result occurring by chance. In classification analysis, on the other hand, this random matrix in the background may produce statistically biased results. There are two major experimental parameters that influence the probability that the stochastic element affects the statistical rigor of the analysis. These are the sample size and the P-value that is selected as a threshold. To estimate the origins of such statistical bias, we studied the performance of random datasets for different sample sizes and P-values (Figure 4). This analysis showed that the smaller the group size, the less valid the classification, the reason being that random data produce more false positives (erroneous 'correct' classifications). Thus, groups of fewer than 20 samples are likely to yield biased classification results much more frequently than larger ones. In comparison, the effect of P-values on the chance of misclassification was maximal for medium-sized values (e.g. P = 0.05 and 1000 tested genes) and similarly low for very small or very large P-values, below 0.001 or above 0.7.

A high P-value implies that a large fraction of the genes analyzed in the experiment was taken into consideration, but these genes show a small difference between the groups. This is the reason for the drop in the chance of correct classification in random data (as shown in the right-hand panel in Figure 4). At the extreme of a P-value approaching 1 we actually do not select any subset of genes, but rather try to classify based on all of them. And, as expected, this is where classification fails. In comparison, in the left-hand domain of low P-values, taking too few genes into consideration hampers our ability to classify. This is where 10 genes are more informative than 5. The distinction is most evident in the central region of Fig. 4: for example, in groups including around 20 samples each and for a P-value threshold of 0.001, classification of random data is 100% successful. Unfortunately, studies of such size are rare (they are expensive and technically demanding) and lower sample sizes are often employed, which would make the statistically biased definition of 'correct' classification due to randomness even more robust (compare groups of 6 samples to those of 20 in Figure 4). This implies that the stochastic element of the experiment is by itself sufficient to separate the groups when using a subset of the most altered genes. The 'successful' classification, therefore, cannot be attributed to the biological signal of the experiment and hence has no biological meaning. Using a valid statistical method, Leave One Out Cross Validation (LOOCV) for example, reduces the values to around 50% correct classifications across all group sizes and P-values (grey line in Figure 4), demonstrating the robustness of this validating procedure.

Figure 1 | Experimental classification and classification based on arbitrary numbers from a normal random matrix yield comparable classifications. Normalized expression matrices for a subset of genes taken from a larger list of deposited datasets are shown. Rows are genes and columns are biological replicates. The left matrix is based on real experimental data whereas the right one is based on random, normally distributed data. Normalized expression is color coded.

Figure 2 | Comparison of classifications based on a subset of 100 'discriminative genes' (or numbers), 100 randomly selected ones, and full matrices. Dendrogram plots are shown for groups of 6 patients and 6 healthy matched control volunteers, based on the 100 most altered genes (a), on 100 randomly selected genes (b) and on the full expression matrices (c). (d–f) Dendrogram plots based on normally distributed random data of similar sizes. (g–i) Scatter PCA plots and decision regions based on the 'minimal distance to group mean' classifier, on the plane of PC1 and PC2. (j–l) Distributions of the number of misclassifications based on the 'minimal distance to group mean' classifier for the three protocols described in (a–c). Number of iterations = 500. See Methods for further details.

Figure 3 | Scope of invalid analyses in the literature. Stacked bar graphs represent the number of studies in which statistically invalid (red) or valid (blue) classification protocols were used, in the noted years.

SCIENTIFIC REPORTS | 3 : 1102 | DOI: 10.1038/srep01102 | www.nature.com/scientificreports

Discussion

Several possible reasons for the high frequency of statistically biased classification studies are presented, as follows. First, while studies in the life sciences show a growing tendency to use computational tools, these are often internet-based software tools in which the user's control over the applied parameters is very limited and the full picture of the underlying algorithm that served to develop the software tool is inaccessible to the typical user. Second, the seemingly unbiased automatic clustering of samples provides a 'reliable' distinction between groups and presents a dramatic difference between different experimental conditions or patient groups. This points at the very attractive prospect of possible diagnostics or drug target applications. Of note, the conflict between the increasingly simple usage of large databases and the effort required for validation of single target genes can lead to limiting the number of validated genes that are selected for further focused studies. This, in turn, leads to statistically biased selection of 'signature' genes and is likely to lead to tedious, costly and frustrating efforts in translational studies.
We have shown that clustering based on a list of 'informative genes' presenting large expression differences between the studied groups leads to statistical bias and may thus draw biologically irrelevant classifications. However, several strategies exist for avoiding such errors and reaching valid clustering analyses [6,7]. First, using an a priori list of genes for clustering tests, as done in [8], is absolutely essential, because the list would not depend on the specific experiment that is being assessed. For example, defining all of the genes involved in inflammation as a sub-set is a valid strategy for classification of groups with predictably distinct inflammatory features. Second, one can test for clustering based on the entire set of data, without any filtering, as done in [9]. This method has actually been the preferred approach among those studies that avoided the statistically biased classification which we discuss here. Finally, one can divide the sample population into two sets: a training set in which the separating line is searched for, and a test set in which the clustering performance is assessed, as done in [5].

Recently, a paper by Nieuwenhuis et al. published in Nature Neuroscience [10] showed that 50% of the articles examined made a statistical mistake by which the interaction between compared groups was not taken into consideration. In some of the examined articles the implication of this mistake was that the main claim of the article could not be statistically supported. We believe that the statistically biased classification we have drawn attention to is of similar significance, and that it may have a direct impact on scientific arguments that are made in various research fields.

Figure 4 | Effect of sample size and selected P-value threshold on false classification. The percent of correct classification (Y axis) was based on random data using a subset of the most altered genes. The values are means for 30 iterations. The P-value used as a threshold for the subset is on the upper X axis, the number of genes in the subset is on the lower X axis. Percent values are drawn for 5 sample sizes, called 'biological replicates' in the graph and represented as lines with different colors, for each of the two groups. For example, 'biological replicates' = 20 indicates classification between 2 groups of 20 datasets each. The dotted grey line represents values for the LOOCV for 20 biological replicates (other sizes of biological replicates give the same LOOCV result).

Methods

Dendrogram plot. The nodes at the bottom of a dendrogram plot represent samples. While climbing up the dendrogram tree the nodes are joined together according to their proximity in a multidimensional space. In the beginning samples are scattered in a multidimensional space according to their expression values (of whole transcripts or subsets of them). In the first step, each pair of samples is joined together to form a new node whose coordinates are the means of the pair. In the second step a second level of nodes is formed similarly from the nodes in the previous step. This continues until all nodes have been assigned to a pair. The distance between each two nodes is the peak on the Y axis of the inverse U-shaped line that connects them.

'Minimal distance to group mean' classifier. Decision regions are determined as follows: the samples are scattered on a two-dimensional plane (PCA1 and PCA2 in Figure 2g–i, but it can be any other space as well). Each pixel on the plane is assigned as 'red' or 'green' according to its distance from the means of the groups. The color of the group with the closest mean to the pixel is the one that is assigned to it. Statistically biased classifications are counted as red dots in a green region, and vice versa.

References

1. Cooper-Knock, J. et al. Gene expression profiling in human neurodegenerative disease. Nature reviews. Neurology 8, 518–530 (2012).
2. Luciani, F., Bull, R. A. & Lloyd, A. R. Next generation deep sequencing and vaccine design: today and tomorrow. Trends in biotechnology 30, 443–452 (2012).
3. Roychowdhury, S. et al. Personalized oncology through integrative high-throughput sequencing: a pilot study. Science translational medicine 3, 111ra121 (2011).
4. Barbash, S. & Soreq, H. Threshold-independent meta-analysis of Alzheimer's disease transcriptomes shows progressive changes in hippocampal functions, epigenetics and microRNA regulation. Current Alzheimer research 9, 425–435 (2012).
5. Segman, R. H. et al. Peripheral blood mononuclear cell gene expression profiles identify emergent post-traumatic stress disorder among trauma survivors. Molecular psychiatry 10, 500–513, 425 (2005).
6. Ambroise, C. & McLachlan, G. J. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences of the United States of America 99, 6562–6566 (2002).
7. Wood, I. A., Visscher, P. M. & Mengersen, K. L. Classification based upon gene expression data: bias and precision of error rates. Bioinformatics (Oxford, England) 23, 1363–1370 (2007).
8. Petermann, K. B. et al. CD200 is induced by ERK and is a potential therapeutic target in melanoma. The Journal of clinical investigation 117, 3922–3929 (2007).
9. Lopes, A. R. et al. Bim-mediated deletion of antigen-specific CD8 T cells in patients unable to control HBV infection. The Journal of clinical investigation 118, 1835–1845 (2008).
10. Nieuwenhuis, S., Forstmann, B. U. & Wagenmakers, E. J. Erroneous analyses of interactions in neuroscience: a problem of significance. Nature neuroscience 14, 1105–1107 (2011).

Acknowledgments

The authors are grateful to Prof. Israel Nelken and Dr. Estelle R. Bennett, Jerusalem, for fruitful and comprehensive discussions and to the Gatsby, Legacy Heritage, ISF and Rosetree foundations for support. S.B. is an incumbent of a PhD fellowship from the ELSC Brain Center.

Author contributions

H.S. and S.B. designed and wrote the main manuscript text. S.B. performed the simulations.

Additional information

Competing financial interests: The authors declare no competing financial interests.

License: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/

How to cite this article: Barbash, S. & Soreq, H. Statistically invalid classification of high throughput gene expression data. Sci. Rep. 3, 1102; DOI:10.1038/srep01102 (2013).
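The leave-one-out cross validation mentioned in the Results can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the simulation code used for Figure 4: the helper names, the Welch-type t statistic, the nearest-group-mean rule in the full selected-gene space, and the sizes (20 samples per group, 2,000 noise genes, 100 selected genes, 10 repetitions) are all hypothetical choices. The point it shows is where gene selection happens: choosing genes once on all samples lets the held-out sample influence the gene list, whereas repeating the selection inside each fold keeps the held-out sample fully independent.

```python
import numpy as np

def welch_t(x, labels):
    """Per-gene two-sample t statistic; x has shape (samples, genes)."""
    a, b = x[labels == 0], x[labels == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def nearest_mean_predict(train_x, train_y, test_x):
    """Return the label of the group whose training mean is closest to test_x."""
    means = np.stack([train_x[train_y == g].mean(axis=0) for g in (0, 1)])
    return int(np.linalg.norm(means - test_x, axis=1).argmin())

def loocv_accuracy(x, labels, n_selected, select_inside_fold):
    """Leave-one-out accuracy with gene selection inside or outside the folds."""
    n = len(labels)
    if not select_inside_fold:
        # Invalid variant: genes are chosen once, with every sample included.
        fixed = np.argsort(-np.abs(welch_t(x, labels)))[:n_selected]
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i  # training mask without the held-out sample
        if select_inside_fold:
            # Valid variant: the held-out sample never influences the gene list.
            genes = np.argsort(-np.abs(welch_t(x[keep], labels[keep])))[:n_selected]
        else:
            genes = fixed
        pred = nearest_mean_predict(x[keep][:, genes], labels[keep], x[i, genes])
        correct += int(pred == labels[i])
    return correct / n

rng = np.random.default_rng(1)
n_per_group, n_genes, n_selected, n_runs = 20, 2000, 100, 10
labels = np.repeat([0, 1], n_per_group)
biased, valid = [], []
for _ in range(n_runs):
    noise = rng.standard_normal((2 * n_per_group, n_genes))  # no real group effect
    biased.append(loocv_accuracy(noise, labels, n_selected, select_inside_fold=False))
    valid.append(loocv_accuracy(noise, labels, n_selected, select_inside_fold=True))

mean_biased, mean_valid = float(np.mean(biased)), float(np.mean(valid))
print(f"selection on all samples:  {mean_biased:.2f}")
print(f"selection inside each fold: {mean_valid:.2f}")
```

On noise data of this kind, selection on all samples should report accuracy close to 100%, while selection inside each fold should stay near chance level; with small balanced groups, leave-one-out estimates may come out somewhat below 50%.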
