Hwang et al. BMC Bioinformatics 2010, 11:50
http://www.biomedcentral.com/1471-2105/11/50

SOFTWARE (Open Access)

FiGS: a filter-based gene selection workbench for microarray data

Taeho Hwang (1), Choong-Hyun Sun (2), Taegyun Yun (1), Gwan-Su Yi (1,2)*

* Correspondence: [email protected]
(1) Department of Bio and Brain Engineering, KAIST, Daejeon 305-701, South Korea

© 2010 Hwang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: The selection of genes that discriminate disease classes from microarray data is widely used for the identification of diagnostic biomarkers. Although various gene selection methods are currently available and some of them have shown excellent performance, no single method can retain the best performance for all types of microarray datasets. It is desirable to use a comparative approach to find the best gene selection result after a rigorous test of different methodological strategies for a given microarray dataset.

Results: FiGS is a web-based workbench that automatically compares various gene selection procedures and provides the optimal gene selection result for an input microarray dataset. FiGS builds up diverse gene selection procedures by aligning different feature selection techniques and classifiers. In addition to the highly reputed techniques, FiGS diversifies the gene selection procedures by incorporating gene clustering options in the feature selection step and different data pre-processing options in the classifier training step. All candidate gene selection procedures are evaluated by the .632+ bootstrap error and listed with their classification accuracies and selected gene sets. FiGS runs on parallelized computing nodes that accommodate heavy computations. FiGS is freely accessible at http://gexp.kaist.ac.kr/figs.

Conclusion: FiGS is a web-based application that automates an extensive search for the optimal gene selection analysis for a microarray dataset in a parallel computing environment. FiGS provides both an efficient and comprehensive means of acquiring optimal gene sets that discriminate disease states from microarray datasets.

Background

Gene selection methods for microarray data analysis are important to identify the significant genes that distinguish disease classes and to use these selected genes as diagnostic markers in clinical decisions. Due to the significance of this matter, many gene selection methods have been introduced. Each method demonstrated a proper level of quality for predicting disease states on its own test datasets, but the performance was only partially validated in the sense that a limited number of sample datasets, gene selection algorithms, and parameter settings were used in the tests. Not only is it difficult to choose the best performing gene selection method among many methods for a newly introduced dataset; it is also doubtful whether any method exists that can always guarantee the best performance for all types of microarray datasets. A desirable approach would be to find and use a gene selection procedure that is optimized for each microarray dataset in question through a rigorous performance test of representative types of methods while varying their parameters. To complete this complex work conveniently and efficiently, a tool should automate the comprehensive tests and take advantage of high-performance computing.

There have been several tools that support a comparative analysis of different gene selection methods. Prophet enables a comparison of the performance of different feature selection methods and classifiers with leave-one-out cross-validation (LOOCV) errors [1]. The procedures are automated to test the multiple classifiers, but the user must select each feature selection method. Gene Expression Model Selector (GEMS) provides an automatic selection of several feature selection methods and different types of multi-category support vector machine (MC-SVM) algorithms [2]. The authors concluded that MC-SVM was the most effective classifier after a test of several different types of classifiers, including an ensemble classification method, K-nearest neighbors, and neural networks. M@CBETH [3] is a tool designed to compare only classification algorithms. The aforementioned tools are useful, but each has room for improvement in terms of automation, the diversity of method comparisons, or the information content of the output report.

The present study introduces FiGS, a new web-based gene selection workbench. FiGS automates the generation and evaluation of various gene selection procedures in a parallel computation environment. In addition to well-established feature selection methods and classification algorithms, FiGS uniquely incorporates several methodological options that can improve performance on certain types of datasets. These include the specified selection of up- or down-regulated genes in the feature selection step, feature discretization, and new feature vector formation by adding the expression vectors of the selected genes in the classifier training step. A test using six cancer microarray datasets showed that different types of gene selection procedures should be applied to obtain the optimal gene sets for each dataset.
Implementation

FiGS generates gene selection procedures having two separate steps: feature selection and classifier training (Figure 1). The feature selection step prioritizes differentially expressed genes. The distributional assumption for microarray data can change for each dataset based on the dataset size, the presence of experimental artifacts, the underlying gene expression setup, and other factors [4]. One of our goals is to find proper statistical measures that match these unknown types of microarray data distributions automatically. To facilitate this process, three proven test statistics were selected from among a wide range of parametric and non-parametric (model-free) methods. The t-test and the Wilcoxon rank sum test were included as representative parametric and non-parametric analyses, respectively. In addition, an information gain method using an entropy-based discretization measure [5] was included as another type of non-parametric method. Using these methods, FiGS can evaluate the significance of genes with diverse input data features: the original numeric values, rank-transformed values, and discretized values. In addition, we designed a new option to specify up- or down-regulated genes for input. Genes are assigned to the up- or down-regulated group based on their differential expression in the two classes. In some cases, the selection of genes in these separate groups can enhance the classification power (see Table 1 for examples from our test).

Figure 1. Various gene selection procedures in FiGS. As many as 60 different gene selection procedures can be developed by combining the feature selection methods, classification algorithms, and various optional techniques. The feature vector addition technique is applied only to cases where the specified selection of up-regulated or down-regulated genes is used in the feature selection step. The range for the number of genes can also be set by users, though it is not shown here.
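To make the filtering step concrete, below is a minimal R sketch of such a filter: it ranks genes by a t-test or Wilcoxon rank sum test p-value and can restrict the candidates to up- or down-regulated genes. This is an illustration of the idea, not FiGS's source code; the function name rank_genes, the genes-by-samples matrix layout, and the two-level class factor are our assumptions, and the information gain filter is omitted for brevity.

```r
# Minimal sketch of the filter-based feature selection step (illustrative
# only, not FiGS's source code).
# expr: genes-x-samples numeric matrix; cls: factor with two levels.
rank_genes <- function(expr, cls, method = c("wilcoxon", "ttest"),
                       pattern = c("total", "up", "down"), k = 10) {
  method  <- match.arg(method)
  pattern <- match.arg(pattern)
  g1 <- cls == levels(cls)[1]
  # Per-gene p-value under the chosen test statistic.
  pvals <- apply(expr, 1, function(x) {
    if (method == "ttest") t.test(x[g1], x[!g1])$p.value
    else wilcox.test(x[g1], x[!g1], exact = FALSE)$p.value
  })
  # Optional restriction to up- or down-regulated genes (class 1 relative
  # to class 2), mirroring the up/down selection option described above.
  diffs <- rowMeans(expr[, g1, drop = FALSE]) -
           rowMeans(expr[, !g1, drop = FALSE])
  keep <- switch(pattern,
                 total = rep(TRUE, nrow(expr)),
                 up    = diffs > 0,
                 down  = diffs < 0)
  cand <- which(keep)
  cand[order(pvals[cand])][seq_len(min(k, length(cand)))]
}

# Example: indices of the top 10 down-regulated genes by Wilcoxon p-value.
# top <- rank_genes(expr, cls, method = "wilcoxon", pattern = "down", k = 10)
```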
During the classifier training step, the support vector machine (SVM) [6] and the random forest [7] methods were incorporated. These are state-of-the-art classification algorithms that showed excellent performance in recent comparative analyses [8,9]. The features used for classifier training can be diversified into two types: the original values and discretized values. Previous studies showed that feature discretization can considerably improve classification performance [10,11]. In addition, we developed a new form of feature vector for classifier training. It forms a new feature vector by adding together the gene expression features of the selected genes (Figure 2). As shown in Figure 2, this can generate an improved feature by amplifying a similar expression pattern or offsetting the outliers. The performance of each classifier is evaluated by the .632+ bootstrap method [12].

Figure 2. The feature vector addition technique. The symbol Cij denotes the jth sample in the ith class. Darker gray represents higher expression values.
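The feature vector addition technique can be sketched in a few lines of R. This reflects our reading of Figure 2 rather than the authors' implementation, and add_feature_vector is a hypothetical helper: summing the expression vectors of the selected, similarly regulated genes yields one extra feature per sample, in which a shared pattern is amplified and per-gene outliers tend to cancel.

```r
# Sketch of the feature vector addition technique as we read Figure 2
# (not the authors' implementation).
# expr: genes-x-samples matrix; selected: indices of the chosen genes.
add_feature_vector <- function(expr, selected) {
  sub <- expr[selected, , drop = FALSE]   # k selected genes x n samples
  rbind(sub, added = colSums(sub))        # append the summed feature vector
}
```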
All algorithms in the introduced gene selection procedures are implemented in R [13]. The e1071 [14] and randomForest [15] R packages were used to implement the SVM and the random forest, respectively. The computations are parallelized over the Parallel Virtual Machine (PVM) via the rpvm [16] and snow [17] R packages on a cluster of 9 nodes, each with dual quad-core Intel Xeon 2.46 GHz CPUs and 24 GB RAM.
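The sketch below illustrates the package calls this setup implies: training the two classifiers with e1071 and randomForest, and distributing candidate procedures to workers with snow. Default parameters are used because the paper does not state FiGS's settings, a SOCK cluster stands in for the PVM deployment so the example runs without rpvm, and evaluate_procedure is a hypothetical per-procedure routine.

```r
# Sketch only: default-parameter training with the two packages named in
# the text, plus a snow cluster for parallel evaluation.
library(e1071)         # svm()
library(randomForest)  # randomForest()
library(snow)          # makeCluster(), clusterApplyLB()

# x: samples-x-features data frame; cls: two-level factor of class labels.
train_classifiers <- function(x, cls) {
  list(svm = svm(x, cls),           # support vector machine [6]
       rf  = randomForest(x, cls))  # random forest [7]
}

# Each candidate gene selection procedure can be scored on its own worker;
# evaluate_procedure() is a hypothetical per-procedure routine.
cl <- makeCluster(8, type = "SOCK")  # the paper's deployment uses PVM
# errors <- clusterApplyLB(cl, procedures, evaluate_procedure)
stopCluster(cl)
```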
In the web-based user interface, FiGS takes a two-class microarray dataset as input in a tab-delimited text file. Users can run the default setup suggested by FiGS or design their own comparative study by selecting from among the methods and techniques described above. Users can specify several desired numbers of genes to be selected. The output reports the performance of each gene selection procedure together with the selected genes; it is available through the web site or can be automatically emailed upon a user's request.

Results

Several analyses were conducted to demonstrate the necessity and usefulness of the comprehensive approach of FiGS. We tested the performance of all possible gene selection procedures that can be generated by FiGS on six binary (two-class) microarray datasets. The microarray datasets were chosen from the literature because they have been extensively analyzed in previous publications. The six microarray datasets were as follows: leukemia (33 samples with 3051 genes) [18], colon (64 samples with 2000 genes) [19], prostate (102 samples with 6033 genes) [20], adenocarcinoma (76 samples with 9868 genes) [21], breast (77 samples with 4869 genes) [22], and diffuse large B-cell lymphoma (DLBCL) (77 samples with 5469 genes) [23]. Figure 3 shows the range of .632+ bootstrap errors in the classification models generated by all the gene selection procedures for each microarray dataset. Even on the same dataset, the errors differ greatly depending on which gene selection procedure is used. The deviation between the best and worst .632+ bootstrap error for a dataset ranges from 0.09 (adenocarcinoma) to 0.15 (leukemia). This range implies that there is a high chance of obtaining a much poorer classification model if an inappropriate gene selection procedure is used without caution. In addition, the best performing gene selection procedures differ for each dataset. We examined the best performing methods, those producing the smallest .632+ bootstrap errors with a minimum number of genes, for each dataset. The best optimized methods for each dataset are listed in Table 1 with their .632+ bootstrap errors and the number of selected genes.

Figure 3. Box plots of the .632+ bootstrap errors obtained by different gene selection procedures for each of the six cancer microarray datasets.

Table 1. The best performing gene selection procedures with the .632+ bootstrap error identified by FiGS for each of the six microarray datasets.

Dataset | Feature selection method | k | Gene expression pattern | Feature discretization | Feature vector addition | Classifier | Error
Leukemia | Wilcoxon rank sum test | 10 | Down-regulated | Not applied | Not applied | SVM | 0.02
Leukemia | Wilcoxon rank sum test | 10 | Down-regulated | Not applied | Not applied | RF | 0.02
Leukemia | Wilcoxon rank sum test | 10 | Down-regulated | Not applied | Applied | SVM | 0.02
Leukemia | Wilcoxon rank sum test | 10 | Down-regulated | Not applied | Applied | RF | 0.02
Leukemia | Wilcoxon rank sum test | 10 | Down-regulated | Applied | Not applied | SVM | 0.02
Leukemia | Information gain method | 10 | Down-regulated | Not applied | Not applied | SVM | 0.02
Leukemia | Information gain method | 10 | Down-regulated | Not applied | Not applied | RF | 0.02
Leukemia | Information gain method | 10 | Down-regulated | Not applied | Applied | SVM | 0.02
Leukemia | Information gain method | 10 | Down-regulated | Not applied | Applied | RF | 0.02
Leukemia | Information gain method | 10 | Down-regulated | Applied | Not applied | SVM | 0.02
Leukemia | Information gain method | 10 | Down-regulated | Applied | Not applied | RF | 0.02
Colon | Information gain method | 30 | Up-regulated | Not applied | Not applied | RF | 0.11
Prostate | Information gain method | 25 | Total | Not applied | Not applied | RF | 0.05
Adenocarcinoma | Wilcoxon rank sum test | 10 | Up-regulated | Not applied | Not applied | RF | 0.10
Breast | Wilcoxon rank sum test | 15 | Down-regulated | Not applied | Applied | SVM | 0.31
Breast | Information gain method | 15 | Down-regulated | Not applied | Applied | SVM | 0.31
DLBCL | Wilcoxon rank sum test | 20 | Total | Not applied | Not applied | RF | 0.08

k is the number of selected genes; the error is the .632+ bootstrap error achieved by the best performing gene selection procedure, estimated on 100 bootstrap samples. For the leukemia and breast datasets, multiple gene selection procedures tie for the best.
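For reference, the .632+ error reported in Table 1 and Figure 3 weights the resubstitution error against the leave-one-out bootstrap error, as defined by Efron and Tibshirani [12]. Below is a compact R sketch of that estimator; the variable names are our own, and it assumes the two component errors have already been computed.

```r
# Sketch of the .632+ bootstrap error estimate [12] (variable names ours).
# err_resub: resubstitution (training) error; err_b1: leave-one-out
# bootstrap error; y, yhat: factors of true labels and resubstitution
# predictions (same levels), used for the no-information rate.
err632plus <- function(err_resub, err_b1, y, yhat) {
  p <- table(y) / length(y)                       # observed class proportions
  q <- table(factor(yhat, levels = levels(y))) / length(yhat)
  gamma <- sum(p * (1 - q))                       # no-information error rate
  err_b1 <- min(err_b1, gamma)                    # cap at the no-info rate
  R <- if (err_b1 > err_resub && gamma > err_resub)  # relative overfitting
    (err_b1 - err_resub) / (gamma - err_resub) else 0
  w <- 0.632 / (1 - 0.368 * R)                    # weight grows with overfitting
  (1 - w) * err_resub + w * err_b1
}
```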
For the feature selection step, we found in all six cases that the model-free or non-parametric methods are preferable to the parametric t-test. According to the literature, non-parametric methods are more applicable in the analysis of microarray data, where the sample size is often small and an underlying distribution can hardly be assumed [4]. Nevertheless, some microarray datasets can be appropriate for the parametric t-test, especially in studies involving large numbers of samples. Between the two non-parametric methods, no single method is dominant. Both the Wilcoxon rank sum test and the information gain method can be used for the leukemia and breast datasets. However, the best feature selection method for the adenocarcinoma and DLBCL datasets is the Wilcoxon rank sum test, and the best method for the colon and prostate datasets is the information gain method. For the classifiers, the SVM and the random forest are both competitive, though, as with the feature selection methods, the applicability of each differs for each dataset. The random forest was chosen for the colon, prostate, adenocarcinoma, and DLBCL datasets; the SVM was selected for the breast dataset. There is no correlation between the types of feature selection method and classifier among the best selected procedures. Note also that the newly developed methodological designs in FiGS, namely the strategy of selecting genes within a separate up- or down-regulated group and the feature vector addition technique, are frequently part of the best gene selection procedures for these datasets. All these results underscore the need for a comprehensive comparison of the many applicable procedures.

Using the same six microarray datasets, we compared the classification performance of the best gene selection procedures found by FiGS with those of typical gene selection approaches. Six gene selection procedures were fabricated by combining three feature selection methods (namely, the t-test, the Wilcoxon rank sum test, and the information gain method) with two classification algorithms (namely, the SVM and the random forest). The number of genes to select was set to 200 on the assumption that 200 is a sufficiently large number for those six methods to give their best performance. While substantially optimizing the number of genes, FiGS found a level of classification accuracy better than, or at least comparable to, that achieved by the six methods (Figure 4), with far fewer genes. The number of genes that achieves the accuracy of FiGS shown in Figure 4 is 50 for leukemia, 30 for colon, 25 for prostate, 10 for adenocarcinoma, 15 for breast, and 70 for DLBCL. We also tried to compare the performance of FiGS with recently published methods. However, it is hard to evaluate the performance of different gene selection methods fairly and comprehensively due to the different setups of experimental designs, datasets, and performance definitions [2]. Of the various methods, we found varSelRF, a gene selection method based on random forest backward feature elimination [15], to be an appropriate method that can be readily compared with FiGS. The classification performances of varSelRF are available in [8] for all datasets that FiGS has tested except the DLBCL dataset. For the DLBCL dataset, we obtained the results by using the varSelRF tools provided by the R package. Figure 4 shows the results of two versions of varSelRF. The best optimized gene selection procedures by FiGS outperform the two different modes of varSelRF in terms of classification accuracy for all tested datasets. Although varSelRF automatically selected very few genes on the basis of its internal optimization strategy, the classification performance seems to have been compromised.

Figure 4. Comparison of the best performing gene selection procedures identified by FiGS with other gene selection approaches in terms of classification accuracy. The names of the compared gene selection procedures are abbreviated as follows: ttest, t-test; Wilcoxon, Wilcoxon rank sum test; and InfoGain, information gain method. 200 is the number of genes to select. varSelRF_SE0 and varSelRF_SE1 are two versions of varSelRF with the standard error (SE) term set to 0 and 1, respectively. FiGS_best is the best gene selection procedure identified by FiGS; it produces the best classification accuracy with the smallest number of genes. The classification accuracy represented on the y-axis is 1 - (.632+ bootstrap error).
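For reproducing the DLBCL comparison, the varSelRF package exposes a single entry point, and a plausible invocation is sketched below. We assume the SE0/SE1 modes in Figure 4 correspond to varSelRF's c.sd argument (the number of standard errors in its stopping rule); the data objects are hypothetical stand-ins.

```r
# Sketch of a varSelRF run (our reconstruction of the DLBCL comparison;
# expr and cls are hypothetical stand-ins for the dataset objects).
library(varSelRF)
# expr: samples-x-genes matrix; cls: two-level factor of disease classes.
fit_se0 <- varSelRF(expr, cls, c.sd = 0)  # presumably the varSelRF_SE0 mode
fit_se1 <- varSelRF(expr, cls, c.sd = 1)  # presumably the varSelRF_SE1 mode
fit_se1$selected.vars                     # genes retained by backward elimination
```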
During our comprehensive comparison of the various gene selection procedures in FiGS on the six real microarray datasets, we found that no single method is dominant in its performance. Thus, a comprehensive search over various gene selection procedures is both necessary and useful. Although the gene selection methods included in FiGS do not cover the complete spectrum of available methods, the set of methods selected for FiGS seems reasonable: most methods in FiGS were selected almost equally frequently in the final best gene selection procedures for the different datasets. The best optimized gene selection procedures selected by FiGS for the tested cancer microarray datasets outperformed other typical approaches and recently developed methods.

Conclusion

Finding and using a proper gene selection procedure specific to a given microarray dataset is necessary and useful. FiGS facilitates this procedure. It is an easy-to-use workbench that selects promising genes for disease classification on the basis of a rigorous and comprehensive test of various gene selection procedures for a given microarray dataset. FiGS helps researchers efficiently and conveniently determine biomarker candidates for clinical application with a greater level of accuracy and reliability.

Availability and requirements

Project name: FiGS
Project homepage: http://gexp.kaist.ac.kr/figs
Operating system(s): Platform independent (web-based application)
Programming language: R, Perl and JavaScript
Other requirements: A web browser
License: None for usage
Any restrictions to use by non-academics: None

Acknowledgements

The authors thank the anonymous reviewers for their constructive criticisms. This work was partially supported by the Industrial Source Technology Development Programs (10024720) of the Ministry of Knowledge Economy (MKE) of Korea, and partially by the MKE of Korea under the ITRC (Information Technology Research Center) support program supervised by the IITA (IITA-2009-C1090-0902-0014).

Author details

(1) Department of Bio and Brain Engineering, KAIST, Daejeon 305-701, South Korea. (2) Department of Computer Science, KAIST, Daejeon 305-701, South Korea.

Authors' contributions

GSY conceived and designed this study and interpreted the results. TH designed and carried out the analyses, and wrote the R-based source codes and the web-based interfaces. TH and GSY participated in writing the manuscript. CHS and TY contributed to building and testing the infrastructures for parallel computing and the benchmark test.

Received: 9 June 2009. Accepted: 26 January 2010. Published: 26 January 2010.

References

1. Medina I, Montaner D, Tarraga J, Dopazo J: Prophet, a web-based tool for class prediction using microarray data. Bioinformatics 2007, 23(3):390-391.
2. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21(5):631-643.
3. Pochet NLMM, Janssens FAL, Smet FD, Marchal K, Suykens JAK, Moor BLRD: M@CBETH: a microarray classification benchmarking tool. Bioinformatics 2005, 21(14):3185-3186.
4. Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507-2517.
5. Liu H, Li J, Wong L: A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns. Genome Informatics 2002, 13:51-60.
6. Vapnik VN: Statistical Learning Theory. New York: Wiley; 1998.
7. Breiman L: Random forests. Machine Learning 2001, 45:5-32.
8. Diaz-Uriarte R, de Andres SA: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.
9. Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9:319.
10. Inza I, Larrañaga P, Blanco R, Cerrolaza AJ: Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine 2004, 31(2):91-103.
11. Potamias G, Koumakis L, Moustakis V: Gene Selection via Discretized Gene-Expression Profiles and Greedy Feature-Elimination. Lecture Notes in Computer Science 2004, 3025:256-266.
12. Efron B, Tibshirani R: Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association 1997, 92(438):548-560.
13. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2009. http://www.R-project.org, ISBN 3-900051-07-0.
14. Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A: e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. http://cran.r-project.org/web/packages/e1071/index.html.
15. Liaw A, Wiener M: randomForest: Breiman and Cutler's random forests for classification and regression. http://cran.r-project.org/web/packages/randomForest/index.html.
16. Li N, Rossini AJ: rpvm: R interface to PVM (Parallel Virtual Machine). http://cran.r-project.org/web/packages/rpvm/index.html.
17. Tierney L, Rossini AJ, Li N, Sevcikova H: snow: Simple Network of Workstations. http://cran.r-project.org/web/packages/snow/.
18. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286(5439):531-537.
19. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA 1999, 96(12):6745-6750.
20. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, et al: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1(2):203-209.
21. Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nature Genetics 2003, 33(1):49-54.
22. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536.
23. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RCT, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, et al: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 2002, 8(1):68-74.

doi:10.1186/1471-2105-11-50
Cite this article as: Hwang et al.: FiGS: a filter-based gene selection workbench for microarray data. BMC Bioinformatics 2010, 11:50.
