University of Central Florida
STARS

Electronic Theses and Dissertations, 2004-2019

2014

Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

Talayeh Razzaghi
University of Central Florida

Part of the Industrial Engineering Commons
Find similar works at: https://stars.library.ucf.edu/etd
University of Central Florida Libraries http://library.ucf.edu

This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted for inclusion in Electronic Theses and Dissertations, 2004-2019 by an authorized administrator of STARS. For more information, please contact [email protected].

STARS Citation
Razzaghi, Talayeh, "Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications" (2014). Electronic Theses and Dissertations, 2004-2019. 4574.
https://stars.library.ucf.edu/etd/4574

COST-SENSITIVE LEARNING-BASED METHODS FOR IMBALANCED CLASSIFICATION PROBLEMS WITH APPLICATIONS

by

TALAYEH RAZZAGHI
B.S. University of Tehran, 2005
M.S. Sharif University of Technology, 2007

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Industrial Engineering and Management Systems in the College of Engineering and Computer Science at the University of Central Florida

Orlando, Florida
Spring Term
2014

Major Professor: Petros Xanthopoulos

© 2014 Talayeh Razzaghi

ABSTRACT

Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task.
In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Most traditional approaches to classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures to tackle imperfect data, along with addressing real problems in quality control and business analytics.

To My Beloved Aunt, Mina

ACKNOWLEDGMENTS

My journey toward fulfillment of the doctoral studies at the University of Central Florida was one of the most valuable experiences in my life. It would not have been possible without all those who encouraged and supported me during this process. I would like to gratefully acknowledge them here.

First and foremost, I would like to express the deepest appreciation to my adviser, Dr. Petros Xanthopoulos, whose help, stimulating ideas, and encouragement helped me in working on this problem and writing this dissertation. It was truly a pleasure for me to work under his supervision. I would especially like to thank him for his mentoring contributions toward my growth as a researcher.

I would like to thank my committee members, Professor Waldemar Karwowski, Dr. Jennifer Pazour, and Professor Mikusinski, for serving on my dissertation committee and for their valuable comments. I would especially like to thank Dr. Pazour for always allowing me to feel comfortable sharing my thoughts with her and for her precious advice on academic life and job search experiences. In addition, my sincere thanks go to Dr. Onur Seref of the Business Information Technology department, Virginia Tech, for his motivation and stimulating discussions.
My warm thanks to my first mentor and Master's thesis advisor, Professor Farhad Kianfar from the Industrial Engineering department at Sharif University of Technology, for inspiring me to pursue higher education abroad and for his invaluable support. I am very thankful to Ms. Liz Stavely, Maria Bull, Li Muyuan, Yilling He, Serina Haddad, and Halil Bozkurt for always being so kind, helpful, and supportive. It has been a true pleasure working in the same environment with them.

And last but not least, my special thanks to my parents and my lovely sister, Tarlan, for their unconditional support and love throughout my life. I would especially like to thank my dear aunt Mina and her husband, Ali Tarkhagh, for the love and support that they continuously gave me, especially throughout my graduate studies.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
    A Brief Overview
    A Brief History of Imbalanced Classification
        Imbalanced Classification with Class Noise
    Embedded Outlier Detection and Classification vs. Conventional Outlier Detection
    Dissertation Goal and Structure
CHAPTER 2: LITERATURE REVIEW
    Imbalanced Classification Techniques
        Resampling
        Cost-Sensitive Learning
        Ensemble Learning
    Performance Measures for Imbalanced Classification
    Outlier Detection Techniques
        Outlier Detection Evaluation Measures
    Classification with Imbalanced Data in the Presence of Outliers
    Control Chart Pattern Recognition
        Average Run Length (ARL) Based Measures
    Imbalanced Classification in Business Analytics
CHAPTER 3: METHODOLOGY
    Support Vector Machines
    Weighted Support Vector Machines
    Weighted Relaxed Support Vector Machines
    Model Selection for Support Vector Machines
CHAPTER 4: RESULTS
    Imbalanced Support Vector Machine for Control Chart Pattern Recognition
        Binary Classification
        Multi-Class Classification
    Imbalanced Support Vector Machine Classification with Label Noise
        Comparative Evaluation
        Outlier Detection Performance
CHAPTER 5: CONCLUSION
APPENDIX A: MATHEMATICAL MODELS OF CONTROL CHART PATTERNS
APPENDIX B: A PRACTICAL GUIDE TO WEIGHTED SUPPORT VECTOR MACHINE TOOLBOX FOR CONTROL CHART PATTERN RECOGNITION
    Proposed Procedure
        Data Generation
        Data Preprocessing
        Model Selection
        WSVM Training and Testing
LIST OF REFERENCES

LIST OF FIGURES

Figure 1.1: Imbalanced data classification without outliers using linear (a-b) and RBF (c-d) kernel functions. The black and gray points show the majority and minority class data points, respectively.
Figure 1.2: Imbalanced data classification with outliers using linear (a) and RBF (b) kernel functions.
Figure 2.1: Bagging Algorithm
Figure 2.2: ROC curve showing four classifiers
Figure 2.3: Imbalanced data classification in the presence of outliers. It can be observed that the classifier is greatly influenced by the outliers and the decision boundary is shifted to the right.
Figure 2.4: Examples of six abnormal patterns (bold) plotted versus an example of a normal one
Figure 2.5: Examples of the stratification abnormal pattern (bold) plotted versus an example of a normal one
Figure 2.6: Conceptual scheme for classification of imbalanced data
Figure 4.1: Geometric mean of sensitivity for different parameters, window lengths, and patterns for highly imbalanced data.
Figure 4.2: Boundary obtained for inseparable, partially separable, and separable classification problems for cyclic and stratification patterns