Shuichi Shinmura

New Theory of Discriminant Analysis After R. Fisher
Advanced Research by the Feature-Selection Method for Microarray Data

Shuichi Shinmura
Faculty of Economics
Seikei University
Musashinoshi, Tokyo
Japan

ISBN 978-981-10-2163-3
ISBN 978-981-10-2164-0 (eBook)
DOI 10.1007/978-981-10-2164-0
Library of Congress Control Number: 2016947390

© Springer Science+Business Media Singapore 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Science+Business Media Singapore Pte Ltd.

Preface

This book introduces the new theory of discriminant analysis based on mathematical programming (MP)-based optimal linear discriminant functions (OLDFs) (hereafter, “the Theory”) after R. Fisher. There are five serious problems of discriminant analysis in Sect. 1.1.2.
I develop five OLDFs in Sect. 1.3. An OLDF based on the minimum number of misclassifications (minimum NM, MNM) criterion using integer programming (IP-OLDF) reveals four relevant facts in Sect. 1.3.3. IP-OLDF shows the relation between NM and LDF clearly, in addition to the monotonic decrease of MNM. IP-OLDF and an OLDF using linear programming (LP-OLDF) are compared with Fisher’s LDF and a quadratic discriminant function (QDF) using Iris data in Chap. 2 and cephalo-pelvic disproportion (CPD) data in Chap. 3. However, IP-OLDF may not find the true MNM if the data are not in general position, as revealed by the student data in Chap. 4 (Problem 1). I therefore develop Revised IP-OLDF, Revised LP-OLDF, and Revised IPLP-OLDF, which is a mixture model of Revised LP-OLDF and Revised IP-OLDF. Only Revised IP-OLDF can find the true MNM corresponding to an interior point of the optimal convex polyhedron (optimal CP, OCP) defined on the discriminant coefficient space in Sect. 1.3. Because no LDF other than Revised IP-OLDF can discriminate cases on the discriminant hyperplane exactly (Problem 1), the NMs of these LDFs may not be correct. By examining all 63 models formed from the six independent variables, IP-OLDF finds that the Swiss banknote data in Chap. 6 are linearly separable data (LSD) and that the two-variable model (X4, X6) is a minimum linearly separable model. Revised IP-OLDF later confirms this result. By the monotonic decrease of MNM, the 16 models that include (X4, X6) are linearly separable models. This fact is very important for understanding the gene analysis. Only Revised IP-OLDF and a hard-margin support vector machine (H-SVM) can discriminate LSD theoretically (Problem 2). Problem 3 is the defect of the generalized inverse of variance-covariance matrices, which causes trouble for QDF and regularized discriminant analysis (RDA). I solve Problem 3, which is explained through the pass/fail determinations using 18 examination scores in Chap. 5. Although these data are LSD, the error rates of Fisher’s LDF and QDF are very high because these datasets do not satisfy Fisher’s assumption.
These facts point to a serious problem: we had better re-evaluate the discriminant results of Fisher’s LDF and QDF. In particular, we should re-evaluate medical diagnoses and various ratings, because these data are of the same type as the test data and have many cases on the discriminant hyperplane. Because Fisher never formulated an equation for the standard error (SE) of error rates and discriminant coefficients (Problem 4), I develop a 100-fold cross-validation for small sample method (hereafter, “Method 1”). Method 1 offers the 95 % confidence interval (CI) of the discriminant coefficients and error rates. Moreover, I develop a powerful model selection procedure: the best model is the one with the minimum mean error rate in the validation samples (M2). On six datasets (the five above plus the Japanese-automobile data), the best models of Revised IP-OLDF are better than those of the other seven LDFs. Therefore, I initially believed that the Theory was established in 2015. However, when Revised IP-OLDF discriminated six microarray datasets (the datasets) in November 2015, it selected features naturally. Although Revised IP-OLDF can also select features naturally for the Swiss banknote data and the Japanese-automobile data in Chap. 7, I did not think this was a very important fact, because the best model already offers a useful model selection procedure for common data. For more than ten years, many researchers have struggled to analyze gene datasets because the number of genes is huge and common statistical methods are difficult to apply (Problem 5). I develop a Matroska feature-selection method (hereafter, “Method 2”) and a LINGO program. Method 2 reveals that each dataset consists of several disjoint small linearly separable subspaces (small Matroskas, SMs) and another high-dimensional subspace that is not linearly separable. Therefore, we can analyze each SM by ordinary statistical methods. We found Problem 5 in November 2015 and solved it in December 2015.
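The model counts quoted above for the Swiss banknote data can be checked by simple enumeration. The following is only an illustrative sketch: the labels X1 to X6 are assumed (the text names only X4 and X6 among six variables), and the code counts candidate models rather than fitting any discriminant function. It relies on the monotonic decrease of MNM stated above, under which every model that contains the linearly separable pair (X4, X6) is itself linearly separable.

```python
from itertools import combinations

# Assumed labels: the text names X4 and X6 and states that there are six
# independent variables, so X1..X6 is used here purely for illustration.
variables = [f"X{i}" for i in range(1, 7)]

# Every non-empty subset of the six variables is one candidate model.
models = [
    frozenset(subset)
    for size in range(1, len(variables) + 1)
    for subset in combinations(variables, size)
]

# Models that contain the minimum linearly separable pair (X4, X6).
minimal = frozenset({"X4", "X6"})
separable = [m for m in models if minimal <= m]

print(len(models))     # 2**6 - 1 = 63 candidate models
print(len(separable))  # 2**4 = 16 models containing both X4 and X6
```

The second count is 2^4 because each of the four remaining variables is either included in or excluded from a model that already contains X4 and X6.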
The book represents my life’s work and research, to which I have dedicated over 44 years of my life. After graduating from Kyoto University in 1971, I was employed by SCSK Corp. in Japan as a system integrator. Naoji Tuda, the grandson of the second-generation general director Teigo Iba of Sumitomo Zaibatsu, was my boss, and he believed that medical engineering (ME) was an important target for the information-processing industries. Through his decision, I became a member of the project for the automatic diagnostic system of electrocardiogram (ECG) data with the Osaka Center for Cancer and Cardiovascular Diseases and NEC. The project leader, Dr. Yutaka Nomura, ordered me to develop the medical diagnostic logic for ECG data using Fisher’s LDF and QDF. Although I had hoped to become a mathematical researcher when I was a senior student in high school, I failed the entrance examination of the graduate school at Kyoto University because I spent much more time on the activities of the university swimming club. Although I did not become a mathematical researcher, I started research with ME. The research I conducted from 1971 to 1974 using Fisher’s LDF and QDF was inferior to his experimental decision tree logic. Initially, I believed that my statistical ability was poor. However, I soon realized that Fisher’s assumption was too strict for medical diagnosis. I proposed the earth model (Shinmura, 1984; see the references in Chap. 1) for medical diagnosis instead of Fisher’s assumption. This experience gave me the motivation to develop the Theory. Shinmura et al. (1973, 1974) proposed a spectrum diagnosis using Bayesian theory, which was the first trial for the Theory. However, logistic regression was more suitable for the earth model. Shimizu et al. (1975) requested me to analyze photochemical air pollution data by Hayashi quantification theory, and this became my first paper. Dr. Takaichirou Suzuki, leader of the Epidemiology Group, provided me with several themes for many types of cancers (Shinmura et al.
1983). In 1975, I met Prof. Akihiko Miyake from the Nihon Medical School at the workshop organized by Dr. Shigekoto Kaihara, Professor Emeritus of the Medical School of Tokyo University. Miyake and Shinmura (1976) studied the relationship between the population and sample error rates in Fisher’s LDF. Next, Miyake and Shinmura (1979) developed an OLDF based on the MNM criterion by a heuristic approach. Shinmura and Miyake (1979) discriminated CPD data with collinearities. After we had revised a paper two or three times, a statistical journal rejected it. However, Miyake and Shinmura (1980) was accepted by the Japanese Society for Medical and Biological Engineering (JSMBE). The editors of the time judged that the OLDF based on the MNM criterion overestimated the validation sample and that Fisher’s LDF did not, because Fisher’s LDF was derived from the normal distribution, without examination of real data. I was deeply disappointed that many statisticians disliked reviewing real data and started their research from a normal distribution because it was very comfortable for them to work without examining real data (lotus eating). However, I could not develop a second trial of the Theory because of poor computer power and a defect in the heuristic approach. Shinmura et al. (1987) analyzed the specific substance mycobacterium (SSM, commonly known as Maruyama vaccine). From 270,000 patients, we categorized 152,289 cancer patients into four postoperative groups. Patients who were administered SSM within one year after surgery were divided into four groups according to the three-month period in which SSM administration started. We assumed that SSM is only water without side effects, and this was the null hypothesis. The survival time for the first group was longer than for the fourth group from nine months to 12 months after surgery, and the null hypothesis was rejected. In 1994, Prof. Kazunori Yamaguchi and Michiko Watanabe strongly recommended that I apply for the position at Seikei University.
After organizing the 9th Symposium of JSCS in SCSK at Ryogoku near Ryogoku Kokugikan in March 1995, I became a professor at the Economic Department in April of the same year. Dr. Tokuhide Doi presented a long-term care insurance system that employed a decision tree method as advised by me. (Doctor Kaihara planned this system as an advisor to the Ministry of Health and Welfare, and I advised Dr. Doi to use the decision tree.) In 1997, Prof. Tomoyuki Tarumi advised me to obtain a doctorate degree in science at his graduate school. Without examining the previous research, I developed IP-OLDF and LP-OLDF, which discriminated Iris data, CPD data, and 115 random number datasets. IP-OLDF found two relevant facts about the Theory. Therefore, we confirmed that the MNM criterion was essential for discriminant analysis and completed the Theory in 2015. The Theory is as useful for the gene datasets as for ordinary datasets. Readers can download all my research from researchmap and the Theory from ResearchGate.

https://www.researchgate.net/profile/Shuichi_Shinmura
http://researchmap.jp/read0049917/?lang=english

Musashinoshi, Japan
Shuichi Shinmura

Acknowledgments

I wish to thank all researchers who contributed to this book: Linus Schrage, Kevin Cunningham, Hitoshi Ichikawa, John Sall, Noriki Inoue, Kyoko Takenaka, Masaichi Okada, Naohiro Masukawa, Aki Ishii, Ian B. Jeffery, Tomoyuki Tarumi, Yutaka Tanaka, Kazunori Yamaguchi, Michiko Watanabe, Yasuo Ohashi, Akihiko Miyake, Shigekoto Kaihara, Akira Ooshima, Takaichirou Suzuki, Tadahiko Shimizu, Tatuo Aonuma, Kunio Tanabe, Hiroshi Yanai, Toji Makino, Jirou Kondou, Hiroshi Takamori, Hidenori Morimura, Atsuhiro Hayashi, Iebun Yun, Hirotaka Nakayama, Mika Satou, Masahiro Mizuta, Souichirou Moridaira, Yutaka Nomura, Naoji Tuda.

I am grateful to my family, and in particular for the legacy of my late father, who supported the research: Otojirou Shinmura, Reiko Shinmura, Makiko Shinmura, Hideki Shinmura, and Kana Shinmura.
I would like to thank Editage (www.editage.jp) for English language editing.

Contents

1 New Theory of Discriminant Analysis
  1.1 Introduction
    1.1.1 Theory Theme
    1.1.2 Five Problems
      1.1.2.1 Problem 1
      1.1.2.2 Problem 2
      1.1.2.3 Problem 3
      1.1.2.4 Problem 4
      1.1.2.5 Problem 5
      1.1.2.6 Summary
  1.2 Motivation for Our Research
    1.2.1 Contribution by Fisher
    1.2.2 Defect of Fisher’s Assumption for Medical Diagnosis
    1.2.3 Research Outlook
    1.2.4 Method 1 and Problem 4
  1.3 Discriminant Functions
    1.3.1 Statistical Discriminant Functions
    1.3.2 Before and After SVM
    1.3.3 IP-OLDF and Four New Facts of Discriminant Analysis
    1.3.4 Revised IP-OLDF, Revised LP-OLDF, and Revised IPLP-OLDF
  1.4 Unresolved Problem (Problem 1)
    1.4.1 Perception Gap of Problem 1
    1.4.2 Student Data
  1.5 LSD Discrimination (Problem 2)
    1.5.1 Importance of This Problem
    1.5.2 Pass/Fail Determination
    1.5.3 Discrimination by Four Testlets