Sample Size Planning for Classification Models

Claudia Beleites^a,*, Ute Neugebauer^a,b, Thomas Bocklitz^c, Christoph Krafft^a, Jürgen Popp^a,b,c

a Department of Spectroscopy and Imaging, Institute of Photonic Technology, Albert-Einstein-Str. 9, 07745 Jena, Germany
b Center for Sepsis Control and Care, Jena University Hospital, Erlanger Allee 101, 07747 Jena, Germany
c Institute of Physical Chemistry and Abbé Center of Photonics, Friedrich-Schiller-University Jena, Helmholtzweg 4, 07743 Jena, Germany

* Corresponding author. Email address: [email protected] (Claudia Beleites)

Abstract

In biospectroscopy, suitably annotated and statistically independent samples (e.g. patients, batches, etc.) for classifier training and testing are scarce and costly. Learning curves show the model performance as function of the training sample size and can help to determine the sample size needed to train good classifiers. However, building a good model is actually not enough: the performance must also be proven. We discuss learning curves for typical small sample size situations with 5–25 independent samples per class. Although the classification models achieve acceptable performance, the learning curve can be completely masked by the random testing uncertainty due to the equally limited test sample size. In consequence, we determine test sample sizes necessary to achieve reasonable precision in the validation and find that 75–100 samples will usually be needed to test a good but not perfect classifier. Such a data set will then allow refined sample size planning on the basis of the achieved performance. We also demonstrate how to calculate necessary sample sizes in order to show the superiority of one classifier over another: this often requires hundreds of statistically independent test samples or is even theoretically impossible. We demonstrate our findings with a data set of ca. 2550 Raman spectra of single cells (five classes: erythrocytes, leukocytes and three tumour cell lines BT-20, MCF-7 and OCI-AML3) as well as by an extensive simulation that allows precise determination of the actual performance of the models in question.

Keywords: small sample size, design of experiments, multivariate, learning curve, classification, training, validation

Accepted Author Manuscript. This paper has been published as C. Beleites, U. Neugebauer, T. Bocklitz, C. Krafft and J. Popp: Sample size planning for classification models. Analytica Chimica Acta, 2013, 760 (Special Issue: Chemometrics in Analytical Chemistry 2012), 25–33, DOI: 10.1016/j.aca.2012.11.007. The manuscript is also available as arXiv no. 1211.1323, where the source files also contain the source code shown in supplementary file II.

1. Introduction

Sample size planning is an important aspect in the design of experiments. While this study explicitly targets sample size planning in the context of biospectroscopic classification, the ideas and conclusions apply to a much wider range of applications. Biospectroscopy suffers from extreme scarcity of statistically independent samples, but small sample size problems are common also in many other fields of application.

In the context of biospectroscopic studies, suitably annotated and statistically independent samples for classifier training and validation frequently are rare and costly. Moreover, the classification problems are often rather ill-posed (e.g. diseased vs. non-diseased). In these situations, particular classes are extremely rare, and/or large sample sizes are necessary to cover classes that are rather ill-defined like "not this disease" or "out of specification". In addition, ethical considerations often restrict the studied number of patients or animals.

Even though the data sets often consist of thousands of spectra, the statistically relevant number of independent cases is often extremely small due to the "hierarchical" structure of biospectroscopic data sets: many spectra are taken of the same specimen, and possibly multiple specimens of the same patient are available. Or, many spectra are taken of each cell, and a number of cells is measured for each cultivation batch, etc. In these situations, the number of statistically independent cases is given by the sample size on the highest level of the data hierarchy, i.e. patients or cell culture batches. All these reasons together lead to sample sizes that are typically in the order of magnitude between 5 and 25 statistically independent cases per class.
Learning curves describe the development of the performance of chemometric models as function of the training sample size. The true performance depends on the difficulty of the task at hand and must therefore be measured by preliminary experiments. Estimation of necessary sample sizes for medical classification has been done based on learning curves [1, 2] as well as on model based considerations [3, 4]. In pattern recognition, necessary training sample sizes have been discussed for a long time (e.g. [5–7]).

However, building a good model is not enough: the quality of the model needs to be demonstrated.

One may think of training a classifier as the process of measuring the model parameters (coefficients etc.). Likewise, testing a classifier can be described as a measurement of the model performance. Like other measured values, both the parameters of the model and the observed performance are subject to systematic (bias) and random (variance) uncertainty.

Classifier performance is often expressed in fractions of test cases, counted from different parts of the confusion matrix, see fig. 1. These ratios summarize characteristic aspects of performance like sensitivity (Sens_A: "How well does the model recognize truly diseased samples?", fig. 1b), specificity (Spec_A: "How well does the classifier recognize the absence of the disease?", fig. 1c), and positive and negative predictive values (PPV_A / NPV_A: "Given the classifier diagnoses disease / non-disease, what is the probability that this is true?", fig. 1d and 1e). Sometimes further ratios, e.g. the overall fraction of correct predictions or misclassifications, are used.

Figure 1: Confusion matrix (a) and characteristic fractions (b)–(e). The parts of the confusion matrix summed as numerator and denominator for the respective fraction with respect to class A are shaded. Panels: (a) confusion matrix (reference classes A, B, C in rows; predictions in columns), (b) Sens_A, (c) Spec_A, (d) PPV_A, (e) NPV_A.
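To make these definitions concrete, the following minimal R sketch (our illustration, not the code from supplementary file II; all numbers are invented) computes the four characteristic fractions of fig. 1 from a hypothetical three-class confusion matrix with reference classes in rows and predicted classes in columns.

    ## hypothetical confusion matrix: rows = reference, columns = prediction
    cm <- matrix(c(40,  3,  2,
                    4, 35,  6,
                    1,  5, 44),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(reference  = c("A", "B", "C"),
                                 prediction = c("A", "B", "C")))

    sens <- diag(cm) / rowSums(cm)   # correctly recognized fraction of each true class
    ppv  <- diag(cm) / colSums(cm)   # correct fraction within each predicted class
    spec <- sapply(1 : nrow(cm), function(i)
      sum(cm[-i, -i]) / sum(cm[-i, ]))  # "not class i" samples recognized as not i
    npv  <- sapply(1 : nrow(cm), function(i)
      sum(cm[-i, -i]) / sum(cm[, -i]))  # "not class i" predictions that are correct

Note how the denominators differ between the fractions; this is exactly the point taken up in the next paragraphs.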
The predictive values, while obviously of more interest to the user of a classifier than sensitivity and specificity, cannot be calculated without knowing the relative frequencies (prior probabilities) of the classes.

From the sample size point of view, one important difference between these ratios is the number of test cases n_test that appears in the denominator. This test sample size plays a crucial role in determining the random uncertainty of the observed performance p̂ (see below). Particularly in multi-class problems, this test sample size varies widely: the number of test cases truly belonging to the different classes may differ, leading to different and rather small test sample sizes for determining the sensitivity p of the different classes. In contrast, the overall fractions of correct or misclassified samples use all tested samples in the denominator.

The specificity is calculated from all samples that truly do not belong to the particular class (fig. 1c). Compared to the sensitivities, the test sample size in the denominator of the specificities is therefore usually larger and the performance estimate more precise (with the exception of binary classification, where the specificity of one class is the sensitivity of the other). Thus small sample size problems in the context of measuring classifier performance are better illustrated with sensitivities. It should also be kept in mind that the specificity often corresponds to an ill-posed question: "not class A" may be anything, yet not all possibilities of a sample truly not belonging to class A are of the same interest. In multi-class set-ups, the specificity will often pool easy distinctions with more difficult differential diagnoses. In our application [8, 9], the specificity for recognizing that a cell does not come from the BT-20 cell line pools e.g. the fact that it is not an erythrocyte (which can easily be determined by eye without any need for chemometric analysis) with the fact that it does not come from the MCF-7 cell line, which is far more similar (yet from a clinical point of view possibly of low interest, as both are breast cancer cell lines), and the clinically important fact that it does not belong to the OCI-AML3 leukemia. This pooling of all other classes has important consequences. Increasing numbers of test cases in easily distinguished classes (erythrocytes) will lead to improved specificities without any improvement for the clinically relevant differential diagnoses. Also, it must be kept in mind that random predictions (guessing) already lead to specificities that seem to be very good. For our real data set with five different classes, guessing yields specificities between 0.77 and 0.85. Reported sensitivities should also be read in relation to guessing performance, but neglecting to do so will not cause an intuitive overestimation of the prediction quality: guessing sensitivities are around 0.20 in our five-class problem.

Examining the non-diagonal parts of the confusion table instead of specificities avoids these problems. If reported as fractions of test cases truly belonging to that class, all elements of the confusion table behave like the sensitivities on the diagonal; if reported as fractions of cases predicted to belong to that class, the entries behave like the positive predictive values (again on the diagonal).

Literature guidance on how to obtain low total uncertainty and how to validate different aspects of model performance is available [10–14]. In classifier testing, usually several assumptions are implicitly made which are closely related to the behaviour of the performance measurements in terms of systematic and random uncertainty.

Classification tests are usually described as a Bernoulli process (repeated coin throwing, following a binomial distribution): n_test samples are tested, and thereof k successes (or test errors) are observed. The true performance of the model is p, and its point estimate is

p̂ = k / n_test    (1)

with variance

Var(k / n_test) = p (1 − p) / n_test    (2)
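A small simulation (again our illustration, not the paper's code) shows what eqs. (1) and (2) mean in practice: even for a perfectly stable model with exactly known true performance p, the observed p̂ scatters considerably at typical small test sample sizes.

    set.seed(42)                        # arbitrary seed, for reproducibility only
    p      <- 0.80                      # assumed true performance
    n.test <- 25                        # typical small test sample size

    k     <- rbinom(10000, size = n.test, prob = p)  # 10000 hypothetical test runs
    p.hat <- k / n.test                              # eq. (1)

    sd(p.hat)                           # observed scatter of the estimate
    sqrt(p * (1 - p) / n.test)          # eq. (2): theoretical value, 0.08

A standard deviation of 0.08 on an estimated sensitivity of 0.8 is far from negligible, which is the core problem discussed in this paper.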
In small sample size situations, resampling strategies like the bootstrap or repeated/iterated k-fold cross validation are most appropriate. These strategies estimate the performance by setting aside a (small) part of the samples for independent testing and building a model without these samples, the surrogate model. The surrogate model is then tested with the samples that were set aside. The test results are refined by repeating/iterating this procedure a number of times. Usually, the average performance over all surrogate models is reported. This is an unbiased estimate of the performance of models with the same training sample size as the surrogate models [1, 11]. Note that the observed variance over the surrogate models possibly underestimates the true variance of the performance of models trained with n_train cases [1]. This is intuitively clear if one thinks of a situation where the surrogate models are perfectly stable, i.e. different surrogate models yield the same prediction for any given case. No variance is then observed between different iterations of a k-fold cross validation. Yet the observed performance is still subject to the random uncertainty due to the finite test sample size of the underlying Bernoulli process.

Usually, the performance measured with the surrogate models is used as an approximation of the performance of a model trained with all samples, the final model. The underlying assumption is that setting aside the surrogate test data does not affect the model performance. In other words, the learning curve is assumed to be flat between the training sample size of the surrogate models and the training sample size of the final model. The violation of this assumption causes the well-known pessimistic bias of resampling based validation schemes.

The results of testing many surrogate models are usually pooled. Strictly speaking, pooling is allowed only if the distributions of the pooled variables are equal. The description of the testing procedure as a Bernoulli process allows pooling if the surrogate models have equal true performance p, in other words, if the predictions of the models are stable with respect to perturbed training sets, i.e. if exchanging a few samples does not lead to changes in the predictions. Consequently, model instability causes additional variance in the measured performance.

Here, we discuss the implications of these two aspects of sample size planning with a Raman-spectroscopic five-class classification problem: the recognition of five different cell types that can be present in blood. In addition to the measured data set, the results are complemented by a simulation which allows arbitrary test precision.
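The skeleton of such an iterated k-fold cross validation is sketched below. This is our simplified stand-in, not the actual set-up of the study (which uses PLS-LDA, see section 2.3): MASS::lda serves as example classifier, and X (spectra matrix) and y (class factor) are assumed, hypothetical objects.

    library(MASS)

    iterated.cv <- function(X, y, k = 5, iterations = 100) {
      sens <- replicate(iterations, {
        folds <- sample(rep(1 : k, length.out = nrow(X)))  # random partition
        pred  <- character(nrow(X))
        for (fold in 1 : k) {
          test <- folds == fold
          ## surrogate model: trained without the held-out fold
          model <- lda(X[!test, , drop = FALSE], y[!test])
          pred[test] <- as.character(predict(model, X[test, , drop = FALSE])$class)
        }
        tapply(pred == as.character(y), y, mean)  # per-class sensitivity
      })
      rowMeans(sens)   # pool: average over all iterations and surrogate models
    }

The final rowMeans() is exactly the pooling step discussed above; it is justified only as long as the surrogate models are stable.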
2. Materials and Methods

2.1. Raman Spectra of Single Cells

Raman spectra of five different types of cells that could be present in blood are used in this study. Details of the preparation, measurements and the application have been published previously [8, 9]. The data were measured in a stratified manner, specifying roughly equal numbers of cells per class beforehand, and do not reflect relative frequencies of the different cells in a target patient population. Thus, we cannot calculate predictive values for our classifiers.

For this study, the spectra were imported into R [15] using package hyperSpec [16]. In order to correct for deviations of the wavenumber calibration, the maximum of the CaF2 band was aligned to 322 cm-1. The spectra then underwent a smoothing interpolation (spc.loess) onto a common wavenumber axis ranging from 500 to 1800 and 2600 to 3200 cm-1 with a data point spacing of 4 cm-1. Baseline correction was performed in the high wavenumber region by a third order polynomial fit to spectral regions where no CH stretching signals occur (2700–2825, 3020–3040 and 3085–3200 cm-1), which was then used as baseline for the CH stretching bands from 2810 to 3085 cm-1. A third order polynomial automatically selecting support points between 500–1200 cm-1 was blended smoothly with a quadratic polynomial automatically selecting support points in the ranges 800–1200 and 1700–1800 cm-1. After baseline correction, the spectral ranges 600–1800 and 2810–3085 cm-1 were retained. Finally, the spectra were area normalized.

Figure 2 shows the preprocessed spectra. Erythrocyte (red blood cell, rbc) spectra can easily be recognized by the resonance enhanced characteristic signature of hemoglobin around 1600 cm-1. Leukocyte (leu) spectra are rather similar to the tumour cell spectra, yet there are subtle differences in the shape of the CH2 deformation vibrations around 1440 cm-1, the intensity of the ν_CH stretching vibrations (2810–3085 cm-1), which are more intense in the tumour cells, and the intensity of the phenylalanine band at 1002 cm-1 (less intense in the tumour cells). Between the different tumour cell lines (bt, mcf, and oci) no distinct marker bands are visible by eye.

Figure 2: Spectra of the 5 classes: BT-20 breast carcinoma cells, MCF-7 breast carcinoma cells, OCI-AML3 leukemia cells, normal leukocytes and normal erythrocytes (from top to bottom; intensity I / a.u. vs. wavenumber Δν̃ / cm-1, 600–1800 and 2900–3100 cm-1). Shown are the median and the 5th to 95th percentile spectra. The confusion tables are available as supplementary material.

Variation in the data set is introduced by using cells from 5 different donors (leukocytes and erythrocytes) and 5 different cultivation batches, respectively; by measuring the cells on the first day of preparation and one day after (yielding 9 measurement days); and by using two different lasers of the same model from the same manufacturer. For the present study, we pretend not to know of these influencing factors and treat the spectra as independent. This allows us to pretend that we have a sufficiently large data set to run reference calculations that can be used as ground truth. The consequence is that no performance for the recognition of the cell lines in general can be inferred from this study: the results would be heavily overoptimistic (tab. 1, see also [14] for a discussion of representative testing).

Table 1: Data set characteristics: classes, number of spectra per class, and "best possible" sensitivities. For the simulated (sim.) data (columns "sim. LDA" and "sim. PLS-LDA"), n_test = 2·10^4 spectra. Best possible performance of the real data was estimated using 100× 5-fold cross validation; shown are the average and the 5th to 95th percentile of the observed sensitivities over the iterations. Column "real PLS-LDA" corresponds to the set-up for this study, treating each spectrum as independent of the other spectra; for the last column ("real PLS-LDA batch-wise") the validation splits patients and batches rather than spectra.

    class  cell type            n spectra  sim. LDA  sim. PLS-LDA  real PLS-LDA      real PLS-LDA batch-wise
    rbc    erythrocytes         372        1.00      1.00          0.99 (0.96–0.99)  0.97 (0.96–0.98)
    leu    leukocytes           569        1.00      0.99          0.97 (0.96–0.97)  0.87 (0.84–0.90)
    mcf    MCF-7 breast carc.   558        0.95      0.87          0.91 (0.90–0.92)  0.31 (0.24–0.42)
    bt     BT-20 breast carc.   532        0.91      0.72          0.75 (0.74–0.76)  0.38 (0.32–0.45)
    oci    OCI-AML3 leukemia    518        0.94      0.86          0.89 (0.88–0.90)  0.30 (0.23–0.37)

Hence, we have a data set of about 2500 spectra (tab. 1) of five classes with "unknown" influencing factors. The difficulty of recognising the five different classes varies widely: while erythrocytes are extremely easy to recognize, we expect that perfect recognition of leukocytes is possible as well, though more training cases are needed to achieve this. Differential diagnosis of the cancer cell lines is more difficult, and substantial overlap between the two breast carcinoma cell lines BT-20 and MCF-7 has been observed in previous studies [8, 9]. Throughout this paper, we discuss the sensitivities for erythrocytes (rbc), leukocytes (leu) and the tumour cell line BT-20 (bt).

Of these 2500 spectra, we draw data sets of size 25 cases/class, keeping the remaining spectra as a large test set to get a more precise estimate of the performance of the respective models. rbc is the smallest class; its sensitivity can be estimated with a precision better than ±0.052 (95% confidence interval at a sensitivity of 0.5).
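The ±0.052 can be checked with the normal approximation of the binomial standard error, assuming (our reading of the set-up) that the rbc test set consists of the 372 − 25 = 347 spectra not drawn into the small data set:

    n.test <- 372 - 25                        # remaining rbc spectra used for testing
    qnorm(0.975) * sqrt(0.5 * 0.5 / n.test)   # 95 % half-width at p = 0.5: ca. 0.052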
2.2. Simulated Spectra

In addition to the experimental data set, simulations were used. This allows studying an idealized situation: arbitrarily large test sets allow measuring the true performance with negligible random uncertainty due to the testing. Thus, the random uncertainty due to model instability can be measured with the simulations, while these two sources of random uncertainty cannot be separated for the real data.

For each of the five classes in the experimental data set, the average spectrum and covariance matrix were calculated. Multivariate normally distributed spectra were then simulated using rmvnorm [17, 18]. Briefly, the Mersenne-Twister algorithm generates uniformly distributed pseudo-random numbers which are then converted to normally distributed random numbers via the inverse cumulative distribution function. The requested covariance structure is obtained by multiplying with the matrix root of the covariance matrix (calculated via eigenvalue decomposition), and the requested mean spectrum is added.
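A minimal sketch of this simulation step, assuming hypothetical objects spc (spectra matrix) and grp (class factor); rmvnorm's default method = "eigen" corresponds to the eigendecomposition described above.

    library(mvtnorm)

    ## per class: estimate mean and covariance, then draw simulated spectra;
    ## with many variates the estimated covariance may be near-singular,
    ## in which case rmvnorm warns
    sim <- lapply(levels(grp), function(cl) {
      spc.cl <- spc[grp == cl, , drop = FALSE]
      rmvnorm(25, mean = colMeans(spc.cl), sigma = cov(spc.cl))
    })
    names(sim) <- levels(grp)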
100 "small" data sets of 25 spectra/class (i.e. 125 spectra of all classes together per small data set) were generated. For determining the real performance of the models, a large test set of 4·10^4 spectra/class was generated. This means that the sensitivities can be measured with a precision of better than 0.5 ± 0.005 (95% c.i.): the standard deviation of the observed performance is then σ(p̂) = sqrt(p (1 − p) / n) ≤ 0.5 / √n = 0.0025.

In addition, one large training set of 2·10^4 spectra/class was generated. This data set was used to estimate the best possible performance that can be obtained with the chosen classifiers on this idealized problem.

2.3. Classification Models

As classifier we chose PLS-LDA as implemented in package cbmodels [19], where the partial least squares (PLS) and linear discriminant analysis (LDA) models from packages pls [20] and MASS [21] are combined into one model. The projection by the PLS is a suitable variable reduction for LDA [22]. LDA models trained on the PLS scores suffer much less from instability than LDA models trained on data with large numbers of variates. The number of latent variables was set to 10 for n_train ≥ 4 training spectra/class. For the extremely small training sets, it was restricted to at most half the total number of spectra in the training set. All classification models were trained with all five classes.

In addition, we built two models using 2·10^4 simulated spectra/class and tested them with the large test set (4·10^4 spectra/class). These models are assumed to achieve the best possible performance LDA can reach with and without PLS dimensionality reduction for the given problem. The achieved sensitivities without PLS are 1.00 for rbc and leu and 0.91 for bt (column "sim. LDA" in tab. 1). The 10 latent variable PLS-LDA model trained on the same data set had lower sensitivities of 1.00 for rbc, 0.99 for leu, and 0.72 for class bt (column "sim. PLS-LDA").

For the real data, we report the best possible performance for PLS-LDA models of the complete data set using 10 latent variables (measured by 100× iterated 5-fold cross validation, column "real PLS-LDA"). In addition, we checked the performance of 100× iterated 5-fold cross validation when the validation splits are done by patient/batch (as the underlying structure of the measurement would require; column "real PLS-LDA batch-wise"). Here, 10 latent variable PLS-LDA can still perfectly recognize erythrocytes and sensitivities for leukocytes are close to 0.90, but among the tumour cell lines the model is basically guessing. 10 latent variable PLS-LDA is an extremely restrictive model set-up which is appropriate for the small sample sizes studied in this paper, but recognition of circulating tumour cells requires more elaborate modelling [8, 9].

The interested reader will find the confusion tables, i.e. the sensitivities as well as the specificities for the various types of misclassification, in the supplementary material.
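The combination can be sketched as follows; this is our re-sketch under assumed objects X (spectra matrix) and y (class factor), not the cbmodels source: PLS regression against a dummy-coded class matrix provides the scores, and the LDA is trained on these scores.

    library(pls)
    library(MASS)

    d   <- data.frame(class = y, X = I(X))      # keep the spectra matrix as one column
    d$Y <- model.matrix(~ class - 1, data = d)  # dummy coding, one column per class

    pls.fit <- plsr(Y ~ X, ncomp = 10, data = d)        # 10 latent variables
    lda.fit <- lda(unclass(scores(pls.fit)), d$class)   # LDA on the PLS scores

    ## prediction: project new spectra onto the latent variables, then LDA
    pred.plslda <- function(X.new) {
      s <- predict(pls.fit, newdata = data.frame(X = I(X.new)), type = "scores")
      predict(lda.fit, s)$class
    }

The dimension reduction to 10 scores is what stabilizes the LDA, as described above.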
2.4. Validation Set-Up

Iterated k-fold cross validation was chosen as validation scheme. While out-of-bootstrap validation is sometimes preferred for small sample sizes due to its lower variance, a previous study on spectroscopic data sets found comparable overall uncertainty for these two validation schemes [13]. In contrast to k-fold cross validation, the effective training sample size is not known in out-of-bootstrap validation. Out-of-bootstrap usually has the same nominal training sample size as the whole data set. However, it is pessimistically biased with respect to the final model. Such a pessimistic bias is usually observed if the training set is smaller than the whole data set, and this pessimistic bias is usually larger than that of 5- or 10-fold cross validation. This suggests that the duplicate cases in the bootstrap training sets do not contribute as much information for classifier training as the first instance of the given case does. Cross validation is unbiased with respect to the number of cases actually used for training of the surrogate models [11] and is therefore more suitable for calculating learning curves.

We used k = 5-fold cross validation with 100 iterations.

2.5. Growing Data Sets or Retrospective Learning Curves

Both real and simulated data sets were used for the learning curve estimation in a "growing" fashion. This simulates a scenario where at first very few cases are available and new, better models are built as further cases become available, following the practice of modeling and sample collection we usually encounter.

100 such growing data sets were analysed for both the real and the simulated data. This allows calculation of the average performance that can be expected for our cell classifier with 10 latent variable PLS-LDA models, as well as of the respective random uncertainty.

The alternative to the growing data set scenario, retrospective calculation of the learning curve, would lead to an intermediate between the two different learning curves: as there are many possibilities to draw few cases out of even a small data set, for the very small sample sizes the resulting curve will be closer to the average performance of that training sample size. However, as the drawn number of samples approaches the size of the small data set, the retrospective estimate of the learning curve tends towards the estimate of the growing data set.
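The growing scenario can be sketched as below (our illustration with hypothetical X, y, X.test, y.test; lda again stands in for the PLS-LDA, which for real spectra would need the dimension reduction of section 2.3 first). Each class grows in one fixed random order, so every smaller data set is a subset of the larger ones.

    library(MASS)

    growing.curve <- function(X, y, X.test, y.test, sizes = 2 : 25) {
      ## fixed random order in which the cases of each class "arrive"
      rank.in.class <- ave(seq_along(y), y,
                           FUN = function(i) sample(seq_along(i)))
      sapply(sizes, function(n) {
        grown <- rank.in.class <= n              # data set grown to n cases/class
        model <- lda(X[grown, , drop = FALSE], y[grown])
        pred  <- predict(model, X.test)$class
        tapply(pred == y.test, y.test, mean)     # per-class sensitivity
      })
    }

Rerunning this function many times (100 growing data sets in this study) yields the distribution of learning curves discussed in the next section.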
3. Learning Curves

The learning curve describes the performance of a given classifier for a problem as function of the training sample size [10]. The prediction errors a classifier makes may be divided into four categories:

1. the irreducible or Bayes error,
2. the bias due to the model set-up,
3. additional systematic deviations (bias), and
4. random deviations (variance).

The best possible performance that can be achieved with a given model set-up consists of the Bayes error, i.e. the best possible performance for the best possible model, and the bias for n_train → ∞. The latter two components depend on the training sample size and tend to zero as more cases become available.

The general discussion of learning curves, e.g. [10, fig. 7.8], usually considers the combination of the first three error types (as function of the training sample size), which form the average (expected) performance of a given classification rule for a particular problem if n_train training cases are available. The learning curve for a particular data set is known as the conditional learning curve [10].

In the context of classification based on microarray data, both empirically fitted functions [1] and parametric methods based on the difference in gene expression [3, 4] have been used to estimate learning curves and necessary sample sizes for the training of well performing classifiers. An extension of Mukherjee et al. [1] has been applied to medical text classification [2].

Microarray (gene expression) data sets are similar to biospectroscopic data sets in shape and size: in both cases the raw data typically consist of thousands of measurement channels (variates: genes, wavelengths) and typically hundreds to thousands of rows (expression profiles, spectra). However, they differ from typical biospectroscopic data sets in two important aspects. Firstly, biospectroscopic data sets often have rather large numbers of spectra of the same patient or batch, while multiple measurements of the same subject are far less common in microarray studies. The data sets in Mukherjee et al. [1] have total patient numbers between 53 and 78 (plus one large set of 280 patients); these sample sizes unfortunately do not allow checking their extrapolated predictions of the performance. Secondly, the information with respect to the classification problem is usually spread out over wide spectral ranges in biospectroscopic classification. In contrast, microarray classification typically relies on rather few genes that carry information among a large number of noise-only variates [3, 4].

Figures 3 and 4 give the (unconditional) learning curves for the real and simulated data in the top rows (lines). With smaller sample sizes, the random uncertainty grows and cannot be neglected: a particular data set of size n_train may differ substantially from the average data set of size n_train. For each training sample size, 90 of the 100 small data sets had performance inside the shaded area.

Figure 3: Learning curves of the real data set: sensitivities for the recognition of red blood cells (rbc), leukocytes (leu), and the BT-20 breast tumour cell line (bt); panels show sensitivity vs. n_train per class (0–20). Black: sensitivity observed for 100 iterations of 5-fold cross validation on the complete data set, approximating the best possible performance of a 10 latent variable PLS-LDA on this data set. Lines give the average; the shaded area covers the 5th to 95th percentile of iterations (bottom and middle rows) and of small data sets (top row). Thin lines: the average "one set large" and "one set cv" performances are repeated in the rows above for easier comparison. Colours: blue, performance measured with the large test set; red, performance measured by iterated cross validation. Bottom row ("one set cv"): learning curve of one growing data set, measured with 100× iterated 5-fold cross validation. Middle row ("one set large"): the same models as in the bottom row, but performance measured with the large test set; the percentiles depict the instability of the surrogate models trained during iterated cross validation, but are subject only to low uncertainty due to the finite test sample size. Top row ("learning curve"): sensitivity achieved for 100 different small data sets of size n_train, measured with the large test set.

Figure 4: Learning curves of the simulated data set. This plot was generated analogously to fig. 3; the only difference is that the best possible error (black lines) was measured with the large independent test set, see the description of the data analysis.

For the simulated data, one such growing data set is shown exemplarily in the middle row (true performance, i.e. tested with the large test set) and bottom row (cross validation estimate of the performance of the same models) of fig. 4. The example run performs exceptionally well for the leukocytes, but roughly at the 5th percentile of the sensitivity for red blood cells and the BT-20 cell line with respect to all possible data sets of size n_train. The example run of the real data (fig. 3) in general follows more closely the average sensitivity of data sets of the respective size. Learning curves reported for real data sets usually give only one point measurement for each classifier set-up and training sample size, and are usually calculated in the "retrospective" manner according to our definition above.

For the planning of sample sizes needed to train good classifiers, both the expected performance in the top rows of figs. 3 and 4 and the performance for a given growing data set as in the middle rows are of importance. The top rows answer the question how many samples should be collected if no samples are yet available for a specific problem, while the middle rows belong to the question how many more samples should be collected in addition to the already available ones.

In practice, however, neither the top nor the middle row learning curves are available; only (iterated) cross validation or out-of-bootstrap results can be obtained from within a given data set. The results of the cross validation in the bottom row are an unbiased estimate of the middle row (we use the actual training sample size of the surrogate models, i.e. 4/5 of the sample size of the small data set). However, the cross validation is subject to much higher random uncertainty, as the total number of test cases is much lower than with the large test set used to calculate the middle rows.

As explained before, the random uncertainty comes from two sources: firstly, model instability, i.e. differences between surrogate models built with different training sets of the same size, and secondly, testing uncertainty due to the finite number of spectra available for testing. The first is related to the number of training samples, while the second depends on the number of test samples. Testing with the large test set reduces the second source of uncertainty but does not influence the variation due to model instability. The only difference between the middle and bottom rows in figs. 3 and 4 are the test sets: exactly the same models are tested with the large test set (middle) and with the spectra held out by the cross validation (bottom row). In other words, the bottom row is a "small test sample size" approximation to the middle row. The simulations use n_test = 2·10^4 for reference (top and middle rows), meaning that the variation depicted in the middle row of fig. 4 is caused only by model instability. In contrast, for the real data only ca. 350–540 reference test spectra are available, and uncertainty due to the finite test sample size can contribute substantially to the observed variation in the middle row of fig. 3. However, the total random uncertainty of the iterated cross validation is dominated by the huge random uncertainty due to testing with only the up to 25 samples of the small data set.

This uncertainty is large enough to mask important features of the learning curve of the growing data set: in our example run for the simulated data, the sensitivity for erythrocytes is largely overestimated (other runs show equally large underestimation). The exceptionally good performance for the leukocytes with 4–10 training samples is not only not detected by the cross validation; in fact, two dips appear in the cross validation estimate of the example data set's learning curve. For the BT-20 cell line, we observe an oscillating behaviour with the addition of single cases up to a data set size of 9 samples (i.e. on average 7.2 training samples). Of course, we also observe runs that match the true (reference) learning curve of the particular data set more closely.
But even then, the percentiles indicate that the results are not reliable estimates of the learning curve of that data set.

The cross validation of the real data set underestimates the sensitivity for red blood cells for the extremely small sample sizes; however, the general development of the sensitivity as function of the training sample size of the example run is correctly reproduced. Also the learning curve for the leukocytes is matched quite closely. For the BT-20 cells, however, the cross validation again does not even resemble the shape of the example data set's learning curve.

In conclusion, the average performances observed during the iterated cross validation do not reliably recover the correct shape of the learning curve of the particular data set for our small sample size scenarios (middle rows), much less that of the performance of any data set of the respective training sample size (top rows). In contrast, the actual performance of the classifiers (top and middle rows) is acceptable to very good considering the actual training sample sizes: with 20 training cases per class, red blood cells are almost perfectly recognized, sensitivities around 0.90 are achieved for leukocytes, and even about 2 out of 3 of the very difficult BT-20 breast cancer cells are recognized correctly.

4. Sample Size Requirements for Classifier Testing

Thus, the precise measurement of the classifier performance turns out to be more complicated in such small sample size situations. Sample size planning for classification therefore needs to take into account also the sample size requirements for the testing of the classifier. We will discuss two important scenarios that allow estimating required test sample sizes: firstly, specifying an acceptable width for a confidence interval of the performance measure, and secondly, the number of test cases needed for a comparison of classifiers.
4.1. Specifying Acceptable Confidence Interval Widths

For Bernoulli processes, several approaches exist to estimate confidence intervals for the true probability p given the observed probability p̂ and the number of tests n; see [23, 24] for recommendations, particularly in small sample size situations. For the following discussion, we use the Bayes method with a uniform prior to obtain the minimal-length or highest posterior density (HPD) interval [25, 26]. For details about the statistical properties of this method, please refer to [24]. Package binom [25] offers a variety of other methods that can easily be used by the interested reader instead.

From a computational point of view, this method is convenient as the calculations can be formulated using the Beta distribution, which allows computing results not only for discrete numbers of events k, but for real-valued k. Thus, a p̂ obtained from testing many spectra can be used with a test sample size n_test equalling e.g. the number of test patients or batches.

Confidence intervals for the true proportion are calculated as function of the number of test samples (the denominator of the proportion) and the observed proportion p̂. The intervals are widest for p̂ = 0.5 and narrowest for p̂ = 0 or 1. Consequently, the test sample size necessary to measure the performance with a pre-specified precision can be calculated, either in a conservative (worst-case) fashion for p̂ = 0.5 or using existing knowledge/expectations about the achievable performance.

Figure 5 shows the 95% confidence intervals for different observed performances as function of the test sample size. For our example application, e.g., the sensitivity of the leukocyte class reaches 0.90 rather quickly. If that model were tested with 100 leukocytes (i.e. four times as many as in our largest small data sets) and 90 of them were correctly recognized, the 95% confidence interval would range from 0.83 (which would be considered quite bad, as leukocytes are fairly easy to recognize) to 0.94, which in the context of our classification task would be translated to "quite good". In other words, the confidence interval would still be too wide to allow a practical judgment of the classifier.

Figure 5: 95% confidence intervals for different observed performances p̂ (panels: p̂ = 0.75, 0.9, 0.95) as function of n_test (0–100). If 90 out of 100 samples of a class are recognized correctly (e.g. the sensitivity of the leukocytes with 25 training samples), the 95% confidence interval for the sensitivity ranges from 0.83 to 0.94, which in the context of our classification task reads as being between "quite bad" and "really good".

Similarly, already with 4–5 training spectra (out of 6 total red blood cell spectra in the data set), we observed perfect recognition of red blood cells in the simulation example's cross validation. But the 95% confidence interval still reaches down to 0.65. However, for p̂ = 1 the confidence intervals narrow very soon, and "already" with 58 test samples the lower limit of the 95% confidence interval reaches 0.95 (see fig. 6).

Figure 6 gives the width of the Bayesian confidence interval as function of the test sample size for different observed values of the performance. Note that specifying the confidence interval width to be less than 0.10 with an expected observed performance between 0.90 and 0.95 already corresponds to requiring between 3 and 5½ times as many test samples as we consider typically available in biospectroscopy. For confidence interval widths of less than 0.05, which would allow distinguishing the practical categories "bad" and "very good", hundreds of test cases are required. Also, this estimation of required sample sizes is very sensitive to the true proportion p: if p were in fact only 0.89 instead of the 0.9 assumed in the example, 153 instead of 141 test samples would be required to reach the specified confidence interval width.

Figure 6: 95% confidence interval widths for different observed performances p̂ (curves for p̂ = 0.5, 0.75, 0.9, 0.95, 0.975 and 1) as function of n_test (0–500). p̂ = 0.5 and 1 give the widest and narrowest possible confidence interval widths. E.g., if the confidence interval should not be more than 0.1 wide while a sensitivity of 0.9 is expected, n_test ≥ 141 samples need to be tested.
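These numbers can be reproduced with package binom, which implements the uniform-prior Bayes interval used here; type = "highest" requests the HPD interval, and non-integer x is handled by the underlying Beta distribution (see above). The search loop is our illustration, and the exact n may shift by a sample or two depending on the interval computation.

    library(binom)

    ci.width <- function(n, p.hat) {
      ci <- binom.bayes(x = p.hat * n, n = n, type = "highest")
      ci$upper - ci$lower
    }

    ## smallest n.test with 95 % CI width <= 0.1 at an expected sensitivity of 0.9
    n <- 25
    while (ci.width(n, p.hat = 0.90) > 0.1) n <- n + 1
    n     # ca. 141; repeating with p.hat = 0.89 shows the jump to ca. 153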
4.2. Demonstrating that a New Classifier is Better

A second important scenario that allows specifying necessary test sample sizes is demonstrating superiority over an already known classifier. E.g., the instrument is improved and the resulting advantage should be demonstrated; a rough estimate of the performance of the new instrument is available. How many samples are needed in order to prove the superiority of the new approach?

From a statistics point of view, comparing classifier performance is a typical hypothesis testing task. R package Hmisc [27] provides functions for power (bpower) and sample size estimation (bsamsize) for independent proportions with unequal test sample sizes, as described by Fleiss et al. [28]. The approximation overestimates power for small sample sizes [29]. However, this is not of much consequence here, as the calculated sample sizes will anyway be rough guesstimates rather than exact numbers of required samples: firstly, the exact performance of the improved classifier is unknown, so the sample size planning needs to check the sensitivity of the calculated numbers to this assumption. Secondly, the actual power of the calculated scenario can be checked by bpower.sim.

Assume our recognition of BT-20 cells were improved from the 0.75 sensitivity we obtain with 20 training samples/class to 0.90. A quick estimate of the necessary test sample size reveals that in this scenario the maximal obtainable power^1 (setting n_test for the new model to 10^5, i.e. infinite for practical purposes) is 1 − β = 0.62. In other words, there is no chance to prove the superiority of the new classifier with anything close to an acceptable type II error^2, due to the small test sample size available for the old model. The comparisons have most power if the tests are performed with equal sample sizes. For this case, tables are also available in Fleiss, Levin and Paik [30]. In our example, the usual power of 0.8 (i.e. type II error β = 0.2, with type I error^3 α = 0.05) needs at least 100 independent test cases truly belonging to class bt for each of the models. Note that paired tests can be much more powerful and thus require fewer samples. Paired tests can be used when the same cases can be measured again (impossible for our study: new cell culture batches need to be grown) or if the improvement is in the data analysis, so that the same instrumental data can be analysed by both methods.

If we could achieve 0.975 sensitivity for BT-20 cells, we would need to test with 63 test cases (accepting α = β = 0.10). Figure 7 shows that this is very sensitive to the assumed quality of the new model: if the new model has in fact "only" a sensitivity of 0.96 (corresponding to ca. 1 additional misclassification out of 63 test cases), already 117, or almost twice as many, test cases are needed. Note that this is a rather extreme example, as it means one order of magnitude (0.25 to 0.025) reduction in the fraction of unrecognized BT-20 cells, which is much larger than the improvements considered in the practice of biospectroscopic classification.

Figure 7: Test sample size n_test (0–500) necessary to demonstrate superiority of an improved model of sensitivity p_new (0.900–1.000), assuming the "old" model had p_old = 0.75 sensitivity and was tested with n_test = 25 samples, accepting a type I error of α = 0.05 and a type II error of β = 0.2 (solid line). Dotted: test sample size for the second model with α = β = 0.10. However, if α = β = 0.05 is required (dashed), even a model with 0.975 true sensitivity needs to be tested with at least 300 cases, and 116 cases are necessary to demonstrate the superiority of an improved model truly achieving 0.99 sensitivity.

Footnotes:
1. Probability that we correctly conclude that the new classifier is better than the old one if it actually is.
2. Probability that we wrongly conclude the new classifier is no better than the old one, although it actually is.
3. Probability that we wrongly conclude the new classifier is better than the old one, although it actually is not.
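The two calculations above can be reproduced as a minimal Hmisc script (values rounded in the text):

    library(Hmisc)

    ## maximal obtainable power: old model with p = 0.75 tested on 25 cases,
    ## new model with p = 0.90 tested on a practically infinite sample
    bpower(p1 = 0.75, p2 = 0.90, n1 = 25, n2 = 1e5, alpha = 0.05)   # ca. 0.62

    ## equal test sample sizes for power 0.8 at alpha = 0.05:
    ## ca. 100 cases of class bt per model
    bsamsize(p1 = 0.75, p2 = 0.90, alpha = 0.05, power = 0.8)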
In conclusion, well working classifiers need to be validated with at least 75 test cases in order to obtain confidence intervals that are narrow enough to draw practical conclusions about the model. Demonstrating the superiority of a new, improved classifier in general needs even more test cases and will often be impossible altogether if the test sample size for the old classifier was small.

5. Summary

Using a Raman spectroscopic five class classification problem as well as simulated data based on the real data set, we compared the sample sizes needed to train good classifiers with the sample sizes needed to demonstrate that the obtained classifiers work well. Due to the smaller test sample size, sensitivities are more difficult to determine precisely than specificities or overall hit rates.

Using typical small sample sizes of up to 25 samples per class, we calculated learning curves (sensitivity as function of the training sample size) using 100× iterated 5-fold cross validation. While the general shape of the learning curve could be determined correctly for the very easily recognized red blood cells, for more difficult recognition tasks not even the correct shape of the learning curve can be determined reliably within the small data set, as the precise measurement of classifier performance requires rather large test sample sizes (> 75 cases).

In consequence, we calculate necessary test sample sizes for different pre-specified testing scenarios, namely specifying acceptable widths for the confidence interval of the true sensitivity, and the number of test samples needed to demonstrate the superiority of one classifier over another. In order to obtain confidence interval widths ≤ 0.1, 140 test samples are necessary when 90% sensitivity is expected. In contrast, the recognition of leukocytes in our example application reaches 90% sensitivity already with about 20 training samples. Comparison of classifiers was found to require even larger test sample sizes (hundreds of statistically independent cases) in the general case.

In conclusion, we recommend starting sample size planning for classification by specifying acceptable confidence interval widths for the expected sensitivities.
This will lead to sample sizes that allow retrospective calculation of learning curves, and refined sample size planning in terms of both test and training sample size can then be done.

Acknowledgments

Graphics were generated using ggplot2 [31]. Financial support by the European Union via the Europäischer Fonds für Regionale Entwicklung (EFRE) and the Thüringer Ministerium für Bildung, Wissenschaft und Kultur (project B714-07037) as well as the funding by BMBF (FKZ 01EO1002) is highly acknowledged.

References

[1] S. Mukherjee, P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T. R. Golub, J. P. Mesirov, Estimating dataset size requirements for classifying DNA microarray data, J Comput Biol 10 (2) (2003) 119–142. doi:10.1089/106652703321825928.
[2] R. L. Figueroa, Q. Zeng-Treitler, S. Kandula, L. H. Ngo, Predicting sample size required for classification performance, BMC Med Inform Decis Mak 12 (1) (2012) 8. doi:10.1186/1472-6947-12-8.
[3] K. K. Dobbin, R. M. Simon, Sample size planning for developing classifiers using high-dimensional DNA microarray data, Biostatistics 8 (1) (2007) 101–117. doi:10.1093/biostatistics/kxj036.
[4] K. K. Dobbin, Y. Zhao, R. M. Simon, How large a training set is needed to develop a classifier for microarray data?, Clin Cancer Res 14 (1) (2008) 108–114. doi:10.1158/1078-0432.CCR-07-0443.
[5] A. Jain, B. Chandrasekaran, Dimensionality and Sample Size Considerations in Pattern Recognition Practice, in: P. R. Krishnaiah, L. Kanal (Eds.), Handbook of Statistics, Vol. II, North-Holland, Amsterdam, 1982, Ch. 39, pp. 835–855.
[6] S. Raudys, A. Jain, Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 252–264. doi:10.1109/34.75512.
[7] H. M. Kalayeh, D. A. Landgrebe, Predicting the required number of training samples, IEEE Trans Pattern Anal Mach Intell 5 (6) (1983) 664–667.
[8] U. Neugebauer, T. Bocklitz, J. H. Clement, C. Krafft, J. Popp, Towards detection and identification of circulating tumour cells using Raman spectroscopy, Analyst 135 (12) (2010) 3178–3182. doi:10.1039/c0an00608d.
[9] U. Neugebauer, J. H. Clement, T. Bocklitz, C. Krafft, J. Popp, Identification and differentiation of single cells from peripheral blood by Raman spectroscopic imaging, J Biophotonics 3 (8-9) (2010) 579–587. doi:10.1002/jbio.201000020.
[10] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition, Springer Verlag, New York, 2009.
[11] E. R. Dougherty, C. Sima, J. Hua, B. Hanczar, U. M. Braga-Neto, Performance of Error Estimators for Classification, Current Bioinformatics 5 (2010) 53–67.
[12] R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, in: C. S. Mellish (Ed.), Proceedings of the 14th International Joint Conference on Artificial Intelligence, 20–25 August 1995, Montréal, Québec, Canada, Morgan Kaufmann, USA, 1995, pp. 1137–1145.
[13] C. Beleites, R. Baumgartner, C. Bowman, R. Somorjai, G. Steiner, R. Salzer, M. G. Sowa, Variance reduction in estimating classification error using sparse datasets, Chemom. Intell. Lab. Syst. 79 (2005) 91–100.
[14] K. H. Esbensen, P. Geladi, Principles of Proper Validation: use and abuse of re-sampling for validation, J. Chemometrics 24 (3-4) (2010) 168–187.
[15] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0 (2011).
[16] C. Beleites, V. Sergo, hyperSpec: a package to handle hyperspectral data sets in R, R package v. 0.98-20120725 (2012).
[17] A. Genz, F. Bretz, T. Miwa, X. Mi, F. Leisch, F. Scheipl, T. Hothorn, mvtnorm: Multivariate Normal and t Distributions, R package v. 0.9-9992 (2012).
[18] A. Genz, F. Bretz, Computation of Multivariate Normal and t Probabilities, Lecture Notes in Statistics, Springer-Verlag, Heidelberg, 2009.
[19] C. Beleites, cbmodels: Collection of "combined" models: PCA-LDA, PLS-LDA, etc., R package v. 0.5-20120731 (2012).
[20] R. Wehrens, B.-H. Mevik, pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR), R package v. 2.1-0 (2007).
[21] W. N. Venables, B. D. Ripley, Modern Applied Statistics with S, 4th Edition, Springer, New York, 2002, ISBN 0-387-95457-0.
[22] M. Barker, W. Rayens, Partial least squares for discrimination, Journal of Chemometrics 17 (3) (2003) 166–173.
[23] L. Brown, T. Cai, A. DasGupta, Interval Estimation for a Binomial Proportion, Statistical Science 16 (2001) 101–133.
[24] A. M. Pires, C. Amado, Interval Estimators for a Binomial Proportion: Comparison of Twenty Methods, Revstat – Statistical Journal 6 (2) (2008) 165–197.
[25] S. Dorai-Raj, binom: Binomial Confidence Intervals For Several Parameterizations, R package v. 1.0-5 (2009).
[26] E. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, Cambridge, UK / New York, NY, 2003.
[27] F. E. Harrell Jr, with contributions from many other users, Hmisc: Harrell Miscellaneous, R package v. 3.9-3 (2012).
[28] J. L. Fleiss, A. Tytun, H. K. Ury, A Simple Approximation for Calculating Sample Sizes for Comparing Independent Proportions, Biometrics 36 (2) (1980) 343–346. doi:10.2307/2529990.
[29] M. Vorburger, B. Munoz, Simple Power Calculations: How Do We Know We Are Doing Them the Right Way?, in: Proceedings of the Survey Research Methods Section, American Statistical Association, 2006, pp. 3809–3812.
[30] J. L. Fleiss, B. Levin, M. C. Paik, Statistical Methods for Rates and Proportions, 3rd Edition, Wiley-Interscience, New Jersey, 2003.
[31] H. Wickham, ggplot2: elegant graphics for data analysis, Springer, New York, 2009.