A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining

Ernest Fokoué
Center for Quality and Applied Statistics, Rochester Institute of Technology
98 Lomb Memorial Drive, Rochester, NY 14623, USA
[email protected]

arXiv:1501.00604v1 [stat.ML] 3 Jan 2015

Abstract

Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category it falls in within the bigness taxonomy. Large p small n data sets, for instance, require a different set of tools from the large n small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication and Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress the fact that simplicity, in the sense of Ockham's razor non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets.

Keywords: Massive Data, Taxonomy, Parsimony, Sparsity, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Sequentialization, Cross Validation, Subsampling, Bias-Variance Trade-off, Generalization, Prediction Error.

I. Introduction

We consider a dataset $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i^\top \equiv (x_{i1}, x_{i2}, \cdots, x_{ip})$ denotes the p-dimensional vector of characteristics of the input space $\mathcal{X}$, and $y_i$ represents the corresponding categorical response value from the output space $\mathcal{Y} = \{1, \cdots, g\}$. Typically, one of the most basic ingredients in statistical data mining is the data matrix $\mathbf{X}$ given by

$$
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}. \qquad (1)
$$

Five aspects of the matrix $\mathbf{X}$ that are crucial to a taxonomy of massive data include: (i) the dimension p of the input space $\mathcal{X}$, which simply represents the number of explanatory variables measured; (ii) the sample size n, which represents the number of observations (sites) at which the variables were measured/collected; (iii) the relationship between n and p, namely the ratio n/p; (iv) the type of variables measured (categorical, ordinal, interval, count or real-valued), and the indication of the scales/units of measurement; (v) the relationships among the columns of $\mathbf{X}$, namely whether or not the columns are correlated (nearly linearly dependent). Indeed, as we will make clear later, massive data, also known as big data, come in various ways, types, shapes, forms and sizes. Different scenarios of massive data call upon tools and methods that can be drastically different at times.
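As a minimal illustration of the five aspects just listed, the sketch below (our own example, not from the paper; it assumes NumPy and the hypothetical helper name describe_data_matrix) computes n, p, the ratio n/p and a crude indicator of near linear dependence among the columns of X via the condition number of their correlation matrix.

```python
import numpy as np

def describe_data_matrix(X):
    """Summarize the taxonomy ingredients of a data matrix X: n, p, n/p, and
    a rough check of near linear dependence among the columns (aspect (v)),
    measured here by the condition number of the correlation matrix."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)   # p x p correlation matrix of the columns
    cond = np.linalg.cond(corr)           # very large value -> near collinearity
    return {"n": n, "p": p, "n_over_p": n / p, "corr_condition_number": cond}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X[:, 9] = X[:, 0] + 0.01 * rng.normal(size=200)   # make one column nearly dependent
    print(describe_data_matrix(X))
```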
The rest of this paper is organized as follows: Section 2 presents our suggested taxonomy for massive data based on a wide variety of scenarios. Section 3 presents a summary of the fundamental statistical learning theory along with some of the most commonly used statistical learning methods and their application in the context of massive data. Section 4 presents a comparison of predictive performances of some popular statistical learning methods on a variety of massive data sets. Section 5 presents our discussion and conclusion, along with some of the ideas we are planning to explore along the lines of the present paper.

II. On the Diversity of Massive Data Sets

Categorization of massiveness as a function of the input space dimensionality p

Our idea of a basic ingredient for a taxonomy of massive data comes from a simple reasoning. Consider the traditional multiple linear regression (MLR) setting with p predictor variables under Gaussian noise. In a typical model space search needed in variable selection, the best subsets approach fits $2^p - 1$ models and submodels. If p = 20, the space of linear models is of size about 1 million. Yes indeed, one theoretically has to search a space of 1 million models when p = 20. Now, if we have p = 30, the size of that space goes up to roughly 1 billion; if p = 40, the size of the model space goes up to roughly 1 trillion, and so on. Our simple rule is that any problem with an input space of more than 50 variables is a big data problem, because computationally searching a space of a thousand trillion models is clearly a huge/massive task for modern day computers. Clearly, those who earn their keep analyzing inordinately large input spaces like the ones inherent in microarray data (p for such data is in the thousands) will find this taxonomy somewhat naive, but it makes sense to us based on the computational insights underlying it. Besides, if the problem at hand requires the estimation of covariance matrices and their inverses through many iterations, the $O(p^3)$ computational complexity of matrix inversion would then require roughly 125,000 operations (for p = 50) at every single iteration, which could quickly become computationally untenable. Now, it is clear that no one in their right mind decides to exhaustively search a space of a thousand trillion models. However, this threshold gives us somewhat of a point to operate from. From now on, any problem with more than 50 predictor variables will be a big data problem, and any problem with p exceeding 100 will be referred to as a massive data problem.
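The combinatorial explosion just described is easy to verify numerically. The short sketch below (our illustration) prints the size $2^p - 1$ of the best-subsets model space for the values of p discussed above, together with the rough $O(p^3)$ cost of a single $p \times p$ matrix inversion.

```python
# Size of the best-subsets model space and rough cost of one p x p inversion,
# for the values of p discussed in the text.
for p in (20, 30, 40, 50, 100):
    n_models = 2 ** p - 1          # number of candidate models and submodels
    inversion_flops = p ** 3       # rough O(p^3) cost of inverting a p x p matrix
    print(f"p = {p:3d}: |model space| = {n_models:.3e}, one inversion ~ {inversion_flops:,} ops")
```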
Categorization of massiveness as a function of the sample size n

When it comes to ideas about determining how many observations one needs, common sense will have it that the more the merrier. After all, the more observations we have, the closer we are to the regime of the law of large numbers, and indeed, as the sample size grows, so does the precision of our estimation. However, some important machine learning methods like Gaussian process classifiers, Gaussian process regression estimators, the relevance vector machine (RVM), the support vector machine (SVM) and just about all other kernel methods operate in dual space and are therefore heavily dependent on the sample size n. The computational and statistical complexity of such methods is driven by that same sample size n. Some of these methods, like Gaussian processes and the relevance vector machine, require the inversion of $n \times n$ matrices. As a result, such methods could easily be computationally bogged down by too large a sample size n. Now, how large is too large? Well, it takes $O(n^3)$ operations to invert an $n \times n$ matrix. Anyone who works with such matrices quickly realizes that with modern-day computers, inverting matrices with more than a few hundred rows and columns is not very wise. These methods can become excruciatingly (impractically) slow or even unusable as n gets ever larger. For the purposes of our categorization, we set the cut-off at 1000 and define as observation-massive any data set that has n > 1000. Again, we derive this categorization from our observations on the computational complexity of matrix inversion and its impact on some of the state-of-the-art data mining techniques.

Categorization of massiveness as a function of the ratio n/p

From the previous argumentation, we could say that when p > 50 or n > 1000, we are computationally in the presence of massive data. It turns out, however, that the ratio n/p is even more important to massiveness and learnability than n and p taken separately. From experience, it is our view that for each explanatory variable under study, a minimum of 10 observations is needed to have a decent analysis from both accuracy and precision perspectives. Put in simple terms, the number of rows must be at least 10 times the number of columns, specifically n > 10p. Using this simple idea and the fact that information is an increasing function of n, we suggest the following taxonomy as a continuum of n/p.

             | n/p < 1                 | 1 ≤ n/p < 10            | n/p ≥ 10
             | Information Poverty     | Information Scarcity    | Information Abundance
             | (n ≪ p)                 |                         | (n ≫ p)
 n > 1000    | A: Large p, Large n     | B: Smaller p, Large n   | C: Much smaller p, Large n
 n ≤ 1000    | D: Large p, Smaller n   | E: Smaller p, Smaller n | F: Much smaller p, Small n

Table 1: In this taxonomy, cells A and D pose a lot of challenges.

III. Methods and Tools for Handling Massive Data

Batch data vs incremental data production

When it comes to the way in which the data is acquired or gathered, the traditionally assumed mode is the so-called batch setting, where all the data needed is available all at once. In state-of-the-art data mining, however, there are multiple scenarios where the data is produced/delivered in a sequential/incremental manner. This has prompted the surge in so-called online learning methods. As a matter of fact, the perceptron learning rule, arguably the first algorithm that launched the whole field of machine learning, is an online learning algorithm. Online algorithms have the distinct advantage that the data does not have to be stored in memory. All that is required is the storage of the model built up to time t; in that sense, the stored model is assumed to have accumulated the structure of the underlying process. Because of that distinct feature, one may think of using online algorithms even when the whole data set is available. Indeed, when the sample size n is so large that the data cannot fit in the computer memory, one can consider building a learning method that receives the data sequentially/incrementally rather than trying to load the whole data set into memory. We shall refer to this aspect of massive data as sequentialization or incrementalization. Sequentialization is therefore useful both for streaming data and for massive data that is too large to be loaded into memory all at once.
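As a minimal sketch of sequentialization (our own example; the text above only names the perceptron rule as the archetypal online algorithm), the code below processes observations one at a time with the classic perceptron update, so that only the current weight vector, never the full data set, has to be kept in memory.

```python
import numpy as np

def perceptron_online(stream, p, lr=1.0):
    """Classic perceptron learning rule applied to a stream of (x, y) pairs,
    with labels y in {-1, +1}. Only the weights and bias are stored."""
    w = np.zeros(p)
    b = 0.0
    for x, y in stream:                    # one pass over the (possibly unbounded) stream
        if y * (np.dot(w, x) + b) <= 0:    # misclassified -> update the model
            w += lr * y * x
            b += lr * y
    return w, b

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5000, 3))
    y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.3)   # linearly separable labels
    w, b = perceptron_online(zip(X, y), p=3)
    print("training accuracy:", round(float(np.mean(np.sign(X @ w + b) == y)), 3))
```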
Missing Values and Imputation Schemes

In most scenarios of massive data analytics, it is very common to be faced with missing values. The literature on missing values is very large, and we will herein simply mention very general guidelines. One of the first things one needs to consider with missing values is whether they are missing systematically or missing at random. The second important aspect is the rate of missingness. Clearly, when we have an abundance of data, the number of missing values is viewed differently. Three approaches are often used to address missingness: (a) deletion, which consists of deleting all the rows that contain any missingness; (b) central imputation, which consists of filling the missing cells of the data matrix with central tendencies like the mode, median or mean; (c) model-based imputation using various adaptations of the ubiquitous Expectation-Maximization (EM) algorithm.

Inherent lack of structure and importance of preprocessing

Sentiment analysis based on social media data from Facebook and Twitter, topic modelling based on a wide variety of textual data, classification of tourist documents, or, to be more general, the whole field of text mining and text categorization, require the manipulation of inherently unstructured data. All these machine learning problems are of great interest to end-users, statistical machine learning practitioners and theorists, but cannot be solved without sometimes huge amounts of extensive pre-processing. The analysis of a text corpus, for instance, never starts with a data matrix like the X defined in Equation (1). With inherently unstructured data like text data, the pre-processing often leads to data matrices whose entries are frequencies of terms. It is important to mention that term frequency matrices tend to contain many zeroes, because a term deemed important for a handful of documents will tend not to appear in many other documents. This content sparsity can be a source of a variety of modelling problems.

Homogeneous vs Heterogeneous input space

There are indeed many scenarios of massive data where the input space is homogeneous, i.e. where all the variables are of the same type. Audio processing, image processing and video processing all belong to a class of massive data where all the variables are of the same type. There are, however, many other massive data scenarios where the input space is made up of variables of various different types. Such heterogeneous input spaces arise in fields like business, marketing, social sciences, psychology, etc., where one can have categorical, ordinal, interval, count and real-valued variables gathered on the same entity. Such scenarios call for hybridization, which may take the form of combining two or more data-type-specific methods in order to handle the heterogeneity of the input space. In kernel methods for instance, if one has both textual inputs and real-valued inputs, then one could simply use a kernel $\mathcal{K} = \alpha\mathcal{K}_1 + (1-\alpha)\mathcal{K}_2$ that is the convex combination of two data-type-specific kernels, namely a string kernel $\mathcal{K}_1$ and a real-valued kernel $\mathcal{K}_2$. Hybridization can also be used directly in modelling through the use of combinations of models.

Difference in measurement scale and the importance of transformation

Even when the input space is homogeneous, it is almost always the case that the variables are measured on different scales. This difference in scales can be the source of many modelling difficulties. A simple way to address this scale heterogeneity is to perform straightforward transformations that project all the variables onto the same scale.

Standardization: The most commonly used transformation is standardization, which leads to all the variables having zero mean and unit variance. Indeed, if $X_j$ is one of the variables in $\mathcal{X}$ and we have n observations $X_{1j}, X_{2j}, \cdots, X_{nj}$, then the standardized version of $X_{ij}$ is

$$
\tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_j}{\sqrt{\sum_{i=1}^{n}\left(X_{ij} - \bar{X}_j\right)^2}}, \qquad \text{where } \ n\bar{X}_j = \sum_{i=1}^{n} X_{ij}.
$$

Unitization: Unitization is another commonly used form of transformation. It simply consists of transforming the variables so that all take values in the unit interval [0,1]. The resulting input space is therefore the unit p-dimensional hypercube, namely $[0,1]^p$. With unitization, if $X_j$ is one of the variables in $\mathcal{X}$ and we have n observations $X_{1j}, X_{2j}, \cdots, X_{nj}$, then the unitized version of $X_{ij}$ is given by

$$
\tilde{X}_{ij} = \frac{X_{ij} - \min X_j}{\max X_j - \min X_j}.
$$
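Both transformations are one-liners once the data matrix is in hand. The sketch below (our illustration) standardizes and unitizes the columns of a NumPy array following the two formulas above.

```python
import numpy as np

def standardize(X):
    """Column-wise standardization: subtract the column mean and divide by the
    root sum of squared deviations, following the formula in the text."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).sum(axis=0))

def unitize(X):
    """Column-wise unitization: map each variable onto the unit interval [0, 1]."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(loc=[0, 100], scale=[1, 25], size=(500, 2))   # two very different scales
    U = unitize(X)
    print(U.min(axis=0), U.max(axis=0))                 # columns now span [0, 1]
    print(np.round(standardize(X).mean(axis=0), 6))     # column means ~ 0
```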
Dimensionality reduction and feature extraction

Learning, especially statistical machine learning, is synonymous with dimensionality reduction. Indeed, after data is gathered, especially massive data, nothing can be garnered in terms of insights until some dimensionality reduction is performed to provide meaningful summaries revealing the patterns underlying the data. Typically, when people speak of dimensionality reduction, they have in mind the determination of some intrinsic dimensionality q of the input space, where q ≪ p. There are many motivations for dimensionality reduction: (a) achieve orthogonality in the input space; (b) eliminate redundant and noise variables, and as a result perform the learning in a lower-dimensional and orthogonal input space with the benefit of variance reduction in the estimator. In practice, lossy data compression techniques like principal component analysis (PCA) and singular value decomposition (SVD) are the methods of choice for dimensionality reduction. However, when n ≪ p, most of these techniques cannot be used directly in their generic forms.

Kernelization and the Power of Mapping to Feature Spaces

In some applications, like signal processing, it is always the case that n ≪ p in the time domain. A ten-second audio track at a 44100 Hz sampling rate generates a vector of dimension p = 441000 in the time domain, and one typically has only a few hundred or maybe a thousand tracks for the whole analysis. Typical image processing problems are similar in terms of dimensionality, with a simple face image of size 640 × 512 generating a p = 327680 dimensional input space. In both these cases, it is impossible to use basic PCA or SVD, because n ≪ p makes it impossible to estimate the covariance structure needed in the eigenvalue decomposition. One of the solutions to this problem is the use of methods that operate in dual space, like kernel methods. In recent years, kernelization has been applied widely and with tremendous success to PCA, canonical correlation analysis (CCA), regression, logistic regression and k-means clustering, just to name a few. Given a data set with n input vectors $\mathbf{x}_i \in \mathcal{X}$ from some p-dimensional space, the main ingredient in kernelization is a bivariate function $\mathcal{K}(\cdot,\cdot)$ defined on $\mathcal{X}\times\mathcal{X}$ and with values in $\mathbb{R}$, and the corresponding matrix of similarities $\mathbf{K}$, known as the Gram matrix and defined as

$$
\mathbf{K} = \begin{bmatrix}
\mathcal{K}(\mathbf{x}_1,\mathbf{x}_1) & \mathcal{K}(\mathbf{x}_1,\mathbf{x}_2) & \cdots & \mathcal{K}(\mathbf{x}_1,\mathbf{x}_n) \\
\mathcal{K}(\mathbf{x}_2,\mathbf{x}_1) & \mathcal{K}(\mathbf{x}_2,\mathbf{x}_2) & \cdots & \mathcal{K}(\mathbf{x}_2,\mathbf{x}_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathcal{K}(\mathbf{x}_n,\mathbf{x}_1) & \mathcal{K}(\mathbf{x}_n,\mathbf{x}_2) & \cdots & \mathcal{K}(\mathbf{x}_n,\mathbf{x}_n)
\end{bmatrix}.
$$

Crucial to most operations like kernel PCA is the centered version of the Gram matrix, given by

$$
\tilde{\mathbf{K}} = (\mathbf{I}_n - \mathrm{I\!I}_n)\,\mathbf{K}\,(\mathbf{I}_n - \mathrm{I\!I}_n) = \mathbf{K} - \mathrm{I\!I}_n\mathbf{K} - \mathbf{K}\,\mathrm{I\!I}_n + \mathrm{I\!I}_n\mathbf{K}\,\mathrm{I\!I}_n,
$$

where $\mathbf{I}_n \in \mathbb{R}^{n\times n}$ and $\mathrm{I\!I}_n \in \mathbb{R}^{n\times n}$ are both $n \times n$ matrices, defined as

$$
\mathrm{I\!I}_n = \frac{1}{n}\begin{bmatrix}
1 & 1 & \cdots & 1 \\
1 & 1 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1
\end{bmatrix}
\qquad \text{and} \qquad
\mathbf{I}_n = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}.
$$

The next step is to solve the eigenvalue problem

$$
\frac{1}{n}\tilde{\mathbf{K}}\mathbf{v}_i = \lambda_i\mathbf{v}_i,
$$

where $\mathbf{v}_i \in \mathbb{R}^n$ and $\lambda_i \in \mathbb{R}$ for $i = 1, \cdots, n$. In matrix form, the eigenvalue problem is

$$
\frac{1}{n}\tilde{\mathbf{K}} = \mathbf{V}\Lambda\mathbf{V}^\top.
$$

In fact, basic PCA can be formulated in kernel form using the Euclidean inner product kernel $\mathcal{K}(\mathbf{x}_i,\mathbf{x}_j) = \langle\mathbf{x}_i,\mathbf{x}_j\rangle = \mathbf{x}_i^\top\mathbf{x}_j$, sometimes referred to as the vanilla kernel. If we center the data, i.e. such that $\sum_{i=1}^{n}x_{ij} = 0$, then the Gram matrix is

$$
\mathbf{K} = \begin{bmatrix}
\mathbf{x}_1^\top\mathbf{x}_1 & \mathbf{x}_1^\top\mathbf{x}_2 & \cdots & \mathbf{x}_1^\top\mathbf{x}_n \\
\mathbf{x}_2^\top\mathbf{x}_1 & \mathbf{x}_2^\top\mathbf{x}_2 & \cdots & \mathbf{x}_2^\top\mathbf{x}_n \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_n^\top\mathbf{x}_1 & \mathbf{x}_n^\top\mathbf{x}_2 & \cdots & \mathbf{x}_n^\top\mathbf{x}_n
\end{bmatrix} = \mathbf{X}\mathbf{X}^\top.
$$

Now, the covariance matrix is $\mathbf{C} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^\top = \frac{1}{n}\mathbf{X}^\top\mathbf{X}$, and PCA based on the covariance is simply $\frac{1}{n}\mathbf{X}^\top\mathbf{X}\,\mathbf{w}_j = \lambda_j\mathbf{w}_j$ for $j = 1, \cdots, p$, with $\mathbf{w}_j \in \mathbb{R}^p$ and $\lambda_j \in \mathbb{R}$.
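As a hedged illustration of the double-centering step and the dual-space eigenproblem above (a generic kernel PCA sketch of ours, not code from the paper), the snippet below builds the Gram matrix with the vanilla kernel, centers it, and extracts the leading eigenvectors; substituting an RBF or string kernel for the inner product turns the same few lines into nonlinear kernel PCA.

```python
import numpy as np

def kernel_pca_scores(X, n_components=2, kernel=None):
    """Kernel PCA in dual space: build the Gram matrix K, double-center it,
    solve the eigenvalue problem, and return projections onto the leading
    eigenvectors (scaled by the square roots of their eigenvalues, one
    common normalization convention)."""
    n = X.shape[0]
    if kernel is None:
        K = X @ X.T                        # vanilla (Euclidean inner product) kernel
    else:
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    ones_n = np.full((n, n), 1.0 / n)      # the matrix II_n of the text
    K_tilde = K - ones_n @ K - K @ ones_n + ones_n @ K @ ones_n
    eigval, eigvec = np.linalg.eigh(K_tilde / n)
    order = np.argsort(eigval)[::-1][:n_components]
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 50))
    print(kernel_pca_scores(X, n_components=2).shape)   # (300, 2)
```

Note that the cost of this computation is driven entirely by n, not p, which is precisely why the dual formulation is attractive when n ≪ p.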
Aggregation and the Appeal of Ensemble Learning

It is often the case with massive data that selecting a single model does not lead to the optimal prediction. For instance, in the presence of multicollinearity, which is almost inevitable when p is very large, function estimators are typically unstable and of large variance. The now popular bootstrap aggregating, also referred to as bagging, offers one way to reduce the variance of the estimator by creating an aggregation of bootstrapped versions of the base estimator. This is an example of ensemble learning, with the aggregation/combination formed from equally weighted base learners.

Bagging regressors: Let $\hat{g}^{(b)}(\cdot)$ be the bth bootstrap replication of the estimated base regression function $\hat{g}(\cdot)$. Then the bagged version of the estimator is given by

$$
\hat{g}^{(\mathrm{bagging})}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B}\hat{g}^{(b)}(\mathbf{x}).
$$

If the base learner is a multiple linear regression model estimator $\hat{g}(\mathbf{x}) = \hat{\beta}_0 + \mathbf{x}^\top\hat{\beta}$, then the bth bootstrapped replicate is $\hat{g}^{(b)}(\mathbf{x}) = \hat{\beta}_0^{(b)} + \mathbf{x}^\top\hat{\beta}^{(b)}$, and the bagged version is

$$
\hat{g}^{(\mathrm{bagging})}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B}\left(\hat{\beta}_0^{(b)} + \mathbf{x}^\top\hat{\beta}^{(b)}\right).
$$

Bagging classifiers: Consider a multi-class classification task with labels y coming from $\mathcal{Y} = \{1, 2, \cdots, m\}$ and predictor variables $\mathbf{x} = (x_1, x_2, \cdots, x_q)^\top$ coming from a q-dimensional space $\mathcal{X}$. Let $\hat{g}^{(b)}(\cdot)$ be the bth bootstrap replication of the estimated base classifier $\hat{g}(\cdot)$, such that $\hat{y}^{(b)} = \hat{g}^{(b)}(\mathbf{x})$ is the bth bootstrap estimated class of $\mathbf{x}$. The estimated response by bagging is obtained using the majority vote rule, which means the most frequent label throughout the B bootstrap replications. Namely, $\hat{g}^{(\mathrm{bagging})}(\mathbf{x})$ is the most frequent label in $\hat{C}^{(B)}(\mathbf{x})$, where

$$
\hat{C}^{(B)}(\mathbf{x}) = \left\{\hat{g}^{(1)}(\mathbf{x}), \hat{g}^{(2)}(\mathbf{x}), \cdots, \hat{g}^{(B)}(\mathbf{x})\right\}.
$$

Succinctly, we can write the bagged label of $\mathbf{x}$ as

$$
\hat{g}^{(\mathrm{bagging})}(\mathbf{x}) = \underset{y\in\mathcal{Y}}{\arg\max}\left\{\mathrm{freq}_{\hat{C}^{(B)}(\mathbf{x})}(y)\right\} = \underset{y\in\mathcal{Y}}{\arg\max}\left\{\sum_{b=1}^{B}\mathbf{1}_{\{y=\hat{g}^{(b)}(\mathbf{x})\}}\right\}.
$$

It must be emphasized that, in general, ensembles do not assign equal weights to base learners in the aggregation. The general formulation in the context of regression, for instance, is

$$
\hat{g}^{(\mathrm{agg})}(\mathbf{x}) = \sum_{b=1}^{B}\alpha^{(b)}\hat{g}^{(b)}(\mathbf{x}),
$$

where the aggregation is often convex, i.e. $\sum_{b=1}^{B}\alpha^{(b)} = 1$.

Parallelization

When the computational complexity of building the base learner is high, using ensemble learning techniques like bagging becomes very inefficient, sometimes to the point of being impractical. One way around this difficulty is the use of parallel computation. In recent years, both R and Matlab have offered the capacity to parallelize operations. Big data analytics will increasingly need parallelization as a way to speed up computations, or sometimes simply to make it possible to handle massive data that cannot fit into a single computer's memory.
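The sketch below (our illustration, following the bagged multiple linear regression example above) fits B ordinary least squares estimators on bootstrap resamples and averages their predictions. Since the B fits are mutually independent, they are exactly the kind of work that parallelization can distribute across cores, for example with Python's standard concurrent.futures module; the version here keeps the loop sequential for simplicity.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares fit returning (beta0_hat, beta_hat)."""
    Xd = np.column_stack([np.ones(len(y)), X])        # add intercept column
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return coef[0], coef[1:]

def bagged_ols_predict(X, y, X_new, B=100, seed=0):
    """Bagged multiple linear regression: average the predictions of B bootstrap
    replicates of the OLS base learner (equally weighted aggregation).
    The B fits are independent and could be farmed out to worker processes."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(X_new)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap resample of the rows
        b0, beta = fit_ols(X[idx], y[idx])
        preds[b] = b0 + X_new @ beta
    return preds.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 5))
    y = 2.0 + X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)
    print(bagged_ols_predict(X, y, X[:3]))
```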
Regularization and the Power of Prior Information

All statistical machine learning problems are inherently inverse problems, in the sense that learning methods seek to optimally estimate an unknown generating function using empirical observations assumed to be generated by it. As a result, statistical machine learning problems are inherently ill-posed, in the sense that they typically violate at least one of Hadamard's three well-posedness conditions. For clarity, according to Hadamard a problem is well-posed if it fulfills the following three conditions: (a) a solution exists; (b) the solution is unique; (c) the solution is stable, i.e. it does not change drastically under small perturbations. For many machine learning problems, the first condition of well-posedness, namely existence, is fulfilled. However, the solution is either not unique or not stable. With large p small n, for instance, the solution is not only non-unique but also unstable, due to the singularities resulting from the fact that n ≪ p. Typically, the regularization framework is used to isolate a feasible and optimal (in some sense) solution. Tikhonov's regularization is the one most commonly resorted to, and typically amounts to a Lagrangian formulation of a constrained version of the initial problem, the constraints being the devices/objects used to isolate a unique and stable solution.

IV. Statistical Machine Learning Methods for Massive Data

We consider the traditional supervised learning task of pattern recognition with the goal of estimating a function f that maps an input space $\mathcal{X}$ to a set of labels $\mathcal{Y}$. We consider the symmetric zero-one loss $\ell(Y, f(X)) = \mathbf{1}_{\{Y\neq f(X)\}}$, and the corresponding theoretical risk function

$$
R(f) = \mathbb{E}[\ell(Y, f(X))] = \int_{\mathcal{X}\times\mathcal{Y}}\ell(y, f(x))\,dP(x, y) = \Pr[Y \neq f(X)].
$$

Ideally, one would like to find the universally best classifier $f^*$ that minimizes the rate $R(f)$ of misclassification, i.e.,

$$
f^* = \underset{f}{\arg\min}\,\Big\{\mathbb{E}[\ell(Y, f(X))]\Big\} = \underset{f}{\arg\min}\,\Big\{\Pr[Y \neq f(X)]\Big\}.
$$

It is impossible in practice to find $f^*$, because that would require knowing the joint distribution of (X, Y), which is usually unknown. In a sense, the theoretical risk R(f) serves as a standard only and helps establish some important theoretical results in pattern recognition. For instance, although in most practical problems one cannot effectively compute it, it has been shown theoretically that the universally best classifier $f^*$ is the so-called Bayes classifier, the one obtained through Bayes' formula by computing the posterior probability of class membership as the discriminant function, namely,

$$
f^*(\mathbf{x}) = \mathrm{class}^*(\mathbf{x}) = \underset{j\in\{1,\cdots,g\}}{\arg\max}\big\{\Pr[Y = j \mid \mathbf{x}]\big\} = \underset{j\in\{1,\cdots,g\}}{\arg\max}\left\{\frac{\pi_j\,p(\mathbf{x}\mid y = j)}{p(\mathbf{x})}\right\}.
$$

Assuming multivariate Gaussian class-conditional densities with common covariance matrix $\Sigma$, mean vectors $\mu_0$ and $\mu_1$, and equal prior class probabilities, the Bayes risk, that is the risk associated with the Bayes classifier, is given by $R(f^*) = R^* = \Phi(-\sqrt{\Delta}/2)$, where $\Phi(\cdot)$ is the standard normal cdf and

$$
\Delta = (\mu_1 - \mu_0)^\top\Sigma^{-1}(\mu_1 - \mu_0).
$$

Once again, it is important to recall that this $R^*$ is not knowable in practice, and what typically happens is that, instead of seeking to minimize the theoretical risk R(f), experimenters focus on minimizing its empirical counterpart, known as the empirical risk. Given an i.i.d. sample $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_n, y_n)\}$, the corresponding empirical risk is given by

$$
\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{y_i\neq f(\mathbf{x}_i)\}},
$$

which is simply the observed (empirical) misclassification rate. It is reassuring to know that fundamental results in statistical learning theory (see Vapnik (2000)) establish that, as the sample size goes to infinity, the empirical risk mimics the theoretical risk:

$$
\lim_{n\to\infty}\Pr\big[\,|\hat{R}(f) - R(f)| < \epsilon\,\big] = 1.
$$

From a practical perspective, this means that the empirical risk provides a tangible way to search the space of possible classifiers.
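The theoretical quantities above can be checked numerically in the two-class Gaussian setting just described. The sketch below (our illustration; it assumes equal class priors and uses SciPy for the normal cdf) simulates data from two Gaussians with common covariance, computes the Bayes risk $\Phi(-\sqrt{\Delta}/2)$, and compares it with the empirical risk of the corresponding linear discriminant rule.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
p, n = 5, 20000
mu0, mu1 = np.zeros(p), np.full(p, 0.6)
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))       # common covariance matrix

# Theoretical Bayes risk: R* = Phi(-sqrt(Delta)/2), Delta the Mahalanobis distance
Sigma_inv = np.linalg.inv(Sigma)
Delta = (mu1 - mu0) @ Sigma_inv @ (mu1 - mu0)
bayes_risk = norm.cdf(-np.sqrt(Delta) / 2)

# Empirical risk of the (here fully known) linear discriminant rule on simulated data
y = rng.integers(0, 2, size=n)
X = np.where(y[:, None] == 1, mu1, mu0) + rng.multivariate_normal(np.zeros(p), Sigma, size=n)
w = Sigma_inv @ (mu1 - mu0)                           # discriminant direction
threshold = w @ (mu0 + mu1) / 2                       # midpoint rule under equal priors
y_hat = (X @ w > threshold).astype(int)
empirical_risk = np.mean(y_hat != y)

print(f"Bayes risk: {bayes_risk:.4f}, empirical risk: {empirical_risk:.4f}")
```

With n large, the empirical misclassification rate of this rule settles close to the theoretical value, which is exactly the convergence statement displayed above.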
Another crucial point is the emphasis on the fact that, even with this empirical risk, we still cannot feasibly search for the universally best function, for such a space would be formidably large. That is where the need to choose a particular function class arises. In other words, instead of seeking an elusive universally best classifier, one simply proposes a plausible class of classifiers, possibly based on aspects of the data, then finds the empirical risk minimizer in that class, and then, if the need arises, theoretically finds out how the associated risk compares to the Bayes risk. One of the fundamental results in statistical learning theory has to do with the fact that the minimizer of the empirical risk can turn out to be overly optimistic and lead to poor generalization performance. It is indeed the case that, by making our estimated classifier very complex, it can adapt too well to the data at hand, meaning a very low in-sample error rate, but yield very high out-of-sample error rates, due to overfitting, the estimated classifier having learned both the signal and the noise. In technical terms, this is referred to as the bias-variance dilemma, in the sense that by increasing the complexity of the estimated classifier, the bias is reduced (good fit, all the way to the point of overfitting) but the variance of that estimator is increased. On the other hand, considering much simpler estimators leads to less variance but higher bias (due to underfitting, the model not being rich enough to fit the data well). This bias-variance dilemma is particularly potent with massive data when the number of predictor variables p is much larger than the sample size n. One of the main tools in the modern machine learning arsenal for dealing with this is the so-called regularization framework, whereby instead of using the empirical risk alone, a constrained version of it, also known as the regularized or penalized version, is used:

$$
\hat{R}_{\mathrm{reg}}(f) = \hat{R}(f) + \lambda\|f\|_{\mathcal{H}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{y_i\neq f(\mathbf{x}_i)\}} + \lambda\|f\|_{\mathcal{H}},
$$

where $\lambda$ is referred to as the tuning (regularization) parameter, and $\|f\|_{\mathcal{H}}$ is some measure of the complexity of f within the class $\mathcal{H}$ from which it is chosen. It makes sense that choosing a function f with a smaller value of $\|f\|_{\mathcal{H}}$ helps avoid overfitting. The value of $\lambda \in [0, +\infty)$ controls the trade-off between bias (goodness of fit) and function complexity (which is responsible for the variance). Practically though, it may still be hard to even explore the theoretical properties of a given classifier and compare it to the Bayes risk, precisely because methods typically do not act directly on the zero-one loss function, but instead use at best surrogates of it. Indeed, within a selected class $\mathcal{H}$ of potential classifiers, one typically chooses some loss function $\ell(\cdot,\cdot)$ with desirable properties like smoothness and/or convexity (this is because one needs at least to be able to build the desired classifier), and then finds the minimizer of its regularized version, i.e.,

$$
\hat{R}_{\mathrm{reg}}(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(y_i, f(\mathbf{x}_i)) + \lambda\|f\|_{\mathcal{H}}.
$$

Note that $\lambda$ still controls the bias-variance trade-off as before. Now, since the loss function typically chosen is not the zero-one loss on which the Bayes classifier (universally best) is based, there is no guarantee that the best classifier in the selected class $\mathcal{H}$ under the chosen loss function $\ell(\cdot,\cdot)$ will mimic $f^*$. As a matter of fact, each optimal classifier from a given class $\mathcal{H}$ will typically perform well if the data at hand, and the generator from which it came, somewhat accord with the properties of the space $\mathcal{H}$. This remark is probably what prompted the famous so-called no free lunch theorem, herein stated informally.

Theorem 1. (No Free Lunch) There is no learning method that is universally superior to all other methods on all data sets. In other words, if a learning method is presented with a data set whose inherent patterns violate its assumptions, then that learning method will under-perform.
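A concrete and deliberately simple instance of the regularized empirical risk above is ridge regression, where $\ell$ is the squared loss and $\|f\|_{\mathcal{H}}$ is the squared Euclidean norm of the coefficient vector. The sketch below (our example, not from the paper) fits ridge estimators over a grid of $\lambda$ values and reports training and held-out error, making the bias-variance trade-off controlled by $\lambda$ visible.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n)||y - X beta||^2 + lam * ||beta||^2 (closed-form solution)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(6)
n, p = 60, 50                                    # n barely larger than p: unstable OLS regime
beta_true = np.concatenate([np.ones(5), np.zeros(p - 5)])
X_tr, X_te = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y_tr = X_tr @ beta_true + rng.normal(size=n)
y_te = X_te @ beta_true + rng.normal(size=1000)

for lam in (1e-6, 1e-2, 1e-1, 1.0, 10.0):
    beta_hat = ridge_fit(X_tr, y_tr, lam)
    err_tr = np.mean((y_tr - X_tr @ beta_hat) ** 2)   # in-sample (training) error
    err_te = np.mean((y_te - X_te @ beta_hat) ** 2)   # out-of-sample (test) error
    print(f"lambda = {lam:7.1e}: train MSE = {err_tr:6.3f}, test MSE = {err_te:6.3f}")
```

Very small $\lambda$ gives a near-zero training error but a large test error (low bias, high variance); very large $\lambda$ does the opposite; intermediate values strike the trade-off described above.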
The above no free lunch theorem basically says that there is no such thing as a universally superior learning method that outperforms all other methods on all possible data, no matter how sophisticated the method may appear to be. Indeed, it is very humbling to see that some of the methods deemed somewhat simple sometimes hugely outperform the most sophisticated ones when compared on the basis of average out-of-sample (test) error. It is common practice in data mining and machine learning in general to compare methods based on benchmark data and on empirical counterparts of the theoretical predictive measures, often computed using some form of re-sampling tool like the bootstrap or cross-validation. Massive data, also known as big data, come in various types, shapes, forms and sizes. The specific statistical machine learning technique used to handle a particular massive data set depends on which aspect of the taxonomy it falls in. Indeed, it is important to emphasize that the no free lunch theorem applies even more potently here, in the sense that there is no panacea that universally applies to all massive data sets. It is important, however, to quickly stress the fact that simplicity, in the sense of Ockham's razor non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. In this paper, we propose a rough idea of a possible taxonomy of massive data, along with some of the most commonly used tools for handling each particular class of massiveness. We consider a few data sets of different types of massiveness, and we demonstrate through computational results that the no free lunch theorem applies as strongly as ever. We consider some of the most commonly used pattern recognition techniques, from those that are most simple and intuitive to some that are considered sophisticated and state-of-the-art, and we show that the performances sometimes vary drastically from data set to data set. It turns out, as we will show, that depending on the type of massiveness, some methods cannot even be used. We also provide our taxonomy of massiveness along with different approaches to dealing with each case. See Vapnik (2000) and Guo et al. (2005).

Linear Discriminant Analysis

Under the assumption of multivariate normality of the class-conditional densities with equal covariance matrices, namely $(\mathbf{x}\mid y = j) \sim \mathrm{MVN}(\mu_j, \Sigma)$, or specifically,

$$
p(\mathbf{x}\mid y = j) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\mu_j)^\top\Sigma^{-1}(\mathbf{x}-\mu_j)\right\},
$$