TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries

Evan R. Sparks, Computer Science Division, UC Berkeley ([email protected])
Ameet Talwalkar, Computer Science Dept., UCLA ([email protected])
Michael J. Franklin, Computer Science Division, UC Berkeley ([email protected])
Michael I. Jordan, Computer Science Division, UC Berkeley ([email protected])
Tim Kraska, Dept. of Computer Science, Brown University ([email protected])

arXiv:1502.00068v2 [cs.DB] 8 Mar 2015

ABSTRACT

The proliferation of massive datasets combined with the development of sophisticated analytical techniques has enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs). A major obstacle to supporting PAQs is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single node implementations and have assumed that model training itself is a black box, thus limiting the effectiveness of such approaches on large-scale problems. In this work, we build upon these recent efforts and propose an integrated PAQ planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The result is TUPAQ, a component of the MLbase system, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.

SELECT vm.sender, vm.arrived, PREDICT(vm.text, vm.audio)
GIVEN LabeledVoiceMails
FROM VoiceMails vm
WHERE vm.user = 'Bob' AND vm.listened is NULL
ORDER BY vm.arrived DESC LIMIT 50

(a) Speech-to-text transcription.

SELECT p.image
FROM Pictures p
WHERE PREDICT(p.tag, p.photo) = 'Plant' GIVEN LabeledPhotos
AND p.likes > 500

(b) Photo classification.

Figure 1: Two examples of PAQs, with the predictive clauses highlighted in green. (1a) returns the predicted text transcription of Bob's voicemails from their audio content. (1b) finds popular pictures of plants based on an image classification model, even if the images are not labeled. Each of these use cases may require considerable training data.

1. INTRODUCTION

Rapidly growing data volumes coupled with the maturity of sophisticated statistical techniques have led to a new type of data-intensive workload: predictive analytics over large scale, distributed datasets. Indeed, the support of predictive analytics is an increasingly active area of database systems research, and several systems that integrate statistical query processing with a data management system have been developed. However, these systems force users to describe their statistical model in dense mathematical notation [28, 26] or in terms of a specific model [29, 53, 20], and they provide little guidance about the proper configuration of the model—that is, a user must know that a linear SVM or Kalman filter is a good statistical procedure to answer their query, and must configure that procedure appropriately.

In our work, our goal is to raise the level of abstraction for data analysts. Instead of choosing a specific statistical model and featurization strategy, we provide a declarative query interface where users declare that they wish to predict an attribute from some other collection of attributes and optionally provide example training data. Given these inputs, the system automatically makes predictions for the target attribute on new data.
With our system, users issue Predictive Analytic Queries, or PAQs, which are traditional database queries augmented with new predictive clauses. Two examples of PAQs are given in Figure 1, with the predictive clauses highlighted. The output of a predictive clause is an attribute like any other—one that can be grouped and sorted on or used in other clauses. The syntax of these predictive clauses is as follows:

PREDICT(a_predicted [, a_1, ..., a_n]) GIVEN R

where a_predicted is the attribute to be predicted, a_1, ..., a_n is an optional set of predictor attributes, and R is a relation containing training examples, with the restriction that {a_predicted, a_1, ..., a_n} − Attributes(R) = ∅. This syntax is general enough to support a wide range of predictive tasks—including classification, regression, and item recommendation.

Given recent advances in statistical methodology, supervised machine learning (ML) techniques are a natural way to support the predictive clauses in PAQs. In the supervised learning setting, a statistical model is created via training data to relate the input attributes to the desired output attribute. Furthermore, ML methods learn better models as the size of the training data increases, and recent advances in distributed ML algorithm development are aimed at enabling large-scale model training in the distributed setting [7, 40, 36].

Figure 2: Finding an appropriate predictive model is a process of continuous refinement: data flows through feature extractors into candidate model families, each yielding candidate predictive models, from which the best PAQ plan is selected. Each stage must be carefully tuned to ensure high quality. TUPAQ automates this process.

Unfortunately, the application of supervised learning techniques to a new input dataset is computationally demanding and technically challenging. For a non-expert, the process of carefully preprocessing the input attributes, selecting the appropriate ML model, and tuning its hyperparameters can be an ad-hoc and time-consuming task. For example, to build a predictive model for a classification task like the one shown in Figure 1b using conventional tools, a user needs to choose from one of many algorithms for extracting features from image data, then select an appropriate classification model—all the while tuning the configuration of each. Finally, the user will need to settle on a strategy to select the best performing model. Failure to follow these steps can lead to models that do not work at all or, worse, provide inaccurate predictions.

In practice, this process of training a supervised model is highly procedural, with even ML experts often having to fall back on standard recipes (e.g., the libSVM guide [30]) in the attempt to obtain reasonable results.
At scale, the problem of finding a good model is exacerbated, and conventional approaches can be prohibitively expensive. For example, sequential grid search is a popular method implemented in many software packages [5, 41, 22], but, as we show, it requires many more models to be trained than necessary to achieve good results.

At an abstract level, the process of finding a good machine learning model is in some ways analogous to a query planning problem, where the challenge is to construct a model from training data given a potential space of model families. In the PAQ setting (illustrated in Figure 2), some process must be followed to refine the choice of features and models, and in practice this problem is often iterative. The PAQ planning problem is the task of efficiently finding a high quality PAQ plan, given training data, a set of candidate statistical model families, and their configurations.

A good PAQ planner will return a high quality PAQ plan efficiently. The quality of a PAQ plan is measured in terms of a statistic relevant to the predictive task, such as accuracy on validation data in the classification setting or the Mean Squared Error (MSE) in the regression setting. Efficiency is measured in terms of the total time to arrive at such a model—we refer to this as learning time. In this work we focus on model search and hyperparameter tuning when the dimensionality of the search space is small and the training budget is also small. This scenario maps well to practical demands, given that there exist a limited set of learning algorithms and hyperparameters for a given predictive learning task, and the cost of training a model may be high, particularly in the large-scale setting. In such scenarios, PAQ planning can lead to substantial improvements in model quality. We further restrict our focus to model family selection and hyperparameter tuning, as opposed to also considering selection of appropriate featurizers. However, we believe the techniques presented here can be generalized to more complicated predictive analytic pipelines, since the number of established featurization techniques for a given domain is also limited.

In the remainder of the paper, we explore the challenges associated with PAQ planning and introduce a PAQ planner for the MLbase system [33] for declarative machine learning, called the Training-supported Predictive Analytic Query planner (TUPAQ). The goal of TUPAQ in the context of MLbase is to tackle the PAQ planning problem at scale. Using advanced search techniques, bandit resource allocation, and batching optimizations, TUPAQ identifies a suitable model to satisfy a user's high-level declarative query, thus differing from previous systems [29, 20, 49, 53] focused on predictive analytics, which force users to perform model search themselves. Further, although the ML community has developed techniques for finding good configurations for learning models, none have focused on applying these techniques in the large-scale setting.

With TUPAQ, we make the following contributions:

• We introduce PAQs, a declarative query interface that enables analysts to operate on imputed attribute values.

• We demonstrate the effectiveness of supervised ML techniques at supporting PAQs, both in terms of high accuracy and efficiency, especially when compared with basic approaches.

• We describe the TUPAQ algorithm for PAQ planning, which combines logical optimization via model search with physical optimization via batching and bandit resource allocation via runtime introspection.

• We describe an implementation of the TUPAQ algorithm in Apache Spark, building on our earlier work on the MLbase architecture [33].

• We evaluate several points in the design space with respect to each of our logical and physical optimizations, and demonstrate that proper selection of each can dramatically improve both accuracy and efficiency.

• We present experimental results on large, distributed datasets up to terabytes in size, demonstrating that TUPAQ converges to high quality PAQ plans an order of magnitude faster than a simple PAQ planning strategy.

The remainder of this paper is organized as follows. Section 2 formally defines the PAQ planning problem, explains its connection to traditional database query optimization research, introduces a standard baseline approach for the problem, and provides a high-level overview of TUPAQ. We next present details about TUPAQ's three main components in Section 3, highlighting the design decisions for each of these components. Section 4 subsequently presents an empirical study of this design space and a comparison with traditional methods for solving this problem. We then present results on a large-scale evaluation of TUPAQ in Section 5, with each of TUPAQ's three components tuned based on the results of the previous section. In Section 6 we explore the relationship between PAQ planning and existing work related to supporting predictive and analytical workloads. We conclude with Section 7, which summarizes our work and discusses future extensions to TUPAQ.
2. PAQ PLANNING AND TUPAQ

In this section, we define the PAQ planning problem in more detail and describe its relationship to traditional query optimization. Then, we discuss two approaches to PAQ planning. The first, which we call the baseline approach, is inspired by common practice. The second approach, TUPAQ, allows us to take advantage of logical and physical optimizations in the planning process. TUPAQ has a rich design space, which we describe in further detail in Section 3. Finally, we describe how TUPAQ fits into the broader MLbase architecture.

2.1 Defining PAQ Planning

Figure 1 shows several example PAQs that, in practice, need extremely large training datasets—with millions of examples, each with hundreds of thousands of features—to return accurate results. Other PAQs that need large training sets include problems in image classification, speech-to-text translation, and web-scale text mining. As defined in Section 1, PAQs can be any query where an attribute or predicate value must be imputed to complete the query. In this work we concern ourselves specifically with PAQs that can be answered based on user-supplied labeled training data, typically of the same format as the data for which values are to be imputed. We focus specifically on the components of the system that are necessary to efficiently support clauses of the form shown in Section 1. While the strategies discussed here can operate in situations where queries have joined relations or complex aggregates, we expect that future work will explore optimizations specific to these situations.

The PAQ planner's job is to find a PAQ plan that maximizes some measure of quality (e.g., in terms of goodness of fit to held-out data) in a short amount of time, where learning time is constrained by some budget in terms of the number of models considered, total execution time, or the number of scans over the training data. The planner thus takes as input a training dataset, a description of a space of models to search, and some budget or stopping criterion. The description of the space of models to search includes the set of model families to search over (e.g., SVM, decision tree, etc.) and reasonable ranges for their associated hyperparameters (e.g., the regularization parameter for a regularized linear model or the maximum depth of a decision tree). The output of a PAQ planner is a plan that can be applied to unlabeled data points to obtain a prediction for the desired attribute. In the context of TUPAQ, this plan is a statistical model that can be applied to unseen data.
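To make these inputs and outputs concrete, the following Scala sketch (Scala being the implementation language of our system) shows what such a planner interface might look like. All names here—LabeledPoint, ModelSpace, PAQPlan, and the planner trait itself—are illustrative stand-ins for the purposes of this sketch, not the literal TUPAQ API.

// Hypothetical types sketching the planner's contract.
case class LabeledPoint(label: Double, features: Array[Double])

// A model family (e.g., SVM) plus ranges for its hyperparameters.
case class HyperparameterRange(name: String, min: Double, max: Double, logScale: Boolean)
case class ModelFamily(name: String, hyperparameters: Seq[HyperparameterRange])
case class ModelSpace(families: Seq[ModelFamily])

// A trained plan: a function from unlabeled features to a prediction.
trait PAQPlan {
  def predict(features: Array[Double]): Double
}

// The planner: training data, a search space, and a budget in; a plan out.
trait PAQPlanner {
  def plan(data: Seq[LabeledPoint], space: ModelSpace, budget: Int): PAQPlan
}

The budget here is expressed as a number of models considered; as noted above, it could equally be a wall-clock limit or a cap on scans over the training data.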
In this work, we operate in a scenario where individual models are of dimensionality d, where d is less than the total number of example data points N. Note that d can nonetheless be quite large, e.g., d = 200,000 in our large-scale speech experiments and d = 160,000 in our large-scale image experiments (see Section 5). Recall that in this paper we focus on classification, and consider a small number of model families, f ∈ F, each with several hyperparameters, λ ∈ Λ. These assumptions map well to reality, as there are a handful of general-purpose classification methods that are deployed in practice. Further, we expect that these techniques will naturally apply to other supervised learning tasks—such as regression and collaborative filtering—which may only differ in terms of their definition of plan quality. We evaluate the quality of each plan by computing accuracy on held-out datasets, and we measure learning time as the amount of time required to explore a fixed number of models from some model space. In our large-scale distributed experiments (see Section 5) we report parallel runtimes.

Additionally, in this paper we focus on model families that are trained via multiple sequential scans of the training data. In particular, we focus on three model families: linear Support Vector Machines (SVMs), logistic regression trained via gradient descent, and nonlinear SVMs using random features [43] trained via block coordinate descent. This iterative sequential access pattern encompasses a wide range of learning algorithms, especially in the large-scale distributed setting. For instance, efficient distributed implementations of linear regression [25], tree based models [40], Naive Bayes classifiers [25], and k-means clustering [25] all follow this same access pattern.

2.2 Connections to Query Optimization

Given that PAQ planning is the automation of a declaratively specified task, it is natural to draw connections to decades worth of relational query optimization research when tackling the PAQ planning problem. Traditional database systems invest in the costly process of query planning to determine a good execution plan that can be reused repeatedly upon subsequent execution of similar queries. While query planning for a PAQ involves the costly process of identifying a high quality predictive model, this cost is offset by the subsequent ability to perform near real-time PAQ evaluation. Additionally, both types of query planning can be viewed as search problems, with traditional query planning searching over the space of join orderings and access methods, and PAQ planning searching over the space of machine learning models.

There are some notable differences between these two problems, however, leading to a novel set of challenges to address in the context of PAQ planning. First, unlike traditional database queries, PAQs do not have unique answers, due to the inherent uncertainty in predictive models learned from finite datasets. Hence, PAQ planning focuses on both quality and efficiency (compared to just efficiency for traditional query planning), and needs to balance between these goals when they conflict. Second, the search space for PAQs is not endowed with well-defined algebraic properties, as it consists of possibly unrelated model families and feature extractors, each with its own access patterns and hyperparameters. Third, evaluating a candidate query plan is expensive and in this context involves learning the parameters of a statistical model. Learning the parameters of a single model can involve upwards of hundreds of passes over the input data, and there exist few heuristics to estimate the effectiveness of a model before this costly training process.

Now, we turn our attention to algorithms for PAQ planning.

2.3 Baseline PAQ Planning

The conventional approach to PAQ planning is sequential grid search [5, 41, 22]. For instance, consider the tag prediction example in Figure 1b, in which the PAQ is processed via an underlying classification model trained on LabeledPhotos. Moreover, consider a single ML model family for binary classification—logistic regression—which has two hyperparameters: learning rate and regularization. Sequential grid search divides the hyperparameter space into a grid and iteratively trains models at these grid points.
Grid search has several shortcomings. First, the results of previous iterations in the sequence of grid points are not used to inform future iterations of search. Second, the curse of dimensionality limits the usefulness of this method in high dimensional hyperparameter spaces. Third, grid points may not represent a good approximation of global minima—true global minima may be hidden between grid points, particularly in the case of a very coarse grid. Nonetheless, sequential grid search is commonly used in practice and is a natural baseline for PAQ planners.

In Algorithm 1, we show the logic encapsulated in such a baseline PAQ planner. In this example, the budget is the total number of models to train.

input : LabeledData, ModelSpace, Budget
output: BestModel
1  bestModel ← ∅;
2  grid ← gridPoints(ModelSpace, Budget);
3  while Budget > 0 do
4    proposal ← nextPoint(grid);
5    model ← train(proposal, LabeledData);
6    if quality(model) > quality(bestModel) then
7      bestModel ← model;
8    end
9    Budget ← Budget − 1;
10 end
11 return bestModel;

Algorithm 1: A baseline PAQ planning procedure with conventional grid search. The function "gridPoints" returns a coarse grid over the dimensions of model space, where the total number of grid points is determined by the budget.
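As a concrete illustration of Algorithm 1, the Scala sketch below enumerates a two-dimensional grid over learning rate and regularization for a single model family and trains a model at each point. The train function and the hyperparameter ranges are stand-ins assumed for this sketch (the ranges mirror those used in Section 4.1); none of this is the system's actual code.

// A minimal sequential grid search over two hyperparameters.
// Assumes budget >= 1 and that `train` returns a model with its
// validation quality already computed.
case class Config(learningRate: Double, regularization: Double)
case class Model(config: Config, quality: Double)

def gridSearch(train: Config => Model, budget: Int): Model = {
  // Lay out a roughly square grid in log space, sized to the budget.
  val perAxis = math.max(1, math.sqrt(budget.toDouble).toInt)
  def logSpace(lo: Double, hi: Double): Seq[Double] =
    (0 until perAxis).map { i =>
      val t = if (perAxis == 1) 0.0 else i.toDouble / (perAxis - 1)
      math.pow(10, math.log10(lo) + t * (math.log10(hi) - math.log10(lo)))
    }
  val grid = for {
    lr  <- logSpace(1e-3, 1e1)
    reg <- logSpace(1e-4, 1e2)
  } yield Config(lr, reg)

  // Train sequentially at each grid point; keep the best model seen.
  grid.take(budget).map(train).maxBy(_.quality)
}

Note how the grid is fixed up front: nothing learned from early models influences which configurations are tried later, which is precisely the first shortcoming identified above.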
2.4 TUPAQ Planning

As discussed in the previous section, grid search is a suboptimal search method despite its popularity. Moreover, from a systems perspective, the algorithm illustrated in Algorithm 1 has additional drawbacks beyond those of grid search. In particular, this procedure performs sequential model training and also treats the training of each model as a black-box procedure.

In contrast, we propose the TUPAQ algorithm, described in Algorithm 2, to address all three of these shortcomings via logical and physical optimizations and more sophisticated search strategies. First, Line 7 shows that our model search procedure can now use training history as input; here, "proposeModels" can be an arbitrary model search procedure. Second, our algorithm performs batching to train multiple models simultaneously (Line 8). Third, our algorithm deploys bandit resource allocation via runtime inspection to make on-the-fly decisions. Specifically, the algorithm compares the quality of the models currently being trained with historical information about the training process, and determines which of the current models should be trained further (Line 10).

input : LabeledData, ModelSpace, Budget, PartialIters, BatchSize
output: BestModel
1  bestModel ← ∅;
2  history ← [];
3  proposals ← [];
4  freeSlots ← batchSize;
5  while Budget > 0 do
6    freeSlots ← batchSize − length(proposals);
7    proposals ← proposals + proposeModels(freeSlots, ModelSpace, history);  // Model Search
8    models ← trainPartial(proposals, LabeledData, PartialIters);  // Batching
9    Budget ← Budget − len(models) ∗ PartialIters;
10   (finishedModels, history, proposals) ← banditAllocation(models, history);  // Bandits
11   for m in finishedModels do
12     if quality(m) > quality(bestModel) then
13       bestModel ← m;
14     end
15   end
16 end
17 return (bestModel);

Algorithm 2: The planning procedure used by TUPAQ.

These three optimizations are discussed in detail in Section 3, with a focus on the design space for each of them. In Section 4 we then evaluate the options in this design space experimentally, and in Section 5 we compare the baseline algorithm (Algorithm 1) to TUPAQ running with good choices for search method, batch size, and bandit allocation criterion, i.e., choices informed by the results of Section 4.
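To make Algorithm 2's control flow explicit, the skeleton below renders it as Scala. The helpers proposeModels, trainPartial, and banditAllocation are placeholders for the three components developed in Section 3 (search, batching, bandits); their signatures here are illustrative assumptions, not the system's actual interfaces.

// Sketch of the TuPAQ planning loop (Algorithm 2). `PartiallyTrained`
// stands in for a model mid-training together with its current quality.
// Assumes proposeModels always fills the free slots it is given, so the
// budget decreases on every pass.
case class PartiallyTrained(id: Int, itersDone: Int, quality: Double, finished: Boolean)

def tupaqPlan(budget0: Int, batchSize: Int, partialIters: Int,
              proposeModels: (Int, Seq[PartiallyTrained]) => Seq[PartiallyTrained],
              trainPartial: (Seq[PartiallyTrained], Int) => Seq[PartiallyTrained],
              banditAllocation: (Seq[PartiallyTrained], Seq[PartiallyTrained]) =>
                (Seq[PartiallyTrained], Seq[PartiallyTrained], Seq[PartiallyTrained]))
    : Option[PartiallyTrained] = {
  var budget = budget0
  var history = Seq.empty[PartiallyTrained]
  var proposals = Seq.empty[PartiallyTrained]
  var best: Option[PartiallyTrained] = None
  while (budget > 0) {
    val freeSlots = batchSize - proposals.length
    proposals ++= proposeModels(freeSlots, history)            // model search
    val models = trainPartial(proposals, partialIters)         // batched training
    budget -= models.length * partialIters
    val (finished, hist, live) = banditAllocation(models, history) // bandits
    history = hist; proposals = live
    for (m <- finished if best.forall(_.quality < m.quality)) best = Some(m)
  }
  best
}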
Before exploring the design space for the TUPAQ algorithm, we first describe how TUPAQ fits into a larger system to support PAQs.

Figure 3: The TUPAQ planner is a critical component of the MLbase [33] system for simplified large scale machine learning. TUPAQ interacts with a distributed runtime and existing machine learning algorithms to efficiently find a PAQ plan which yields high quality predictions.

2.5 TUPAQ and the MLbase Architecture

TUPAQ lies at the heart of MLbase [33], a novel system designed to simplify the process of implementing and using scalable machine learning techniques. By giving users a declarative interface for machine learning tasks, the problem of hyperparameter tuning and feature selection can be pushed down into the system. The architecture of this system is shown in Figure 3.

At the center of the system, some optimizer or planner must be able to quickly identify a suitable model for supporting predictive queries. We note that the system described in this paper introduces some key architectural differences compared with the original MLbase architecture in [33]. In particular, we make the concept of a "PAQ planner" explicit, and we introduce a catalog for PAQ plans. When a new PAQ arrives, it is passed to the planner, which determines whether a new PAQ plan needs to be created. The entire system is built upon Apache Spark, a cluster computing system designed for iterative computing [52], and we leverage MLlib and other components present in Apache Spark, as well as MLI [46].

3. TUPAQ DESIGN CHOICES

In this section, we examine the design choices available to the TUPAQ planner. As stated previously, we are targeting algorithms that run on tens to thousands of nodes in commodity computing clusters, and training datasets that fit comfortably into cluster memory—on the order of tens of gigabytes to terabytes. Training of a single model to convergence on such a cluster is expected to require tens to hundreds of passes through the training data, and may take on the order of minutes. Moreover, with a multi-terabyte dataset, performing a sequential grid search involving even just 100 model configurations, each with a budget of 100 scans of the training data, could take hours to days of processing time, even assuming that the algorithm runs at memory speed. Hence, in this regime the baseline PAQ planner is tremendously costly.

Given the design choices presented in Section 2, we ask how these design choices might be optimized to provide fast, high quality PAQ planning. In the remainder of this section we present the following optimizations—advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching—that in concert provide TUPAQ with an order-of-magnitude gain in performance.

3.1 Better Model Search

We call the problem of finding the highest quality model from a space of model families and their hyperparameters the model search problem, and the solution to this problem is of central importance to TUPAQ. We view model search as an optimization problem over a potentially non-smooth, non-convex function in high dimensional space, where it is expensive to evaluate the function and for which we have no closed form expression for the function to be optimized (and hence cannot compute derivatives). Although grid search remains the standard solution to this problem, various alternatives have been proposed for the general problem of derivative-free optimization, some of which are particularly tailored for the model search problem. Each of these methods provides an opportunity to speed up TUPAQ's planning time, and in this section we provide a brief survey of the most commonly used methods. In Section 4 we evaluate each method on several datasets to determine which method is most suitable for PAQ planning.

Traditional methods for derivative-free optimization include grid search (the baseline choice for a PAQ planner) as well as random search, Powell's method [42], and the Nelder-Mead method [39]. Given a hyperparameter space, grid search selects evenly spaced points (in linear or log space) from this space, while random search samples points uniformly at random from this space. Powell's method can be seen as a derivative-free analog to coordinate descent, while the Nelder-Mead method can be roughly interpreted as a derivative-free analog to gradient descent.

Both Powell's method and the Nelder-Mead method expect unconstrained search spaces, but function evaluations can be modified to severely penalize exploring out of the search space. However, both methods require some degree of smoothness in the hyperparameter space to work well, and both can easily get stuck in local minima. Additionally, neither method lends itself well to categorical hyperparameters, since the function space is modeled as continuous. For these reasons, we are unsurprised that they are inappropriate methods to use in the model search problem, where optimization is done over an unknown function that is likely non-smooth and not convex.

More recently, various methods specifically for model search have been introduced in the ML community, including Tree-based Parzen Estimators (HyperOpt) [13], Sequential Model-based Algorithm Configuration (Auto-WEKA) [47], and Gaussian Process based methods, e.g., Spearmint [45]. These algorithms all share the property that they can search over spaces which are nested (e.g., multiple model families) and accept categorical hyperparameters (e.g., regularization method). HyperOpt begins with a random search and then probabilistically samples from points with more promising minima, Auto-WEKA builds a Random Forest model from observed hyperparameter results, and Spearmint implements a Bayesian method based on Gaussian Processes.
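For example, random search over the kind of hyperparameter space used in our experiments (Section 4.1) amounts to drawing each numeric hyperparameter log-uniformly from its range and each categorical hyperparameter uniformly from its values. The sketch below shows one such proposal function; the particular ranges, family names, and the Proposal type are illustrative only.

import scala.util.Random

// One random-search proposal: numeric hyperparameters are drawn
// log-uniformly, categorical ones uniformly. Ranges are examples.
case class Proposal(family: String, learningRate: Double, regularization: Double)

def logUniform(lo: Double, hi: Double, rng: Random): Double =
  math.pow(10, math.log10(lo) + rng.nextDouble() * (math.log10(hi) - math.log10(lo)))

def proposeRandom(rng: Random): Proposal = {
  val families = Seq("svm", "logistic-regression")
  Proposal(
    family = families(rng.nextInt(families.length)), // categorical: uniform
    learningRate = logUniform(1e-3, 1e1, rng),       // numeric: log-uniform
    regularization = logUniform(1e-4, 1e2, rng)
  )
}

Unlike grid search, each proposal is independent of the budget, which is one reason random search composes naturally with the iterative proposal loop of Algorithm 2.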
3.2 Bandit Resource Allocation

Models are not all created equal. In the context of model search, typically only a fraction of the models are of high quality, with many of the remaining models performing drastically worse. Under certain assumptions, allocating resources among different model configurations can be naturally framed as a multi-armed bandit problem [16]. Indeed, assume we are given a fixed set of k model configurations to evaluate, as in the case of grid or random search, along with a fixed budget B. Then each model can be viewed as an "arm" and the model search problem can be cast as a k-armed bandit problem with T rounds. At each round we perform a single iteration of a particular model configuration, and return a reward indicating the quality of the updated model, e.g., validation accuracy. In such settings, multi-armed bandit algorithms can be used to determine a scheduling policy to efficiently allocate resources across the k model configurations. Typically, these algorithms keep a running score for each of the k arms, and at each iteration choose an arm as a function of the current scores.

Our setting differs from this standard setting in two crucial ways. First, several of our search algorithms select model configurations to evaluate in an iterative fashion, so we do not have advance access to a fixed set of k model configurations. Second, in addition to efficiently allocating resources, we aim to return a reasonable result to a user as quickly as possible, and hence there is a benefit to finishing the training of promising model configurations once they have been identified.

Our bandit selection strategy is a variant of the action elimination algorithm of [23], and to our knowledge this is the first time this algorithm has been applied to hyperparameter tuning. Our strategy is detailed in Algorithm 3. This strategy preemptively prunes models that fail to show promise of converging. For each model (or batch of models), we first allocate a fixed number of iterations for training; in Algorithm 2 the trainPartial() function trains each model for PartialIters iterations. Partially trained models are fed into the bandit allocation algorithm, which determines whether to train each model to completion by comparing the quality of these models to the quality of the best model that has been trained to date. Moreover, this comparison is performed using a slack factor of (1 + ε); in our experiments we set ε = 0.5 and thus continue to train all models with quality within 50% of the best quality model observed so far. The algorithm stops allocating further resources to models that fail this test, as well as to models that have already been trained to completion.

input : Models, History
output: FinishedModels, History, Proposals
1  Proposals ← [];
2  FinishedModels ← [];
3  bestModel ← getBestFromHistory(History);
4  for m in Models do
5    History.append(m);
6    if fullyTrained(m) then
7      FinishedModels.append(m);
8    else if quality(m) ∗ (1 + ε) > quality(bestModel) then
9      Proposals.append(m);
10   end
11 end
12 return (FinishedModels, History, Proposals);

Algorithm 3: The bandit allocation strategy used by TUPAQ.
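The heart of Algorithm 3 is a single comparison: a partially trained model survives only if its current quality, inflated by the slack factor (1 + ε), still beats the best model seen so far. The Scala rendering below is a compact sketch of that rule, with illustrative types, assuming higher quality scores are better and that the batch is non-empty.

// Action-elimination step: partition a batch of partially trained
// models into finished and still-promising; everything else is
// preemptively terminated.
case class Candidate(quality: Double, fullyTrained: Boolean)

def banditFilter(batch: Seq[Candidate], history: Seq[Candidate],
                 epsilon: Double = 0.5): (Seq[Candidate], Seq[Candidate]) = {
  val best = (history ++ batch).map(_.quality).max
  val (finished, inFlight) = batch.partition(_.fullyTrained)
  // Keep training only models within the (1 + epsilon) slack of the best.
  val promising = inFlight.filter(m => m.quality * (1 + epsilon) > best)
  (finished, promising)
}

With ε = 0.5, a model whose current quality is more than a third below the incumbent best (i.e., 1.5× its quality still loses) is dropped without further training.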
3.3 Batching

Batching is a natural system optimization in the context of training machine learning models, with applications in cross validation and ensembling [34, 17]. For PAQ planning, we note that the access pattern over the training set is identical for many machine learning algorithms. Specifically, each algorithm takes multiple passes over the input data and updates some intermediate state (e.g., model weights) during each pass. As a result, it is possible to batch together the training of multiple models, effectively sharing scans across multiple model estimations. In a data parallel distributed environment, this has several advantages:

1. Better CPU utilization by reducing wasted cycles.

2. Amortized task launching overhead across several models at once.

3. Amortized network latency across several models at once.

Ultimately, these three advantages lead to a significant reduction in learning time. We take advantage of this optimization in Line 8 of Algorithm 2.

For concreteness and simplicity, we will focus on one algorithm—logistic regression trained via gradient descent—for the remainder of this section, but we note that these techniques apply to many model families and learning algorithms.

3.3.1 Logistic Regression

Logistic regression is a widely used machine learning model for binary classification. The procedure estimates a set of model parameters, w ∈ R^d, given a set of data features X ∈ R^{n×d} and binary labels y ∈ {0,1}^n. The optimal model w* ∈ R^d can be found by minimizing the negative likelihood function, f(w) = −log p(X|w). Taking the gradient of the negative log likelihood, we have:

\nabla f = \sum_{i=1}^{n} \left( \sigma(w^\top x_i) - y_i \right) x_i,    (1)

where σ is the logistic function. The gradient descent algorithm (Algorithm 4) must evaluate this gradient function for all input data points, a task which can be easily performed in a data parallel fashion. Similarly, minibatch Stochastic Gradient Descent (SGD) has an identical access pattern and can be optimized in the same way by working with contiguous subsets of the input data on each partition.

input : X, LearningRate, MaxIterations
output: Model
1  i ← 0;
2  Initialize Model;
3  while i < MaxIterations do
4    Model ← Model − LearningRate ∗ Gradient(Model, X);
5    i ← i + 1;
6  end

Algorithm 4: Pseudocode for convex optimization via gradient descent.

The above formulation represents the computation of the gradient by taking a single point and a single model at a time. We can naturally extend this to multiple models simultaneously if we represent our models as a matrix W ∈ R^{d×k}, where k is the number of models we want to train simultaneously, i.e.,

\nabla f = X^\top \left( \sigma(XW) - y \right).    (2)

This operation can be easily parallelized across data items, with each worker in a distributed system computing the portion of the gradient for the data that it stores locally. Specifically, the portion of the gradient that is derived from the set of local data is computed independently at each machine, and these gradients are simply summed at the end of an iteration. The size of the partial gradients (in this case O(d×k)) is much smaller than the actual data (which is O(n×d)), so the overhead of transferring these over the network is relatively small. For large datasets, the time spent performing this operation is almost completely determined by the cost of performing two matrix multiplications: the input to the σ function, which takes O(ndk) operations and requires a scan of the input data, as well as the final multiplication by X^T, which also takes O(ndk) operations and requires a scan of the data. This formulation allows us to leverage high performance linear algebra libraries that implement BLAS [35]—these libraries are tailored to execute exactly these dense linear algebra operations as efficiently as possible and are automatically tuned to the architecture we are running on via [50].

The careful reader will note that if individual data points are of sufficiently low dimension, the gradient function in Equation 1 can be executed in a single pass over the data from main memory, because the second reference to x_i will likely be a cache hit, whereas we assume that X is big enough that it is unlikely to fit entirely in CPU cache. We examine this effect more carefully in Section 5.
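To illustrate Equation 2 concretely, here is a minimal local sketch of the batched gradient computation using Breeze, a BLAS-backed Scala linear algebra library. In the distributed setting each worker would run this over its own partition of X and y, and the resulting d × k partial gradients would be summed at the master. The function and its setup are illustrative, not the system's actual code.

import breeze.linalg.{DenseMatrix, DenseVector, *}
import breeze.numerics.sigmoid

// Batched logistic-regression gradient (Equation 2) over one data
// partition. X is the local n x d feature block, y the local labels,
// and W a d x k matrix holding k models side by side.
def batchedGradient(X: DenseMatrix[Double],
                    y: DenseVector[Double],
                    W: DenseMatrix[Double]): DenseMatrix[Double] = {
  val P = sigmoid(X * W)   // n x k: one scan of X feeds all k models
  val E = P(::, *) - y     // broadcast: subtract labels from every column
  X.t * E                  // d x k partial gradient, O(ndk) FLOPs
}

Both multiplications (X * W and X.t * E) dispatch to BLAS GEMM, which is exactly what lets batching exploit the hardware as discussed next.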
3.3.2 Machine Balance

One obvious question the reader may ask is why implementing these algorithms via matrix multiplication should offer a speedup over vector/vector versions of the algorithms. After all, the runtime complexities of both algorithms are identical. However, modern x86 machines have been shown to have processor cores that significantly outperform their ability to read data from main memory [38]. In particular, on a typical x86 machine, the hardware is capable of reading 0.45B doubles/s from main memory per core, while the hardware is capable of executing 6.8B FLOPS in the same amount of time [37]. Specifically, on the machines we tested (Amazon c3.8xlarge EC2 instances), LINPACK reported a peak of 110 GFLOPS when running on all cores, while the STREAM benchmark reported 60GB/s of memory throughput across 16 physical cores. This equates to a machine balance of approximately 15 FLOPs per double precision floating point number read from main memory, if the machine is using both all available FLOPs and all available memory bandwidth solely for its core computation. This approximate value for the machine balance suggests an opportunity for optimization by reducing unused resources, i.e., wasted cycles. By performing more computation for every number read from memory, we can reduce this resource gap.

The Roofline model [51] offers a more formal way to study this effect. According to the model, the total throughput of an algorithm is bounded by the smaller of 1) peak floating point performance of the machine, and 2) memory bandwidth times the operational intensity of the algorithm, where operational intensity is a function of the number of FLOPs performed per byte read from memory. That is, for an efficiently implemented algorithm, the bottleneck is either I/O bandwidth from memory or CPU FLOPs.

Analysis of the unbatched gradient descent algorithm reveals that its operational intensity is quite small—just over 2 FLOPs per number read from memory, a multiply and an add—and since we represent our data as double-precision floating point numbers, this equates to roughly 1/4 FLOP per byte. Batching allows us to move "up the roofline" by increasing algorithmic complexity by a factor of k, our batch size. The exact setting of k that achieves balance (and maximizes throughput) is hardware dependent, but we show in Section 5 that on modern machines k = 10 is a reasonable choice.
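The following back-of-the-envelope calculation, using the measured figures above, shows where the balance point sits. The 2k FLOPs-per-number figure for batch size k is our reading of Equation 2 (each element of X participates in one multiply-add per model in each of the two multiplications' scans); numbers are rounded.

\text{machine balance} \approx \frac{110\ \text{GFLOPS}}{(60\ \text{GB/s}) / (8\ \text{bytes/double})}
 = \frac{110 \times 10^9\ \text{FLOPs/s}}{7.5 \times 10^9\ \text{doubles/s}} \approx 15\ \text{FLOPs/double}

\text{operational intensity at batch size } k \approx 2k\ \text{FLOPs/double}
 \quad\Longrightarrow\quad 2k \gtrsim 15 \iff k \gtrsim 7.5

So a batch size around k = 10 is the first comfortable point past which the computation becomes compute-bound rather than memory-bound, which is consistent with the empirical results reported in Section 5.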
3.3.3 Amortized Overheads

In the context of a distributed machine learning system like MLbase, which runs on Apache Spark, delays due to task scheduling and serialization/deserialization can be significant relative to the time spent computing. By batching our updates into tasks that require more computation, we are able to reduce the aggregate overhead of launching new tasks substantially. Assuming a fixed scheduling delay of 200ms per task, if we have a batch size of k = 10 models, the average task overhead per model iteration drops to 20ms. Over the course of hundreds of iterations for hundreds of models, the savings can be substantial.

For a typical distributed system, a model update requires at least two network round-trips to complete: one to launch a task on each worker node, and one to report the results of the task back to the master. If we were indeed bound by network messaging, then amortizing this cost across multiple models could substantially reduce the total overhead due to network latency. In our experiments, however, the number of messages is relatively small and the network overheads are substantially lower than the scheduling and computation overheads, so future work in this setting should focus on minimizing scheduling overheads.

4. DESIGN SPACE EVALUATION

Now that we have laid out the possible optimizations available to TUPAQ, we identify the available speedups from these optimizations. We validated the strategies for model search and bandit resource allocation on five representative datasets across a variable sized model fitting budget. In particular, these results motivated which model search strategies to incorporate into TUPAQ and validate our bandit approach. Next, we tested our batching optimizations on datasets of various sizes in a cluster environment, to better understand the impact of batching as models and datasets get bigger. In all experiments, before training, we split our base datasets into 70% training, 20% validation, and 10% testing. In all cases, models are fit to minimize classification error on the training set, while model search occurs based on classification error on the validation set (validation error).¹ We only report validation error numbers here, but test error was similar. TUPAQ is capable of optimizing for arbitrary performance metrics as long as they can be computed mid-flight, and it extends to other supervised learning scenarios.

¹ While we have thus far discussed model quality, for the remainder of the paper we report validation error, i.e., the inverse of quality, because it is more commonly reported in practice.

4.1 Model Search

We validated our model search strategies on a series of small datasets with well-formed binary classification problems embedded in them. These datasets come from the UCI Machine Learning Repository [10]. The model search task involved tuning four hyperparameters: learning rate, L2 regularization parameter, size of a random projection matrix, and noise associated with the random feature matrix. The random features are constructed according to the procedure outlined in [43]. To accommodate the linear scale-up that comes with adding random features, we downsample the number of data points for each model training by the same proportion.

Our ranges for these hyperparameters were learning rate ∈ (10^−3, 10^1), regularization ∈ (10^−4, 10^2), projection size ∈ (1×d, 10×d), and noise ∈ (10^−4, 10^2).

We evaluated seven search methods: grid search, random search, Powell's method, the Nelder-Mead method, Auto-WEKA, HyperOpt, and Spearmint.

Each dataset was processed with each search method with a varying number of function calls, chosen to align well with a regular grid of n^4 points, where we vary n from 2 to 5. This restriction to a regular grid is only necessary for grid search but is included for comparability.

Results of the search experiments are presented in Figure 4. Each tile represents a different dataset/search method combination. Each bar within a tile represents a different budget in terms of function calls/models trained. The height of each bar represents classification error on the validation dataset.

Figure 4: Search methods (GRID, RANDOM, NELDER_MEAD, POWELL, SPEARMINT, AUTOWEKA, HYPEROPT) were compared across several datasets (australian, breast, diabetes, fourclass, splice) with a variable number of function evaluations (budgets of 16, 81, 256, and 625). Classification error on a validation dataset is shown for each combination. HyperOpt and Auto-WEKA provide state of the art results, while random search performs best of the classic methods.

With this experiment, we are looking for methods that converge to good models with as small a budget as possible. Of all the methods tried, HyperOpt and Auto-WEKA tend to achieve this criterion best, but random search is not far behind. We chose to integrate HyperOpt into the larger experiments because it performed slightly better than Auto-WEKA. Our architecture fully supports additional search methods, and we expect to implement additional methods in our system over time.

4.2 Bandit Resource Allocation

We evaluated the TUPAQ bandit resource allocation scheme on the same datasets, with random search and 625 total function evaluations—the same as the maximum budget in the search experiments. The key question to answer here was whether we could identify and terminate poorly performing models early in the training process without significantly affecting overall search quality.
In Figure 5 we illustrate the effect that the TUPAQ bandit strategy has on validation error as well as on the number of total scans of the input dataset. Models were allocated 100 iterations to converge on the correct answer. After the first 10 iterations, models that were not within 50% of the classification error of the best model trained so far were preemptively terminated. A large percentage of models that show little or no promise of converging to a reasonable validation error are eliminated.

In the figure, the top set of bars represents the number of scans of the training data at the end of the entire search process, and the bottom set of bars represents the validation error achieved at the end of the search procedure. The three scenarios evaluated—No Bandit, Bandit, and Baseline—represent the results of the search with no bandit allocation procedure (that is, with each model trained to completion), with the bandit allocation procedure described above, and with an idealized baseline in which only the best model is trained to completion. The bandit strategy dramatically reduces the total number of passes over the training data—on average 83% fewer than training every model to completion—while returning models whose validation error is nearly indistinguishable from those found by the no-bandit strategy.

Figure 5: Here we show the effects of bandit resource allocation on trained model performance across the five datasets (australian, breast, diabetes, fourclass, splice). Model search completes in an average of 83% fewer passes over the training data than without bandit allocation. Except in one case, validation error is nearly indistinguishable vs. the case where we do not employ the bandit strategy.

4.3 Batching

To evaluate the batching optimization, we used a synthetic dataset of 1,000,000 data points in various dimensionalities. To illustrate the effects of amortizing scheduler overheads vs. achieving machine balance, these datasets vary in size between 750MB and 75GB. We trained these models on a 16-node cluster of c3.8xlarge nodes on Amazon EC2, running Apache Spark 1.1.0. We trained a logistic regression model on these data points via gradient descent with no batching (batch size = 1) and with batching of up to 20 models at once. We implemented both a naive version of this optimization—with while loops evaluating Equation 1 over each model in each task—as well as a more sophisticated version, which makes BLAS calls to perform the computation described in Equation 2. For the batching experiments only, we run each algorithm for 10 iterations over the input data.
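The two implementations differ only in how they evaluate the k gradients on a partition: the naive variant loops over models and data points, computing k separate vector products per example (so X is effectively scanned k times), while the optimized variant issues the single pair of matrix multiplications from Equation 2. The Breeze sketch below contrasts the two; it is illustrative, not the benchmarked code.

import breeze.linalg.{DenseMatrix, DenseVector, *}
import breeze.numerics.sigmoid

// Naive batching: k independent vector-oriented gradient computations.
def naiveGradients(X: DenseMatrix[Double], y: DenseVector[Double],
                   models: Seq[DenseVector[Double]]): Seq[DenseVector[Double]] =
  models.map { w =>
    val grad = DenseVector.zeros[Double](X.cols)
    var i = 0
    while (i < X.rows) {
      val xi = X(i, ::).t                       // i-th example
      grad += xi * (sigmoid(w dot xi) - y(i))   // one term of Equation 1
      i += 1
    }
    grad
  }

// Matrix-oriented batching: one GEMM pair covers all k models (Equation 2).
def blasGradients(X: DenseMatrix[Double], y: DenseVector[Double],
                  W: DenseMatrix[Double]): DenseMatrix[Double] = {
  val P = sigmoid(X * W)
  X.t * (P(::, *) - y)
}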
PUTTINGITALLTOGETHER 8 6.72 5.84 2.24 10 8.64 7.03 3.19 Now that we have examined each point in the PAQ planning 15 9.52 10.44 4.48 design space individually, let us now evaluate end-to-end perfor- 20 14.37 13.75 4.41 manceoftheTUPAQplanner. By employing batching, using state-of-the-art search methods, (b)Speedupfactorvsfastestsequentialunbatchedmethodfor and using bandit resource allocation to terminate non-promising varyingbatchsizeandmodelcomplexity. models,weareabletoseea10xincreaseinrawthroughputofthe Figure6: Effectofbatchingisexaminedon16nodeswithasyn- systemintermsofmodelstrainedperunittime,whilefindingPAQ theticdataset. Speedupsdiminishbutremainsignificantasmodels plansthathaveasgoodorhigherqualitythanthosefoundwiththe increaseincomplexity. baselineapproach. WeevaluatedTUPAQonverylargescaledataproblems,atclus- tersizesrangingfrom16to128nodesanddatasetsrangingfrom InFigure6weshowthetotalthroughputofthesysteminterms 30GBtoover3TBinsize.Thesesizesrepresentthesizeoftheac- of models trained per hour varying the batch size and the model tualfeaturesthemodelwastrainedon,nottherawdatafromwhich complexity. Formodelstrainedonthesmallerdataset,weseethe thesefeatureswerederived. totalnumberofmodelsperhourcanincreasebyuptoafactorof 15 for large batch sizes. This effect should not be surprising, as 5.1 PlatformConfiguration theactualtimespentcomputingisontheorderofmillisecondsand WeevaluatedTUPAQonLinuxmachinesrunningunderAma- virtuallyallthetimegoestoschedulingtaskexecution. Initscur- zonEC2,instancetypec3.8xlarge.Thesemachineswerecon- rent implementation, due to these scheduling overheads, this im- figured with Redhat Enterprise Linux, Scala 2.10, version 1.9 of plementation of the algorithm under Spark will not outperform a the Anaconda python distribution from Continuum Analytics[1], singlemachineimplementationforadatasetthissmall.Wediscuss and Apache Spark 1.1.0. Additionally, we made use of Hadoop analternativeexecutionstrategythatwouldbetterutilizeclusterre- 1.0.4configuredonlocaldisksasourdatastoreforthelargescale sourcesforsituationswheretheinputdatasetissmallinSection7. experiments. Finally,weuseMLIasofcommit3e164a2d8casa Attheotherendofthespectrumintermsofdatasizeandmodel basisforTUPAQ. complexity,weseetheeffectsofschedulerdelaystarttolessen,and 5.1.1 ApacheSparkConfiguration As with any complex system, proper configuration of the plat- formtoexecuteagivenworkloadisnecessaryandApacheSparkis Comparison of TuPAQ's Matrix−Oriented Batching vs. the Naive Method 2500 l l noexception. Specifically—choosingacorrectBLASimplemen- tation,configuringsparktouseit,andpickingtherightbalanceof ned Per Hour00, d=10,000)12500000 l SltraTteUgPyAQ ed5xee.ta1ciu.l2stoorfthoEruxerapcdeosrnpifiemgrueerxnaettciaoulntoSarerpetruaopvcaeaislsanbtdoleoDkonacotraenqssiuedetessrta.bleeffort.Full Models Train=1,0000,01000 ll ll l l l l NAIVE ofTAhpeacchoemSppleatrek,syMstLelmib,inavnodlvMesLIa.SHcearlea,cwoederabnaesxepbeuriimlteonntstoopn ( l 16 and 128 machines. We used two datasets with two different 500 ll learningobjectivestoevaluateoursystematscale. l 5 10 15 20 Thefirstdatasetisapre-featurizedversionoftheImageNetLarge Batch Size ScaleVisualRecognitionChallenge2010(ILSVRC2010)dataset[12], Figure7: Leveraginghighperformancelinearalgebralibrariesfor featurizedusingaprocedureattributedto[19].Thisprocessyields batchingleadstosubstantialspeedupsvs. naivemethods. 
Atbot- adatasetwith160,000featuresandapproximately1,200,000ex- tom, we show models per hour via the fastest sequential (non- amples, or1.4TBofrawimagefeatures. Inour16-nodeexperi- batched) strategy and demonstrate a 5x improvement in through- mentswedownsampletothefirst16,000ofthesefeaturesanduse put. 20%ofthebasedatasetformodeltraining,whichisapproximately 30GB of data. In the 128-node experiments we train on the en- tiredataset.Weexplorefivehyperparametershere—oneparameter fortheclassifierwetrain—SVMorlogisticregression,aswellas learningrateandL2Regularizationparametersforeachmatching theaboveexperiments. Weallotabudgetof128modelfittingsto SearchMethod theproblem. Grid Random HyperOpt Optimization AsinFigure1b,wesearchforaPAQplancapableofdiscrimi- None 104.7 100.5 103.9 natingplantsfromnon-plantsgiventheseimagefeatures. Theim- BanditsOnly 31.3 29.7 50.5 ages are generally in 1000 base classes, but these classes form a BatchingOnly 31.3 32.1 31.8 hierarchy and thus can be mapped into plant vs. non-plant cate- gories. Baseline error for this modeling task is 14.2%, which is All(TUPAQ) 11.5 10.4 15.8 abitmoreskewedthanthepreviousexamples. Ourgoalistore- Figure 8: Learning time in minutes for a 128-configuration bud- ducevalidationerrorasmuchaspossible,butourexperiencewith getacrossvariousoptimizationlevelsforImageNetdata. Unopti- thisparticulardatasethasputalowerboundonvalidationerrorto mized, sequential execution takes over 100 minutes regardless of around9%accuracywithlinearclassificationmodels. searchprocedureused. Fullyoptimizedexecutioncanbeanorder Theseconddatasetisapre-featurizedversionoftheTIMITAcoustic- ofmagnitudefasterwithTUPAQ. Phonetic continuous speech corpus [24], featurized according to theproceduredescribedin[44]—yieldingroughly2,300,000ex- ampleseachhaving440features.Whilethisdatasetisquitesmall, in order to achieve strong performance on this dataset, other re- searchers have noted that Kernel Methods offer the best perfor- mance[31].Followingtheprocessof[43],thisinvolvesexpanding thefeaturespaceofthedatasetbynearlytwoordersofmagnitude, yieldingadatasetthathas204,800features.Again,thisisapproxi- mately3.4TBofspeechfeatures.Weexplorefivehyperparameters here—oneparameterdescribingthedistributionfamilyoftheran- SearchMethod SearchTime(m) TestError(%) domprojectionmatrix—inthiscaseCauchyorGaussian,thescale andskewofthesedistributions,aswellastheL2regularizationpa- Grid(unoptimized) 104.7 11.05 rameterforthismodel,whichwillhaveadifferentsettingforeach Random(optimized) 10.4 11.41 distribution. HyperOpt(optimized) 15.8 10.38 AnecessarypreconditiontosupportingPAQslikethoseinFig- Figure9: BothoptimizedHyperOptandRandomsearchperform ure1a,thisdatasetprovidesaexamplesoflabeledphonemes,and significantlyfasterthanunoptimizedGridsearch,whileHyperOpt ourchallengeistofindamodelcapableoflabelingphonemesgiven yieldsthebestmodelforthisimageclassificationproblem. someinputaudio.Baselineerrorforthismodelingtaskis95%,and state-of-the-artperformanceonthisdatasetis35%error[31]. 5.2 OptimizationEffects InFigure8wecanseetheeffectsofbatchingandbanditalloca- tiononthePAQplanningprocessfortheImageNetdataset.Specif- ically,giventhatwewanttoevaluatethefitnessof128models,it takesnearly2hourstofitall128modelsonthe30GBdatasetof dataonthe16nodecluster. 
By comparison, with the bandit rule and batching turned on, the system takes just 10 minutes to train a random search model to completion and a bit longer to train a HyperOpt model to completion—a 10x speedup in the case of random search and a 7x speedup in the case of HyperOpt. HyperOpt takes slightly longer because it does a good job of picking points that do not get eliminated early by the bandit strategy; that is, more of the models it proposes are promising enough to be trained to completion. Turning to Figure 9, we can see that on this dataset HyperOpt converges to the best answer in just 15 minutes, while random search converges to within 5% of the best test error achieved by grid search a full order of magnitude faster than the baseline approach.

Figure 10: Convergence of model accuracy on the ImageNet dataset (1.5TB), shown as the best validation error seen so far vs. time elapsed (m). Training a model with a budget of 32 function evaluations on a 1.2m × 160k dataset takes 90 minutes on a 128-node cluster with TUPAQ.

5.3 Large Scale Speech and Vision

Because we employ data-parallel versions of our learning algorithms, achieving horizontal scalability with additional compute resources is trivial. TUPAQ readily scales to multi-terabyte datasets that are an order of magnitude more complicated with respect to the feature space. For these experiments, we ran with the Ima-