Learning Over Joins
by
Arun Kumar

A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy (Computer Sciences)
at the
UNIVERSITY OF WISCONSIN–MADISON
2016

Date of final oral examination: 07/21/2016

The dissertation is approved by the following members of the Final Oral Committee:
Jeffrey Naughton, Professor Emeritus, Computer Sciences, UW-Madison; Google
Jignesh M. Patel, Professor, Computer Sciences, UW-Madison
C. David Page Jr., Professor, Biostatistics and Medical Informatics, UW-Madison
Christopher Ré, Assistant Professor, Computer Science, Stanford University
Stephen J. Wright, Professor, Computer Sciences, UW-Madison
Xiaojin Zhu, Associate Professor, Computer Sciences, UW-Madison

© Copyright by Arun Kumar 2016
All Rights Reserved

acknowledgments

No amount of words can fully express my infinite gratitude to my co-advisors, Jeff Naughton and Jignesh Patel. This dissertation would not have been possible without their unwavering support for my crazy ideas to explore topics that were outside of their core interests. Jeff’s incredible research wisdom and uncanny ability to understand his students’ situations have helped me innumerable times, as have Jignesh’s infectious go-getter spirit and remarkable grasp on grounding research with practical relevance. These are all attributes that I hope to emulate in my career. I think the most rewarding gift an advisor can give their student is the freedom to pursue their interests and collaborations, while staying closely engaged by giving honest advice and critical feedback. I am very fortunate that Jeff and Jignesh trusted me enough to give me this gift.

I am deeply grateful to Chris Ré for getting me started in data management research as a part of his research group. His patience and confidence in me early on were instrumental in getting me to continue in research. His outstanding ability to weave elegant theoretical insights with solid systems work across multiple areas is a skill that I hope to emulate.

I thank Steve Wright and Jerry Zhu for their collaborations and for serving on my committee.
Their deep insights about optimization and machine learning, their patience in explaining new concepts to me, and their honest feedback on my ideas were crucial for this research. I thank David Page for serving on my committee and for his insightful feedback on my talks.

I am also deeply grateful to David DeWitt and the Microsoft Jim Gray Systems Lab for funding my dissertation research and for giving me access to Microsoft’s resources without any strings attached. I thank David and the other members of the Lab for their continual feedback on my papers and talks, which is an integral part of the remarkable close-knit community environment of the Lab. I thank Robert McCann from Microsoft for our periodic insightful discussions on research and practice, for his feedback on my papers, and for helping to set up a productive research collaboration with Microsoft.

I was fortunate to be able to mentor a great set of students as part of my dissertation research: Lingjiao Chen, Zhiwei Fan, Mona Jalal, Fengan Li, Boqun Yan, and Fujie Zhan. It is a rewarding experience to work with such bright students and watch them mature as researchers. This is a key reason for me to want to continue in academic research. I am thankful to Jeff and Jignesh for giving me the opportunity to advise these students.

I am thankful to all of my other co-authors and research collaborators throughout my graduate school life. An incomplete list includes Mike Cafarella, Joe Hellerstein, Ben Recht, Aaron Feng, Pradap Konda, Feng Niu, and Ce Zhang. I am grateful to my friends, Bruhathi Sundarmurthy, Wentao Wu, and the other students of the Database Group, for their continual feedback on my ideas, papers and talks.

Finally, I am indebted beyond measure to my family, especially my father, Kumar, my brother, Balaji, and my sister-in-law, Indira, for their love, support, and advice throughout my graduate school life, especially during the tough times. I am grateful to my other close friends, who were always there to support and help me. An incomplete list includes Levent, Thanu, and Vijay. Last but definitely not the least, I am grateful to my wonderful fiancé, Wade, for his love and support during the crucial final stages of my dissertation research and my job search.
I look forward to an exciting journey with him as I move forward to a career in academia.

Overall, I am deeply fortunate to have had, and continue to have, such a great set of mentors and such a wonderful set of loved ones. I will never forget the support of all of these people with my dissertation research and I know I can count on them in any of my future endeavors too.

Most of this dissertation research was funded by a grant from the Microsoft Jim Gray Systems Lab. All views expressed in this work are those of the authors and do not necessarily reflect any views of Microsoft.

abstract

Advanced analytics using machine learning (ML) is increasingly critical for a wide variety of data-driven applications that underpin the modern world. Many real-world datasets have multiple tables with key-foreign key relationships, but most ML toolkits force data scientists to join them into a single table before using ML. This process of “learning after joins” introduces redundancy in the data, which results in storage and runtime inefficiencies, as well as data maintenance headaches for data scientists. To mitigate these issues, this dissertation introduces the paradigm of “learning over joins,” which includes two orthogonal techniques: avoiding joins physically and avoiding joins logically. The former shows how to push ML computations through joins to the base tables, which improves runtime performance without affecting ML accuracy. The latter shows that in many cases, it is possible, somewhat surprisingly, to ignore entire base tables without affecting ML accuracy significantly. Overall, our techniques help improve the usability and runtime performance of ML over multi-table datasets, sometimes by orders of magnitude, without degrading accuracy significantly. Our work forces one to rethink a prevalent practice in advanced analytics and opens up new connections between data management systems, database dependency theory, and machine learning.
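As a rough illustration of what pushing an ML computation through a join to the base tables can look like, the sketch below compares the inner products w · x computed on joined feature vectors with the same values computed from per-table partial inner products. The toy tables, the weights, and all function names are assumptions made for this example; it is a minimal sketch of the idea, not the implementation described in this dissertation.

```python
# Toy sketch (assumed schema, not the dissertation's implementation):
#   R maps a key RID to that tuple's feature vector x_R.
#   S is a list of (FK, x_S) pairs, where FK refers to an RID in R.

def dot(w, x):
    """Plain inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def inner_products_after_join(S, R, w_S, w_R):
    """Learning after joins: build each joined feature vector, then take w . x."""
    w = w_S + w_R
    return [dot(w, x_S + R[fk]) for fk, x_S in S]

def inner_products_over_join(S, R, w_S, w_R):
    """Learning over joins: compute w_R . x_R once per R tuple and reuse it."""
    partial_ip = {rid: dot(w_R, x_R) for rid, x_R in R.items()}
    return [dot(w_S, x_S) + partial_ip[fk] for fk, x_S in S]

# Hypothetical data: two R tuples, three S tuples (two of them share FK = 1).
R = {1: [1.0, 2.0], 2: [0.5, -1.0]}
S = [(1, [1.0]), (2, [2.0]), (1, [3.0])]
w_S, w_R = [2.0], [1.0, 0.5]

joined = inner_products_after_join(S, R, w_S, w_R)
factorized = inner_products_over_join(S, R, w_S, w_R)
print(joined == factorized)
```

In this toy setting both functions return the same values, but the second one touches each R tuple's features only once, regardless of how many S tuples refer to it. That reuse is the source of the savings when S is much larger than R.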
contents

Abstract
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Example
  1.2 Technical Contributions
  1.3 Summary and Impact
2 Preliminaries
  2.1 Problem Setup and Notation
  2.2 Background for Orion
  2.3 Background for Hamlet
3 Orion: Avoiding Joins Physically
  3.1 Learning Over Joins
  3.2 Factorized Learning
  3.3 Experiments
  3.4 Conclusion: Avoiding Joins Physically
4 Extensions and Generalization of Orion
  4.1 Extension: Probabilistic Classifiers Over Joins and Santoku
  4.2 Extension: Other Optimization Methods Over Joins
  4.3 Extension: Clustering Algorithms Over Joins
  4.4 Generalization: Linear Algebra Over Joins
5 Hamlet: Avoiding Joins Logically
  5.1 Effects of KFK Joins on ML
  5.2 Predicting a priori if it is Safe to Avoid a KFK Join
  5.3 Experiments on Real Data
  5.4 Conclusion: Avoiding Joins Logically
6 Related Work
  6.1 Related Work for Orion
  6.2 Related Work for Orion Extensions and Generalization
  6.3 Related Work for Hamlet
7 Conclusion and Future Work
References
A Appendix: Orion
  A.1 Proofs
  A.2 Additional Runtime Plots
  A.3 More Cost Models and Approaches
  A.4 Comparing Gradient Methods
B Appendix: Hamlet
  B.1 Proofs
  B.2 More Simulation Results
  B.3 Output Feature Sets

list of figures

1.1 Example scenario for ML over multi-table data.

3.1 Learning over a join: (A) Schema and logical workflow. Feature vectors from S (e.g., Customers) and R (e.g., Employers) are concatenated and used for BGD. The loss (F) and gradient (∇F) for BGD can be computed together during a pass over the data. Approaches compared: Materialize (M), Stream (S), Stream-Reuse (SR), and Factorized Learning (FL). High-level qualitative comparison of storage-runtime trade-offs and CPU-I/O cost trade-offs for runtimes of the four approaches. S is assumed to be larger than R, and the plots are not to scale. (B) When the hash table on R does not fit in buffer memory, S, SR, and M require extra storage space for temporary tables or partitions. But, SR could be faster than FL due to lower I/O costs.
(C) When the hash table on R fits in buffer memory, but S does not, SR becomes similar to S and neither need extra storage space, but both could be slower than FL. (D) When all data fit comfortably in buffer memory, none of the approaches need extra storage space, and M could be faster than FL.

3.2 Redundancy ratio against the two dimension ratios (for $d_S = 20$). (A) Fix $d_R / d_S$ and vary $n_S / n_R$. (B) Fix $n_S / n_R$ and vary $d_R / d_S$.

3.3 Logical workflow of factorized learning, consisting of three steps as numbered. HR and HS are logical intermediate relations. PartialIP refers to the partial inner products from R. SumScaledIP refers to the grouped sums of the scalar output of G() applied to the full inner products on the concatenated feature vectors. Here, $\gamma_{SUM}$ denotes a SUM aggregation and $\gamma_{SUM}(RID)$ denotes a SUM aggregation with a GROUP BY on RID.

3.4 Analytical cost model-based plots for varying the buffer memory ($m$). (A) Total time. (B) I/O time (with 100MB/s I/O rate). (C) CPU time (with 2.5GHz clock). The values fixed are $n_S = 10^8$ (in short, 1E8), $n_R$ = 1E7, $d_S = 40$, $d_R = 60$, and Iters = 20. Note that the x axes are in logscale.

3.5 Implementation-based performance against each of (1) tuple ratio ($n_S / n_R$), (2) feature ratio ($d_R / d_S$), and (3) number of iterations (Iters) – separated column-wise – for the (A) RSM, (B) RMM, and (C) RLM memory region – separated row-wise. SR is skipped for RMM and RLM since its runtime is very similar to S. The other parameters are fixed as per Table 3.3.

3.6 Analytical cost model-based plots of performance against each of (A) $n_S / n_R$, (B) $d_R / d_S$, and (C) Iters for the RSM region. The other parameters are fixed as per Table 3.3.

3.7 Analytical plots for when $m$ is insufficient for FL. We assume $m$ = 4GB, and plot the runtime against each of $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set ($n_S$, $n_R$, $d_S$, $d_R$, Iters) = (1E9, 2E8, 2, 6, 20).

3.8 Parallelism with Hive. (A) Speedup against cluster size (number of worker nodes) for ($n_S$, $n_R$, $d_S$, $d_R$, Iters) = (15E8, 5E6, 40, 120, 20). Each approach is compared to itself, e.g., FL on 24 nodes is 3.5x faster than FL on 8 nodes. The runtimes on 24 nodes were 7.4h for S, 9.5h for FL, and 23.5h for M. (B) Scaleup as both the cluster and dataset sizes are scaled. The inputs are the same as for (A) for 8 nodes, while $n_S$ is scaled. Thus, the size of T varies from 0.6TB to 1.8TB.

4.1 Illustration of Factorized Learning for Naive Bayes. (A) The base tables Customers (the “entity table” as defined in Kumar et al. [2015c]) and Employers (an “attribute table” as defined in Kumar et al. [2015c]). The target feature is Churn in Customers. (B) The denormalized table Temp. Naive Bayes computations using Temp have redundancy, as shown here for the conditional probability calculations for State and Size. (C) FL avoids computational redundancy by pre-counting references, which are stored in CustRefs, and by decomposing (“factorizing”) the sums using StateRefs and SizeRefs.

4.2 Screenshots of Santoku: (A) The GUI to load the datasets, specify the database dependencies, and train ML models. (B) Results of training a single model. (C) Results of feature exploration comparing multiple feature vectors. (D) An R script that performs these tasks programmatically from an R console using the Santoku API.

4.3 High-level architecture. Users interact with Santoku either using the GUI or R scripts. Santoku optimizes the computations using factorized learning and invokes an underlying R execution engine.

4.4 Results on real datasets for K-Means. The approaches compared are – M: Materialize (use the denormalized dataset), F: Factorized clustering, FRC: Factorized clustering with recoding (improves F), NC: Naive LZW compression, and OC: Optimized compression (improves NC).
4.5 Performance on real datasets for (A) Linear Regression, (B) Logistic Regression, (C) K-Means, and (D) GNMF. E, M, Y, W, L, B, and F correspond to the Expedia, Movies, Yelp, Walmart, LastFM, Books, and Flights dataset respectively. The number of iterations/centroids/topics is 20/5/5.

5.1 Illustrating the relationship between the decision rules to tell which joins are “safe to avoid.”

5.2 Relationship between hypothesis spaces.

5.3 Simulation results for the scenario in which only a single $X_r \in X_R$ is part of the true distribution, which has $P(Y = 0 \mid X_r = 0) = P(Y = 1 \mid X_r = 1) = p$. For these results, we set $p = 0.1$ (varying this probability did not change the overall trends). (A) Vary $n_S$, while fixing $(d_S, d_R, |D_{FK}|) = (2, 4, 40)$. (B) Vary $|D_{FK}|$ ($= n_R$), while fixing $(n_S, d_S, d_R) = (1000, 4, 4)$.

5.4 When $q^*_R = |D_{X^*_r}| \ll |D_{FK}|$, the ROR is high. When $q^*_R \approx |D_{FK}|$, the ROR is low. The TR rule cannot distinguish between these two scenarios.

5.5 Scatter plots based on all the results of the simulation experiments referred to by Figure 5.3. (A) Increase in test error caused by avoiding the join (denoted “∆Test error”) against ROR. (B) ∆Test error against tuple ratio. (C) ROR against inverse square root of tuple ratio.

5.6 End-to-end results on real data: Error after feature selection.

5.7 End-to-end results on real data: Runtime of feature selection.

5.8 Robustness: Holdout test errors after Forward Selection (FS) and Backward Selection (BS). The “plan” chosen by JoinOpt is highlighted, e.g., NoJoins on Walmart.

5.9 Sensitivity: We set $\rho = 2.5$ and $\tau = 20$. An attribute table is deemed “okay to avoid” if the increase in error was within 0.001 with either Forward Selection (FS) and Backward Selection (BS).

A.1 Implementation-based performance against each of (1) tuple ratio ($n_S / n_R$), (2) feature ratio ($d_R / d_S$), and (3) number of iterations (Iters) – separated column-wise – for (A) RMM, and (B) RLM – separated row-wise. SR is skipped since its runtime is very similar to S. The other parameters are fixed as per Table 3.3.

A.2 Analytical plots of runtime against each of (1) $n_S / n_R$, (2) $d_R / d_S$, and (3) Iters, for both the (A) RMM, and (B) RLM memory regions. The other parameters are fixed as per Table 3.3.

A.3 Analytical plots for the case when $|S| < |R|$ but $n_S > n_R$. We plot the runtime against each of $m$, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set ($m$, $n_S$, $n_R$, $d_S$, $d_R$, Iters) = (24GB, 1E8, 1E7, 6, 100, 20).

A.4 Analytical plots for the case when $n_S \le n_R$ (mostly). We plot the runtime against each of $m$, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set ($m$, $n_S$, $n_R$, $d_S$, $d_R$, Iters) = (24GB, 2E7, 5E7, 6, 9, 20).

