
Learning Over Joins by Arun Kumar A dissertation submitted in partial fulfillment of the ... PDF

125 Pages·2016·8.26 MB·English


Learning Over Joins

by

Arun Kumar

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy
(Computer Sciences)

at the

UNIVERSITY OF WISCONSIN–MADISON

2016

Date of final oral examination: 07/21/2016

The dissertation is approved by the following members of the Final Oral Committee:
Jeffrey Naughton, Professor Emeritus, Computer Sciences, UW-Madison; Google
Jignesh M. Patel, Professor, Computer Sciences, UW-Madison
C. David Page Jr., Professor, Biostatistics and Medical Informatics, UW-Madison
Christopher Ré, Assistant Professor, Computer Science, Stanford University
Stephen J. Wright, Professor, Computer Sciences, UW-Madison
Xiaojin Zhu, Associate Professor, Computer Sciences, UW-Madison

© Copyright by Arun Kumar 2016
All Rights Reserved

Acknowledgments

No amount of words can fully express my infinite gratitude to my co-advisors, Jeff Naughton and Jignesh Patel. This dissertation would not have been possible without their unwavering support for my crazy ideas to explore topics that were outside of their core interests. Jeff’s incredible research wisdom and uncanny ability to understand his students’ situations have helped me innumerable times, as have Jignesh’s infectious go-getter spirit and remarkable grasp on grounding research with practical relevance. These are all attributes that I hope to emulate in my career. I think the most rewarding gift an advisor can give their student is the freedom to pursue their interests and collaborations, while staying closely engaged by giving honest advice and critical feedback. I am very fortunate that Jeff and Jignesh trusted me enough to give me this gift.

I am deeply grateful to Chris Ré for getting me started in data management research as a part of his research group. His patience and confidence in me early on were instrumental in getting me to continue in research. His outstanding ability to weave elegant theoretical insights with solid systems work across multiple areas is a skill that I hope to emulate.

I thank Steve Wright and Jerry Zhu for their collaborations and for serving on my committee.
Their deep insights about optimization and machine learning, their patience in explaining new concepts to me, and their honest feedback on my ideas were crucial for this research. I thank David Page for serving on my committee and for his insightful feedback on my talks.

I am also deeply grateful to David DeWitt and the Microsoft Jim Gray Systems Lab for funding my dissertation research and for giving me access to Microsoft’s resources without any strings attached. I thank David and the other members of the Lab for their continual feedback on my papers and talks, which is an integral part of the remarkable close-knit community environment of the Lab. I thank Robert McCann from Microsoft for our periodic insightful discussions on research and practice, for his feedback on my papers, and for helping to set up a productive research collaboration with Microsoft.

I was fortunate to be able to mentor a great set of students as part of my dissertation research: Lingjiao Chen, Zhiwei Fan, Mona Jalal, Fengan Li, Boqun Yan, and Fujie Zhan. It is a rewarding experience to work with such bright students and watch them mature as researchers. This is a key reason for me to want to continue in academic research. I am thankful to Jeff and Jignesh for giving me the opportunity to advise these students.

I am thankful to all of my other co-authors and research collaborators throughout my graduate school life. An incomplete list includes Mike Cafarella, Joe Hellerstein, Ben Recht, Aaron Feng, Pradap Konda, Feng Niu, and Ce Zhang. I am grateful to my friends, Bruhathi Sundarmurthy, Wentao Wu, and the other students of the Database Group, for their continual feedback on my ideas, papers and talks.

Finally, I am indebted beyond measure to my family, especially my father, Kumar, my brother, Balaji, and my sister-in-law, Indira, for their love, support, and advice throughout my graduate school life, especially during the tough times. I am grateful to my other close friends, who were always there to support and help me. An incomplete list includes Levent, Thanu, and Vijay. Last but definitely not the least, I am grateful to my wonderful fiancé, Wade, for his love and support during the crucial final stages of my dissertation research and my job search.
I look forward to an exciting journey with him as I move forward to a career in academia.

Overall, I am deeply fortunate to have had, and continue to have, such a great set of mentors and such a wonderful set of loved ones. I will never forget the support of all of these people with my dissertation research and I know I can count on them in any of my future endeavors too.

Most of this dissertation research was funded by a grant from the Microsoft Jim Gray Systems Lab. All views expressed in this work are that of the authors and do not necessarily reflect any views of Microsoft.

Abstract

Advanced analytics using machine learning (ML) is increasingly critical for a wide variety of data-driven applications that underpin the modern world. Many real-world datasets have multiple tables with key-foreign key relationships, but most ML toolkits force data scientists to join them into a single table before using ML. This process of “learning after joins” introduces redundancy in the data, which results in storage and runtime inefficiencies, as well as data maintenance headaches for data scientists. To mitigate these issues, this dissertation introduces the paradigm of “learning over joins,” which includes two orthogonal techniques: avoiding joins physically and avoiding joins logically. The former shows how to push ML computations through joins to the base tables, which improves runtime performance without affecting ML accuracy. The latter shows that in many cases, it is possible, somewhat surprisingly, to ignore entire base tables without affecting ML accuracy significantly. Overall, our techniques help improve the usability and runtime performance of ML over multi-table datasets, sometimes by orders of magnitude, without degrading accuracy significantly. Our work forces one to rethink a prevalent practice in advanced analytics and opens up new connections between data management systems, database dependency theory, and machine learning.
Contents

Abstract
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Example
  1.2 Technical Contributions
  1.3 Summary and Impact
2 Preliminaries
  2.1 Problem Setup and Notation
  2.2 Background for Orion
  2.3 Background for Hamlet
3 Orion: Avoiding Joins Physically
  3.1 Learning Over Joins
  3.2 Factorized Learning
  3.3 Experiments
  3.4 Conclusion: Avoiding Joins Physically
4 Extensions and Generalization of Orion
  4.1 Extension: Probabilistic Classifiers Over Joins and Santoku
  4.2 Extension: Other Optimization Methods Over Joins
  4.3 Extension: Clustering Algorithms Over Joins
  4.4 Generalization: Linear Algebra Over Joins
5 Hamlet: Avoiding Joins Logically
  5.1 Effects of KFK Joins on ML
  5.2 Predicting a priori if it is Safe to Avoid a KFK Join
  5.3 Experiments on Real Data
  5.4 Conclusion: Avoiding Joins Logically
6 Related Work
  6.1 Related Work for Orion
  6.2 Related Work for Orion Extensions and Generalization
  6.3 Related Work for Hamlet
7 Conclusion and Future Work
References
A Appendix: Orion
  A.1 Proofs
  A.2 Additional Runtime Plots
  A.3 More Cost Models and Approaches
  A.4 Comparing Gradient Methods
B Appendix: Hamlet
  B.1 Proofs
  B.2 More Simulation Results
  B.3 Output Feature Sets

List of Figures

1.1 Example scenario for ML over multi-table data.

3.1 Learning over a join: (A) Schema and logical workflow. Feature vectors from S (e.g., Customers) and R (e.g., Employers) are concatenated and used for BGD. The loss (F) and gradient (∇F) for BGD can be computed together during a pass over the data. Approaches compared: Materialize (M), Stream (S), Stream-Reuse (SR), and Factorized Learning (FL). High-level qualitative comparison of storage-runtime trade-offs and CPU-I/O cost trade-offs for runtimes of the four approaches. S is assumed to be larger than R, and the plots are not to scale. (B) When the hash table on R does not fit in buffer memory, S, SR, and M require extra storage space for temporary tables or partitions. But, SR could be faster than FL due to lower I/O costs.
(C) When the hash table on R fits in buffer memory, but S does not, SR becomes similar to S and neither need extra storage space, but both could be slower than FL. (D) When all data fit comfortably in buffer memory, none of the approaches need extra storage space, and M could be faster than FL.

3.2 Redundancy ratio against the two dimension ratios (for $d_S = 20$). (A) Fix $d_R/d_S$ and vary $n_S/n_R$. (B) Fix $n_S/n_R$ and vary $d_R/d_S$.

3.3 Logical workflow of factorized learning, consisting of three steps as numbered. HR and HS are logical intermediate relations. PartialIP refers to the partial inner products from R. SumScaledIP refers to the grouped sums of the scalar output of G() applied to the full inner products on the concatenated feature vectors. Here, $\gamma_{SUM}$ denotes a SUM aggregation and $\gamma_{SUM}(RID)$ denotes a SUM aggregation with a GROUP BY on RID.

3.4 Analytical cost model-based plots for varying the buffer memory (m). (A) Total time. (B) I/O time (with 100MB/s I/O rate). (C) CPU time (with 2.5GHz clock). The values fixed are $n_S = 10^8$ (in short, 1E8), $n_R$ = 1E7, $d_S$ = 40, $d_R$ = 60, and Iters = 20. Note that the x axes are in logscale.

3.5 Implementation-based performance against each of (1) tuple ratio ($n_S/n_R$), (2) feature ratio ($d_R/d_S$), and (3) number of iterations (Iters) – separated column-wise – for the (A) RSM, (B) RMM, and (C) RLM memory region – separated row-wise. SR is skipped for RMM and RLM since its runtime is very similar to S. The other parameters are fixed as per Table 3.3.

3.6 Analytical cost model-based plots of performance against each of (A) $n_S/n_R$, (B) $d_R/d_S$, and (C) Iters for the RSM region. The other parameters are fixed as per Table 3.3.

3.7 Analytical plots for when m is insufficient for FL. We assume m = 4GB, and plot the runtime against each of $n_S$, $d_R$, Iters, and $n_R$, while fixing the others.
Wherever they are fixed, we set ($n_S$, $n_R$, $d_S$, $d_R$, Iters) = (1E9, 2E8, 2, 6, 20).

3.8 Parallelism with Hive. (A) Speedup against cluster size (number of worker nodes) for ($n_S$, $n_R$, $d_S$, $d_R$, Iters) = (15E8, 5E6, 40, 120, 20). Each approach is compared to itself, e.g., FL on 24 nodes is 3.5x faster than FL on 8 nodes. The runtimes on 24 nodes were 7.4h for S, 9.5h for FL, and 23.5h for M. (B) Scaleup as both the cluster and dataset sizes are scaled. The inputs are the same as for (A) for 8 nodes, while $n_S$ is scaled. Thus, the size of T varies from 0.6TB to 1.8TB.

4.1 Illustration of Factorized Learning for Naive Bayes. (A) The base tables Customers (the “entity table” as defined in Kumar et al. [2015c]) and Employers (an “attribute table” as defined in Kumar et al. [2015c]). The target feature is Churn in Customers. (B) The denormalized table Temp. Naive Bayes computations using Temp have redundancy, as shown here for the conditional probability calculations for State and Size. (C) FL avoids computational redundancy by pre-counting references, which are stored in CustRefs, and by decomposing (“factorizing”) the sums using StateRefs and SizeRefs.

4.2 Screenshots of Santoku: (A) The GUI to load the datasets, specify the database dependencies, and train ML models. (B) Results of training a single model. (C) Results of feature exploration comparing multiple feature vectors. (D) An R script that performs these tasks programmatically from an R console using the Santoku API.

4.3 High-level architecture. Users interact with Santoku either using the GUI or R scripts. Santoku optimizes the computations using factorized learning and invokes an underlying R execution engine.

4.4 Results on real datasets for K-Means. The approaches compared are – M: Materialize (use the denormalized dataset), F: Factorized clustering, FRC: Factorized clustering with recoding (improves F), NC: Naive LZW compression, and OC: Optimized compression (improves NC).

4.5 Performance on real datasets for (A) Linear Regression, (B) Logistic Regression, (C) K-Means, and (D) GNMF. E, M, Y, W, L, B, and F correspond to the Expedia, Movies, Yelp, Walmart, LastFM, Books, and Flights dataset respectively. The number of iterations/centroids/topics is 20/5/5.

5.1 Illustrating the relationship between the decision rules to tell which joins are “safe to avoid.”

5.2 Relationship between hypothesis spaces.

5.3 Simulation results for the scenario in which only a single $X_r \in \mathbf{X}_R$ is part of the true distribution, which has $P(Y = 0 \mid X_r = 0) = P(Y = 1 \mid X_r = 1) = p$. For these results, we set $p = 0.1$ (varying this probability did not change the overall trends). (A) Vary $n_S$, while fixing $(d_S, d_R, |D_{FK}|) = (2, 4, 40)$. (B) Vary $|D_{FK}|$ ($= n_R$), while fixing $(n_S, d_S, d_R) = (1000, 4, 4)$.

5.4 When $q_R^* = |D_{X_r^*}| \ll |D_{FK}|$, the ROR is high. When $q_R^* \approx |D_{FK}|$, the ROR is low. The TR rule cannot distinguish between these two scenarios.

5.5 Scatter plots based on all the results of the simulation experiments referred to by Figure 5.3. (A) Increase in test error caused by avoiding the join (denoted “ΔTest error”) against ROR. (B) ΔTest error against tuple ratio. (C) ROR against inverse square root of tuple ratio.

5.6 End-to-end results on real data: Error after feature selection.

5.7 End-to-end results on real data: Runtime of feature selection.

5.8 Robustness: Holdout test errors after Forward Selection (FS) and Backward Selection (BS). The “plan” chosen by JoinOpt is highlighted, e.g., NoJoins on Walmart.

5.9 Sensitivity: We set ρ = 2.5 and τ = 20. An attribute table is deemed “okay to avoid” if the increase in error was within 0.001 with either Forward Selection (FS) and Backward Selection (BS).

A.1 Implementation-based performance against each of (1) tuple ratio ($n_S/n_R$), (2) feature ratio ($d_R/d_S$), and (3) number of iterations (Iters) – separated column-wise – for (A) RMM, and (B) RLM – separated row-wise. SR is skipped since its runtime is very similar to S. The other parameters are fixed as per Table 3.3.

A.2 Analytical plots of runtime against each of (1) $n_S/n_R$, (2) $d_R/d_S$, and (3) Iters, for both the (A) RMM, and (B) RLM memory regions. The other parameters are fixed as per Table 3.3.

A.3 Analytical plots for the case when $|S| < |R|$ but $n_S > n_R$. We plot the runtime against each of m, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set (m, $n_S$, $n_R$, $d_S$, $d_R$, Iters) = (24GB, 1E8, 1E7, 6, 100, 20).

A.4 Analytical plots for the case when $n_S \leq n_R$ (mostly). We plot the runtime against each of m, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set (m, $n_S$, $n_R$, $d_S$, $d_R$, Iters) = (24GB, 2E7, 5E7, 6, 9, 20).
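
The three-step factorized learning workflow summarized in the caption of Figure 3.3 (partial inner products from R, then sums of the scaled values grouped by RID) can be illustrated with a small sketch. This is not code from the dissertation: the toy tables, the choice of logistic loss for G(), and every function and variable name below are assumptions made for illustration only. The sketch computes one batch gradient twice, once over the materialized join and once in factorized form, so the two can be compared.

```python
import math

# Hypothetical toy data (not from the dissertation).
# R: attribute table, RID -> feature vector x_R
R = {
    1: [1.0, 0.5],
    2: [0.0, 2.0],
}
# S: entity table, rows of (label y in {-1, +1}, feature vector x_S, foreign key RID)
S = [
    (+1, [0.2], 1),
    (-1, [1.5], 1),
    (+1, [0.7], 2),
    (-1, [0.1], 2),
    (+1, [0.9], 1),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def g(y, ip):
    """Scalar gradient factor of the logistic loss log(1 + exp(-y * ip))."""
    return -y / (1.0 + math.exp(y * ip))

def gradient_materialized(w_s, w_r):
    """Baseline: build every joined tuple, then accumulate the gradient."""
    w = w_s + w_r
    grad = [0.0] * len(w)
    for y, x_s, rid in S:
        x = x_s + R[rid]  # concatenated feature vector of the joined tuple
        scale = g(y, dot(w, x))
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    return grad

def gradient_factorized(w_s, w_r):
    """Factorized form: the joined tuples are never constructed."""
    # Step 1: one partial inner product per R tuple.
    partial_ip = {rid: dot(w_r, x_r) for rid, x_r in R.items()}
    # Step 2: scan S, complete each inner product, group the scalars by RID.
    grad_s = [0.0] * len(w_s)
    sum_scaled = {rid: 0.0 for rid in R}
    for y, x_s, rid in S:
        scale = g(y, dot(w_s, x_s) + partial_ip[rid])
        for j, xj in enumerate(x_s):
            grad_s[j] += scale * xj
        sum_scaled[rid] += scale
    # Step 3: scan R once, scaling each x_R by its grouped sum.
    grad_r = [0.0] * len(w_r)
    for rid, x_r in R.items():
        for j, xj in enumerate(x_r):
            grad_r[j] += sum_scaled[rid] * xj
    return grad_s + grad_r

w_s, w_r = [0.3], [-0.2, 0.4]
grad_m = gradient_materialized(w_s, w_r)
grad_f = gradient_factorized(w_s, w_r)
```

In this toy setting `grad_m` and `grad_f` agree up to floating-point rounding, while the factorized version touches each R feature vector once per pass instead of once per matching S row.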
