Table of Contents

Learning Over Joins
by
Arun Kumar

A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy (Computer Sciences)
at the
UNIVERSITY OF WISCONSIN–MADISON
2016

Date of final oral examination: 07/21/2016

The dissertation is approved by the following members of the Final Oral Committee:
Jeffrey Naughton, Professor Emeritus, Computer Sciences, UW-Madison; Google
Jignesh M. Patel, Professor, Computer Sciences, UW-Madison
C. David Page Jr., Professor, Biostatistics and Medical Informatics, UW-Madison
Christopher Ré, Assistant Professor, Computer Science, Stanford University
Stephen J. Wright, Professor, Computer Sciences, UW-Madison
Xiaojin Zhu, Associate Professor, Computer Sciences, UW-Madison

© Copyright by Arun Kumar 2016
All Rights Reserved

Acknowledgments

No amount of words can fully express my infinite gratitude to my co-advisors, Jeff Naughton and Jignesh Patel. This dissertation would not have been possible without their unwavering support for my crazy ideas to explore topics that were outside of their core interests. Jeff’s incredible research wisdom and uncanny ability to understand his students’ situations have helped me innumerable times, as have Jignesh’s infectious go-getter spirit and remarkable grasp on grounding research with practical relevance. These are all attributes that I hope to emulate in my career. I think the most rewarding gift an advisor can give their student is the freedom to pursue their interests and collaborations, while staying closely engaged by giving honest advice and critical feedback. I am very fortunate that Jeff and Jignesh trusted me enough to give me this gift.

I am deeply grateful to Chris Ré for getting me started in data management research as a part of his research group. His patience and confidence in me early on were instrumental in getting me to continue in research. His outstanding ability to weave elegant theoretical insights with solid systems work across multiple areas is a skill that I hope to emulate.

I thank Steve Wright and Jerry Zhu for their collaborations and for serving on my committee.
Their deep insights about optimization and machine learning, their patience in explaining new concepts to me, and their honest feedback on my ideas were crucial for this research. I thank David Page for serving on my committee and for his insightful feedback on my talks.

I am also deeply grateful to David DeWitt and the Microsoft Jim Gray Systems Lab for funding my dissertation research and for giving me access to Microsoft’s resources without any strings attached. I thank David and the other members of the Lab for their continual feedback on my papers and talks, which is an integral part of the remarkable close-knit community environment of the Lab. I thank Robert McCann from Microsoft for our periodic insightful discussions on research and practice, for his feedback on my papers, and for helping to set up a productive research collaboration with Microsoft.

I was fortunate to be able to mentor a great set of students as part of my dissertation research: Lingjiao Chen, Zhiwei Fan, Mona Jalal, Fengan Li, Boqun Yan, and Fujie Zhan. It is a rewarding experience to work with such bright students and watch them mature as researchers. This is a key reason for me to want to continue in academic research. I am thankful to Jeff and Jignesh for giving me the opportunity to advise these students.

I am thankful to all of my other co-authors and research collaborators throughout my graduate school life. An incomplete list includes Mike Cafarella, Joe Hellerstein, Ben Recht, Aaron Feng, Pradap Konda, Feng Niu, and Ce Zhang. I am grateful to my friends, Bruhathi Sundarmurthy, Wentao Wu, and the other students of the Database Group, for their continual feedback on my ideas, papers and talks.

Finally, I am indebted beyond measure to my family, especially my father, Kumar, my brother, Balaji, and my sister-in-law, Indira, for their love, support, and advice throughout my graduate school life, especially during the tough times. I am grateful to my other close friends, who were always there to support and help me. An incomplete list includes Levent, Thanu, and Vijay. Last but definitely not the least, I am grateful to my wonderful fiancé, Wade, for his love and support during the crucial final stages of my dissertation research and my job search.
I look forward to an exciting journey with him as I move forward to a career in academia.

Overall, I am deeply fortunate to have had, and continue to have, such a great set of mentors and such a wonderful set of loved ones. I will never forget the support of all of these people with my dissertation research and I know I can count on them in any of my future endeavors too.

Most of this dissertation research was funded by a grant from the Microsoft Jim Gray Systems Lab. All views expressed in this work are that of the authors and do not necessarily reflect any views of Microsoft.

Abstract

Advanced analytics using machine learning (ML) is increasingly critical for a wide variety of data-driven applications that underpin the modern world. Many real-world datasets have multiple tables with key-foreign key relationships, but most ML toolkits force data scientists to join them into a single table before using ML. This process of “learning after joins” introduces redundancy in the data, which results in storage and runtime inefficiencies, as well as data maintenance headaches for data scientists. To mitigate these issues, this dissertation introduces the paradigm of “learning over joins,” which includes two orthogonal techniques: avoiding joins physically and avoiding joins logically. The former shows how to push ML computations through joins to the base tables, which improves runtime performance without affecting ML accuracy. The latter shows that in many cases, it is possible, somewhat surprisingly, to ignore entire base tables without affecting ML accuracy significantly. Overall, our techniques help improve the usability and runtime performance of ML over multi-table datasets, sometimes by orders of magnitude, without degrading accuracy significantly. Our work forces one to rethink a prevalent practice in advanced analytics and opens up new connections between data management systems, database dependency theory, and machine learning.
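The idea of pushing ML computations through a join to the base tables can be illustrated with a small sketch. It is not the dissertation's implementation: it only shows one building block, the inner product between a weight vector and a joined feature vector, and it assumes hypothetical toy tables `S` (entity rows with a foreign key list `fk`) and `R` (attribute rows keyed by ID), with illustrative weight vectors `w_s` and `w_r`. The factorized version computes one partial inner product per row of `R` and reuses it for every row of `S` that references it, so the joined table is never built.

```python
import random

def inner(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def scores_materialized(S, fk, R, w_s, w_r):
    """Join first: concatenate each S row with the R row it references."""
    joined = [xs + R[k] for xs, k in zip(S, fk)]
    w = w_s + w_r
    return [inner(w, x) for x in joined]

def scores_factorized(S, fk, R, w_s, w_r):
    """Avoid the join: one partial inner product per R row, looked up via the key."""
    partial = {rid: inner(w_r, xr) for rid, xr in R.items()}
    return [inner(w_s, xs) + partial[k] for xs, k in zip(S, fk)]

# Hypothetical toy data: 4 attribute rows (3 features), 10 entity rows (2 features).
random.seed(0)
R = {rid: [random.random() for _ in range(3)] for rid in range(4)}
S = [[random.random() for _ in range(2)] for _ in range(10)]
fk = [random.randrange(4) for _ in S]
w_s, w_r = [0.5, -1.0], [2.0, 0.25, -0.75]

m = scores_materialized(S, fk, R, w_s, w_r)
f = scores_factorized(S, fk, R, w_s, w_r)
# Both routes agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(m, f))
```

In this sketch the saving comes from computing `inner(w_r, xr)` once per row of `R` instead of once per row of `S`; how much that helps in practice depends on the data sizes and memory, which is what the dissertation's cost models and experiments examine.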
Contents

Abstract
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Example
  1.2 Technical Contributions
  1.3 Summary and Impact
2 Preliminaries
  2.1 Problem Setup and Notation
  2.2 Background for Orion
  2.3 Background for Hamlet
3 Orion: Avoiding Joins Physically
  3.1 Learning Over Joins
  3.2 Factorized Learning
  3.3 Experiments
  3.4 Conclusion: Avoiding Joins Physically
4 Extensions and Generalization of Orion
  4.1 Extension: Probabilistic Classifiers Over Joins and Santoku
  4.2 Extension: Other Optimization Methods Over Joins
  4.3 Extension: Clustering Algorithms Over Joins
  4.4 Generalization: Linear Algebra Over Joins
5 Hamlet: Avoiding Joins Logically
  5.1 Effects of KFK Joins on ML
  5.2 Predicting a priori if it is Safe to Avoid a KFK Join
  5.3 Experiments on Real Data
  5.4 Conclusion: Avoiding Joins Logically
6 Related Work
  6.1 Related Work for Orion
  6.2 Related Work for Orion Extensions and Generalization
  6.3 Related Work for Hamlet
7 Conclusion and Future Work
References
A Appendix: Orion
  A.1 Proofs
  A.2 Additional Runtime Plots
  A.3 More Cost Models and Approaches
  A.4 Comparing Gradient Methods
B Appendix: Hamlet
  B.1 Proofs
  B.2 More Simulation Results
  B.3 Output Feature Sets

List of Figures

1.1 Example scenario for ML over multi-table data.

3.1 Learning over a join: (A) Schema and logical workflow. Feature vectors from S (e.g., Customers) and R (e.g., Employers) are concatenated and used for BGD. The loss (F) and gradient (∇F) for BGD can be computed together during a pass over the data. Approaches compared: Materialize (M), Stream (S), Stream-Reuse (SR), and Factorized Learning (FL). High-level qualitative comparison of storage-runtime trade-offs and CPU-I/O cost trade-offs for runtimes of the four approaches. S is assumed to be larger than R, and the plots are not to scale. (B) When the hash table on R does not fit in buffer memory, S, SR, and M require extra storage space for temporary tables or partitions. But, SR could be faster than FL due to lower I/O costs. (C) When the hash table on R fits in buffer memory, but S does not, SR becomes similar to S and neither need extra storage space, but both could be slower than FL. (D) When all data fit comfortably in buffer memory, none of the approaches need extra storage space, and M could be faster than FL.

3.2 Redundancy ratio against the two dimension ratios (for $d_S = 20$). (A) Fix $d_R/d_S$ and vary $n_S/n_R$. (B) Fix $n_S/n_R$ and vary $d_R/d_S$.

3.3 Logical workflow of factorized learning, consisting of three steps as numbered. HR and HS are logical intermediate relations. PartialIP refers to the partial inner products from R. SumScaledIP refers to the grouped sums of the scalar output of G() applied to the full inner products on the concatenated feature vectors. Here, $\gamma_{SUM}$ denotes a SUM aggregation and $\gamma_{SUM}(RID)$ denotes a SUM aggregation with a GROUP BY on RID.

3.4 Analytical cost model-based plots for varying the buffer memory (m). (A) Total time. (B) I/O time (with 100MB/s I/O rate). (C) CPU time (with 2.5GHz clock). The values fixed are $n_S = 10^8$ (in short, 1E8), $n_R$ = 1E7, $d_S = 40$, $d_R = 60$, and Iters = 20. Note that the x axes are in logscale.

3.5 Implementation-based performance against each of (1) tuple ratio ($n_S/n_R$), (2) feature ratio ($d_R/d_S$), and (3) number of iterations (Iters), separated column-wise, for the (A) RSM, (B) RMM, and (C) RLM memory region, separated row-wise. SR is skipped for RMM and RLM since its runtime is very similar to S. The other parameters are fixed as per Table 3.3.

3.6 Analytical cost model-based plots of performance against each of (A) $n_S/n_R$, (B) $d_R/d_S$, and (C) Iters for the RSM region. The other parameters are fixed as per Table 3.3.

3.7 Analytical plots for when m is insufficient for FL. We assume m = 4GB, and plot the runtime against each of $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set $(n_S, n_R, d_S, d_R, Iters)$ = (1E9, 2E8, 2, 6, 20).

3.8 Parallelism with Hive. (A) Speedup against cluster size (number of worker nodes) for $(n_S, n_R, d_S, d_R, Iters)$ = (15E8, 5E6, 40, 120, 20). Each approach is compared to itself, e.g., FL on 24 nodes is 3.5x faster than FL on 8 nodes. The runtimes on 24 nodes were 7.4h for S, 9.5h for FL, and 23.5h for M. (B) Scaleup as both the cluster and dataset sizes are scaled. The inputs are the same as for (A) for 8 nodes, while $n_S$ is scaled. Thus, the size of T varies from 0.6TB to 1.8TB.

4.1 Illustration of Factorized Learning for Naive Bayes. (A) The base tables Customers (the “entity table” as defined in Kumar et al. [2015c]) and Employers (an “attribute table” as defined in Kumar et al. [2015c]). The target feature is Churn in Customers. (B) The denormalized table Temp. Naive Bayes computations using Temp have redundancy, as shown here for the conditional probability calculations for State and Size. (C) FL avoids computational redundancy by pre-counting references, which are stored in CustRefs, and by decomposing (“factorizing”) the sums using StateRefs and SizeRefs.

4.2 Screenshots of Santoku: (A) The GUI to load the datasets, specify the database dependencies, and train ML models. (B) Results of training a single model. (C) Results of feature exploration comparing multiple feature vectors. (D) An R script that performs these tasks programmatically from an R console using the Santoku API.

4.3 High-level architecture. Users interact with Santoku either using the GUI or R scripts. Santoku optimizes the computations using factorized learning and invokes an underlying R execution engine.

4.4 Results on real datasets for K-Means. The approaches compared are M: Materialize (use the denormalized dataset), F: Factorized clustering, FRC: Factorized clustering with recoding (improves F), NC: Naive LZW compression, and OC: Optimized compression (improves NC).

4.5 Performance on real datasets for (A) Linear Regression, (B) Logistic Regression, (C) K-Means, and (D) GNMF. E, M, Y, W, L, B, and F correspond to the Expedia, Movies, Yelp, Walmart, LastFM, Books, and Flights dataset respectively. The number of iterations/centroids/topics is 20/5/5.

5.1 Illustrating the relationship between the decision rules to tell which joins are “safe to avoid.”

5.2 Relationship between hypothesis spaces.

5.3 Simulation results for the scenario in which only a single $X_r \in X_R$ is part of the true distribution, which has $P(Y = 0 | X_r = 0) = P(Y = 1 | X_r = 1) = p$. For these results, we set p = 0.1 (varying this probability did not change the overall trends). (A) Vary $n_S$, while fixing $(d_S, d_R, |D_{FK}|)$ = (2, 4, 40). (B) Vary $|D_{FK}|$ (= $n_R$), while fixing $(n_S, d_S, d_R)$ = (1000, 4, 4).

5.4 When $q_R^* = |D_{X_r^*}| \ll |D_{FK}|$, the ROR is high. When $q_R^* \approx |D_{FK}|$, the ROR is low. The TR rule cannot distinguish between these two scenarios.

5.5 Scatter plots based on all the results of the simulation experiments referred to by Figure 5.3. (A) Increase in test error caused by avoiding the join (denoted “ΔTest error”) against ROR. (B) ΔTest error against tuple ratio. (C) ROR against inverse square root of tuple ratio.

5.6 End-to-end results on real data: Error after feature selection.

5.7 End-to-end results on real data: Runtime of feature selection.

5.8 Robustness: Holdout test errors after Forward Selection (FS) and Backward Selection (BS). The “plan” chosen by JoinOpt is highlighted, e.g., NoJoins on Walmart.

5.9 Sensitivity: We set ρ = 2.5 and τ = 20. An attribute table is deemed “okay to avoid” if the increase in error was within 0.001 with either Forward Selection (FS) or Backward Selection (BS).

A.1 Implementation-based performance against each of (1) tuple ratio ($n_S/n_R$), (2) feature ratio ($d_R/d_S$), and (3) number of iterations (Iters), separated column-wise, for (A) RMM, and (B) RLM, separated row-wise. SR is skipped since its runtime is very similar to S. The other parameters are fixed as per Table 3.3.

A.2 Analytical plots of runtime against each of (1) $n_S/n_R$, (2) $d_R/d_S$, and (3) Iters, for both the (A) RMM, and (B) RLM memory regions. The other parameters are fixed as per Table 3.3.

A.3 Analytical plots for the case when $|S| < |R|$ but $n_S > n_R$. We plot the runtime against each of m, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set $(m, n_S, n_R, d_S, d_R, Iters)$ = (24GB, 1E8, 1E7, 6, 100, 20).

A.4 Analytical plots for the case when $n_S \le n_R$ (mostly). We plot the runtime against each of m, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever they are fixed, we set $(m, n_S, n_R, d_S, d_R, Iters)$ = (24GB, 2E7, 5E7, 6, 9, 20).


Learning Over Joins by Arun Kumar A dissertation submitted in partial fulfillment of the ... PDF

125 Pages·2016·8.26 MB·English


Detailed Information

Author:	Arun Kumar
Publication Year:	2016
Pages:	125
Language:	English
File Size:	8.26 MB
Format:	PDF
Price:	FREE

