Learning Over Joins
by
Arun Kumar
A dissertation submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy
(Computer Sciences)
at the
UNIVERSITY OF WISCONSIN–MADISON
2016
Date of final oral examination: 07/21/2016
The dissertation is approved by the following members of the Final Oral Committee:
Jeffrey Naughton, Professor Emeritus, Computer Sciences, UW-Madison; Google
Jignesh M. Patel, Professor, Computer Sciences, UW-Madison
C. David Page Jr., Professor, Biostatistics and Medical Informatics, UW-Madison
Christopher Ré, Assistant Professor, Computer Science, Stanford University
Stephen J. Wright, Professor, Computer Sciences, UW-Madison
Xiaojin Zhu, Associate Professor, Computer Sciences, UW-Madison
© Copyright by Arun Kumar 2016
All Rights Reserved
acknowledgments
No amount of words can fully express my infinite gratitude to my co-advisors, Jeff
Naughton and Jignesh Patel. This dissertation would not have been possible without
their unwavering support for my crazy ideas to explore topics that were outside of their
core interests. Jeff’s incredible research wisdom and uncanny ability to understand his
students’ situations have helped me innumerable times, as have Jignesh’s infectious go-getter
spirit and remarkable grasp on grounding research with practical relevance. These are all
attributes that I hope to emulate in my career. I think the most rewarding gift an advisor
can give their student is the freedom to pursue their interests and collaborations, while
staying closely engaged by giving honest advice and critical feedback. I am very fortunate
that Jeff and Jignesh trusted me enough to give me this gift.
I am deeply grateful to Chris Ré for getting me started in data management research as
a part of his research group. His patience and confidence in me early on were instrumental
in getting me to continue in research. His outstanding ability to weave elegant theoretical
insights with solid systems work across multiple areas is a skill that I hope to emulate.
I thank Steve Wright and Jerry Zhu for their collaborations and for serving on my
committee. Their deep insights about optimization and machine learning, their patience
in explaining new concepts to me, and their honest feedback on my ideas were crucial
for this research. I thank David Page for serving on my committee and for his insightful
feedback on my talks.
I am also deeply grateful to David DeWitt and the Microsoft Jim Gray Systems Lab
for funding my dissertation research and for giving me access to Microsoft’s resources
without any strings attached. I thank David and the other members of the Lab for their
continual feedback on my papers and talks, which is an integral part of the remarkable
close-knit community environment of the Lab. I thank Robert McCann from Microsoft
for our periodic insightful discussions on research and practice, for his feedback on my
papers, and for helping to set up a productive research collaboration with Microsoft.
I was fortunate to be able to mentor a great set of students as part of my dissertation
research: Lingjiao Chen, Zhiwei Fan, Mona Jalal, Fengan Li, Boqun Yan, and Fujie Zhan. It
is a rewarding experience to work with such bright students and watch them mature as
researchers. This is a key reason for me to want to continue in academic research. I am
thankful to Jeff and Jignesh for giving me the opportunity to advise these students.
I am thankful to all of my other co-authors and research collaborators throughout
my graduate school life. An incomplete list includes Mike Cafarella, Joe Hellerstein, Ben
Recht, Aaron Feng, Pradap Konda, Feng Niu, and Ce Zhang. I am grateful to my friends,
Bruhathi Sundarmurthy, Wentao Wu, and the other students of the Database Group, for
their continual feedback on my ideas, papers and talks.
Finally, I am indebted beyond measure to my family, especially my father, Kumar, my
brother, Balaji, and my sister-in-law, Indira, for their love, support, and advice throughout
my graduate school life, especially during the tough times. I am grateful to my other
close friends, who were always there to support and help me. An incomplete list includes
Levent, Thanu, and Vijay. Last but definitely not the least, I am grateful to my wonderful
fiancé, Wade, for his love and support during the crucial final stages of my dissertation
research and my job search. I look forward to an exciting journey with him as I move
forward to a career in academia.
Overall, I am deeply fortunate to have had, and continue to have, such a great set of
mentors and such a wonderful set of loved ones. I will never forget the support of all of
these people with my dissertation research and I know I can count on them in any of my
future endeavors too.
Most of this dissertation research was funded by a grant from the Microsoft Jim Gray
Systems Lab. All views expressed in this work are those of the authors and do not necessarily
reflect any views of Microsoft.
abstract
Advanced analytics using machine learning (ML) is increasingly critical for a wide variety
of data-driven applications that underpin the modern world. Many real-world datasets
have multiple tables with key-foreign key relationships, but most ML toolkits force data
scientists to join them into a single table before using ML. This process of “learning
after joins” introduces redundancy in the data, which results in storage and runtime
inefficiencies, as well as data maintenance headaches for data scientists. To mitigate these
issues, this dissertation introduces the paradigm of “learning over joins,” which includes
two orthogonal techniques: avoiding joins physically and avoiding joins logically. The former
shows how to push ML computations through joins to the base tables, which improves
runtime performance without affecting ML accuracy. The latter shows that in many cases,
it is possible, somewhat surprisingly, to ignore entire base tables without affecting ML
accuracy significantly. Overall, our techniques help improve the usability and runtime
performance of ML over multi-table datasets, sometimes by orders of magnitude, without
degrading accuracy significantly. Our work forces one to rethink a prevalent practice in
advanced analytics and opens up new connections between data management systems,
database dependency theory, and machine learning.
contents
Abstract iii
Contents iv
List of Figures vi
List of Tables x
1 Introduction 1
1.1 Example 1
1.2 Technical Contributions 3
1.3 Summary and Impact 5
2 Preliminaries 7
2.1 Problem Setup and Notation 7
2.2 Background for Orion 8
2.3 Background for Hamlet 9
3 Orion: Avoiding Joins Physically 12
3.1 Learning Over Joins 15
3.2 Factorized Learning 19
3.3 Experiments 26
3.4 Conclusion: Avoiding Joins Physically 35
4 Extensions and Generalization of Orion 37
4.1 Extension: Probabilistic Classifiers Over Joins and Santoku 37
4.2 Extension: Other Optimization Methods Over Joins 42
4.3 Extension: Clustering Algorithms Over Joins 44
4.4 Generalization: Linear Algebra Over Joins 46
5 Hamlet: Avoiding Joins Logically 50
5.1 Effects of KFK Joins on ML 53
5.2 Predicting a priori if it is Safe to Avoid a KFK Join 60
5.3 Experiments on Real Data 69
5.4 Conclusion: Avoiding Joins Logically 79
6 Related Work 81
6.1 Related Work for Orion 81
6.2 Related Work for Orion Extensions and Generalization 82
6.3 Related Work for Hamlet 83
7 Conclusion and Future Work 85
References 88
A Appendix: Orion 95
A.1 Proofs 95
A.2 Additional Runtime Plots 97
A.3 More Cost Models and Approaches 99
A.4 Comparing Gradient Methods 102
B Appendix: Hamlet 104
B.1 Proofs 104
B.2 More Simulation Results 106
B.3 Output Feature Sets 111
list of figures
1.1 Example scenario for ML over multi-table data. . . . 2
3.1 Learning over a join: (A) Schema and logical workflow. Feature vectors from S
(e.g., Customers) and R (e.g., Employers) are concatenated and used for BGD.
The loss (F) and gradient (∇F) for BGD can be computed together during a
pass over the data. Approaches compared: Materialize (M), Stream (S), Stream-
Reuse (SR), and Factorized Learning (FL). High-level qualitative comparison of
storage-runtime trade-offs and CPU-I/O cost trade-offs for runtimes of the four
approaches. S is assumed to be larger than R, and the plots are not to scale. (B)
When the hash table on R does not fit in buffer memory, S, SR, and M require
extra storage space for temporary tables or partitions. But, SR could be faster
than FL due to lower I/O costs. (C) When the hash table on R fits in buffer
memory, but S does not, SR becomes similar to S and neither needs extra storage
space, but both could be slower than FL. (D) When all data fit comfortably in
buffer memory, none of the approaches need extra storage space, and M could
be faster than FL. . . . 13
3.2 Redundancy ratio against the two dimension ratios (for $d_S = 20$). (A) Fix
$d_R/d_S$ and vary $n_S/n_R$. (B) Fix $n_S/n_R$ and vary $d_R/d_S$. . . . 18
3.3 Logical workflow of factorized learning, consisting of three steps as numbered.
HR and HS are logical intermediate relations. PartialIP refers to the partial
inner products from R. SumScaledIP refers to the grouped sums of the scalar
output of G() applied to the full inner products on the concatenated feature
vectors. Here, $\gamma_{SUM}$ denotes a SUM aggregation and $\gamma_{SUM}(RID)$ denotes a SUM
aggregation with a GROUP BY on RID. . . . 20
3.4 Analytical cost model-based plots for varying the buffer memory ($m$). (A) Total
time. (B) I/O time (with 100 MB/s I/O rate). (C) CPU time (with 2.5 GHz clock).
The values fixed are $n_S = 10^8$ (in short, 1E8), $n_R$ = 1E7, $d_S = 40$, $d_R = 60$, and
Iters = 20. Note that the x axes are in logscale. . . . 27
3.5 Implementation-based performance against each of (1) tuple ratio ($n_S/n_R$), (2)
feature ratio ($d_R/d_S$), and (3) number of iterations (Iters) – separated column-wise
– for the (A) RSM, (B) RMM, and (C) RLM memory region – separated row-wise.
SR is skipped for RMM and RLM since its runtime is very similar to S. The
other parameters are fixed as per Table 3.3. . . . 29
3.6 Analytical cost model-based plots of performance against each of (A) $n_S/n_R$, (B)
$d_R/d_S$, and (C) Iters for the RSM region. The other parameters are fixed as per
Table 3.3. . . . 30
3.7 Analytical plots for when $m$ is insufficient for FL. We assume $m$ = 4 GB, and
plot the runtime against each of $n_S$, $d_R$, Iters, and $n_R$, while fixing the others.
Wherever they are fixed, we set ($n_S$, $n_R$, $d_S$, $d_R$, Iters) = (1E9, 2E8, 2, 6, 20). . . . 33
3.8 Parallelism with Hive. (A) Speedup against cluster size (number of worker
nodes) for ($n_S$, $n_R$, $d_S$, $d_R$, Iters) = (15E8, 5E6, 40, 120, 20). Each approach is
compared to itself, e.g., FL on 24 nodes is 3.5x faster than FL on 8 nodes. The
runtimes on 24 nodes were 7.4h for S, 9.5h for FL, and 23.5h for M. (B) Scaleup
as both the cluster and dataset sizes are scaled. The inputs are the same as for
(A) for 8 nodes, while $n_S$ is scaled. Thus, the size of T varies from 0.6TB to 1.8TB. . . . 34
4.1 Illustration of Factorized Learning for Naive Bayes. (A) The base tables Customers
(the “entity table” as defined in Kumar et al. [2015c]) and Employers (an
“attribute table” as defined in Kumar et al. [2015c]). The target feature is Churn
in Customers. (B) The denormalized table Temp. Naive Bayes computations
using Temp have redundancy, as shown here for the conditional probability
calculations for State and Size. (C) FL avoids computational redundancy by
pre-counting references, which are stored in CustRefs, and by decomposing
(“factorizing”) the sums using StateRefs and SizeRefs. . . . 37
4.2 Screenshots of Santoku: (A) The GUI to load the datasets, specify the database
dependencies, and train ML models. (B) Results of training a single model. (C)
Results of feature exploration comparing multiple feature vectors. (D) An R
script that performs these tasks programmatically from an R console using the
Santoku API. . . . 40
4.3 High-level architecture. Users interact with Santoku either using the GUI or
R scripts. Santoku optimizes the computations using factorized learning and
invokes an underlying R execution engine. . . . 40
4.4 Results on real datasets for K-Means. The approaches compared are – M:
Materialize (use the denormalized dataset), F: Factorized clustering, FRC: Factorized
clustering with recoding (improves F), NC: Naive LZW compression, and OC:
Optimized compression (improves NC). . . . 46
4.5 Performance on real datasets for (A) Linear Regression, (B) Logistic Regression,
(C) K-Means, and (D) GNMF. E, M, Y, W, L, B, and F correspond to the Expedia,
Movies, Yelp, Walmart, LastFM, Books, and Flights datasets, respectively. The
number of iterations/centroids/topics is 20/5/5. . . . 49
5.1 Illustrating the relationship between the decision rules to tell which joins are
“safe to avoid.” . . . 51
5.2 Relationship between hypothesis spaces. . . . 58
5.3 Simulation results for the scenario in which only a single $X_r \in X_R$ is part of
the true distribution, which has $P(Y = 0 \mid X_r = 0) = P(Y = 1 \mid X_r = 1) = p$. For
these results, we set $p = 0.1$ (varying this probability did not change the overall
trends). (A) Vary $n_S$, while fixing $(d_S, d_R, |D_{FK}|) = (2, 4, 40)$. (B) Vary $|D_{FK}|$
($= n_R$), while fixing $(n_S, d_S, d_R) = (1000, 4, 4)$. . . . 63
5.4 When $q_R^* = |D_{X_r^*}| \ll |D_{FK}|$, the ROR is high. When $q_R^* \approx |D_{FK}|$, the ROR is low.
The TR rule cannot distinguish between these two scenarios. . . . 67
5.5 Scatter plots based on all the results of the simulation experiments referred to
by Figure 5.3. (A) Increase in test error caused by avoiding the join (denoted
“∆Test error”) against ROR. (B) ∆Test error against tuple ratio. (C) ROR against
inverse square root of tuple ratio. . . . 67
5.6 End-to-end results on real data: Error after feature selection. . . . 72
5.7 End-to-end results on real data: Runtime of feature selection. . . . 74
5.8 Robustness: Holdout test errors after Forward Selection (FS) and Backward
Selection (BS). The “plan” chosen by JoinOpt is highlighted, e.g., NoJoins on
Walmart. . . . 75
5.9 Sensitivity: We set ρ = 2.5 and τ = 20. An attribute table is deemed “okay to
avoid” if the increase in error was within 0.001 with either Forward Selection
(FS) or Backward Selection (BS). . . . 75
A.1 Implementation-based performance against each of (1) tuple ratio ($n_S/n_R$), (2)
feature ratio ($d_R/d_S$), and (3) number of iterations (Iters) – separated column-wise
– for (A) RMM, and (B) RLM – separated row-wise. SR is skipped since its
runtime is very similar to S. The other parameters are fixed as per Table 3.3. . . . 97
A.2 Analytical plots of runtime against each of (1) $n_S/n_R$, (2) $d_R/d_S$, and (3) Iters, for both
the (A) RMM, and (B) RLM memory regions. The other parameters are fixed
as per Table 3.3. . . . 97
A.3 Analytical plots for the case when $|S| < |R|$ but $n_S > n_R$. We plot the runtime
against each of $m$, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever
they are fixed, we set ($m$, $n_S$, $n_R$, $d_S$, $d_R$, Iters) = (24GB, 1E8, 1E7, 6, 100, 20). . . . 98
A.4 Analytical plots for the case when $n_S \le n_R$ (mostly). We plot the runtime
against each of $m$, $n_S$, $d_R$, Iters, and $n_R$, while fixing the others. Wherever
they are fixed, we set ($m$, $n_S$, $n_R$, $d_S$, $d_R$, Iters) = (24GB, 2E7, 5E7, 6, 9, 20). . . . 99