Apache Mahout: Beyond MapReduce
Distributed Algorithm Design

Dmitriy Lyubimov
Andrew Palumbo

Dmitriy Lyubimov, Andrew Palumbo. Apache Mahout: Beyond MapReduce. Distributed Algorithm Design. First Edition.
ISBN-13: 978-1523775781
ISBN-10: 1523775785
Library of Congress Control Number: 2016902011
BISAC: Computers / Mathematical & Statistical Software
7" x 10" (17.78 x 25.4 cm) paperback

Copyright © 2016 by Dmitriy Lyubimov and Andrew Palumbo. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without prior written permission of the copyright holders.

Attributions. LaTeX styling in this book is derived from work by Mathias Legrand and Velimir Gayevskiy (http://latextemplates.com) with permission of Velimir Gayevskiy.

Preface

Target Audience.
This book is for machine learning practitioners, algorithm designers, applied researchers, and anyone who likes to experiment with bits and pieces of "mathematically infused" algorithms. It is of course mainly for people who would like to learn about the new Mahout Samsara environment in particular.

The audience could further be divided into two categories: users of Mahout's off-the-shelf capabilities, and people who would like to use Mahout to design their own solutions (algorithm designers). The first edition of this book targets mostly the latter, that is, the people who would like to develop their own machine learning solutions with the help of Mahout Samsara.

Some material assumes an undergraduate-level understanding of calculus and, on occasion, multivariate calculus. The examples are given in Scala and assume familiarity with basic Scala terminology and a minimal notion of functional programming.

What is different?
First, at the time of this writing, much of the prior printed matter on Mahout is focused on the Apache Hadoop MapReduce-based capabilities of Mahout.

As of Mahout 0.10.0, all the MapReduce algorithms officially entered their end-of-life phase. They were sweepingly deprecated, and the project has not been accepting any new MapReduce algorithms since.
Strictly speaking, the announcement of the MapReduce end-of-life in Mahout came even earlier, in the first half of 2014, as a news post on the Mahout website and the Jira issue MAHOUT-1510.¹

In place of Hadoop MapReduce, Mahout has been focusing on implementing a flexible and backend-agnostic machine learning environment. Mahout is targeting Apache Spark and Apache Flink as the main execution backends, and there is also support for Mahout on the H2O engine. The new programming API is completely based on a Scala DSL, which is mostly an algebraic DSL. Mahout Samsara is the code name for the aggregate of all these new features working together from the 0.10.0 release on.²

To the best of our knowledge, at the time of this writing this is the only book that is devoted solely and in depth to Mahout Samsara.

Second, much of the previous printed matter on Mahout targets the aspect of off-the-shelf usage of Mahout algorithms. While this book has some end-to-end off-the-shelf tutorials too, its main target is machine learning algorithm design using the Mahout Samsara environment.

Third, this book is also different from many traditional books on Machine Learning in that it (almost) never re-traces any mathematical inference of the end-game math formulations. There are no lemmas, theorems or proofs supporting the mathematical outcomes of the methods we consider. This is not the focus of this book. Instead, we give references describing the algorithms. These references provide all the necessary details we omit. We then focus on turning the mathematical description of a method into program code.

The philosophy of the book.
In many ways in this book we try to do the same thing that a lot of computer science books do: we teach design principles and hands-on coding of algorithms. Except instead of coding algorithms expressed in pseudocode, we actually code mathematical formulas. Formulas are also just a form of an algorithm. The main difference is merely the algorithm notation. For that reason, this book has some amount of mathematical equations, simply because we have to outline the final formulation of the methodology we code before we show how to code it.
As has been said previously, though, we do not go into the details of method inference.

The skills taught in this book allow the reader to implement a significant variety of existing Machine Learning approaches on his or her own, so that he or she is not bound by an algorithm's off-the-shelf availability. Implementing a machine learning algorithm is still largely an art rather than a recipe. Nevertheless, we aspire to demonstrate that with Samsara, the ease and maintainability of such an effort approaches the ease and maintainability of using numerically friendly platforms such as MATLAB or R. One often-used measure of such performance is the number of lines in a program. Most, if not all, code examples in this book should not exceed a single page (assuming all comments and line skips are removed).

We also provide help with math notational fluency. By math fluency we mean the ability to parse mathematical expressions and see coding patterns in them.

¹ Some of the first elements of the Mahout Samsara Scala DSL bindings were present as early as the Mahout 0.9 release.
² Chapter 1 focuses on the definition of Mahout Samsara in more detail.

For computer science people who perhaps feel a bit intimidated by math notations, we have just one piece of advice; for the purposes of practical applications, remember this: a math formula is just an algorithm; math notation is just pseudocode. If you spend a few minutes going over the notation conventions in §A.2, you will be able to read the pseudocode in this book, and after that, you will know the drill.

At times, for a practitioner it is pragmatic to accept that "that is how the math works out," to be able to map it to code, and to debug and experiment quickly, rather than to track all origins of the final formulation. As long as an engineer is able to parse and codify the math notations, in-depth understanding of the theory and method inference may be pragmatically less important.

We aspire to achieve these goals by providing various hands-on patterns of translating simple formulations into distributed Samsara code, as well as more involved examples where these patterns connect into larger, coherent algorithm code.
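To give an early taste of this formula-to-code translation, here is a minimal sketch in the in-core Samsara DSL. It assumes only that Mahout's math-scala module is on the classpath; the matrix values are made up purely for illustration:

```scala
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import RLikeOps._

// A small dense in-core matrix (arbitrary illustrative values).
val a = dense((1.0, 2.0), (3.0, 4.0))

// The column-mean formula, mu = (1/m) A'1, is a one-liner:
val mu = a.colMeans()

// The Gramian A'A maps almost verbatim onto the DSL:
val gram = a.t %*% a
```

The point is that each DSL expression reads nearly the same as the math notation it implements, which is exactly the mapping skill this book practices.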
Like many texts in Computer Science, we employ the "learning by example" approach. Thus, while learning the mathematical details of the method described by the BFGS example in chapter 3 is educational, the main purpose of that description is not to teach the BFGS underpinnings, but rather to illustrate a variety of hands-on techniques while treating potentially any algebraic approach in the context of the in-memory Mahout Samsara DSL (also known as "Scala Bindings").

To recap, in this book we do the following:
• We take off-the-shelf math, for which we provide in-depth references.
• We provide math notation explanations for the final method formulation.
• We explain math-to-code mapping patterns.
• We give tips for working with distributed and non-distributed math.
• We also give a basic design narration about the behind-the-scenes architecture of the Mahout Samsara framework.
• We give one end-to-end extended classification tutorial.
• Finally, in the appendices we provide a reference for the R-like Mahout DSL.

The example code in this book can be found at: https://github.com/andrewpalumbo/mahout-samsara-book.

Acknowledgments.
We would like to thank Andrew Musselman, Gene Linetsky, Nathan Halko, Suneel Marthi and Tony Jebara (in alphabetical order), who provided preprint reviews, corrections, insights and ideas for this book.

We extend our cordial thanks to Nathan Halko, who helped the Mahout team with the MapReduce version of Stochastic SVD, and to everyone involved with him in the random projection study [Halko et al., 2011]. We would like to thank Sebastian Schelter for his extensive help with and support of the early Mahout Samsara ideas; Anand Avati for his extensive and incredibly talented contribution to the physical layer translation for the H2O backend; and Alexey Grigorev, Kostas Tzoumas, and Stephan Ewen for their ongoing effort to provide physical translation for the Apache Flink backend in Mahout.

We would like to thank our families for their much-needed support and for tolerating the long hours we had to put towards this book's completion.

And of course we would like to thank all the contributors to Mahout – past and present.

Disclosures.
In this book, when we say "Mahout" or "Mahout Samsara," we refer to the Apache Mahout project, release 0.10.0 or later. Parts of this book are updated for Apache Mahout releases 0.10.2 and 0.11.0 and may not be backwards compatible with earlier releases.

When we say "Flink," we refer to the Apache Flink project. When we say "Spark," we refer to the Apache Spark project, release 1.2.x or later. When we say "GraphX" or "MLlib," we refer to the respective subcomponents of the Apache Spark project.

Some parts of this book further develop concepts contained in the working notes that were licensed by the authors to the Apache Software Foundation under a non-exclusive Apache ICLA.

Some examples contain code based on code that had been previously licensed by the book authors to the Apache Software Foundation under a non-exclusive Apache ICLA.

Although the authors have taken every effort in the preparation of this book, there is no warranty, express or implied, on the information offered in this book. Neither the authors nor the publisher will be held liable for any damages caused or alleged to be caused by the information contained in this book.

See also "Attributions" on the copyright page.

SF BAY AREA    Dmitriy Lyubimov
NEW YORK       Andrew Palumbo
May, 2015

Contents

Preface ......................................... i

I  First steps

1  Meet Mahout 0.10+ ("Samsara") ............... 3
2  Setting things up ............................ 7
   2.1  Compiling from source ................... 8
   2.2  Running the Mahout Spark shell .......... 10
   2.3  Using Mahout in a Spark application ..... 12
   2.4  Kicking the tires with a unit test: fitting a ridge regression ... 16

II  Coding with Mahout

3  In-core Algebra ............................. 27
   3.1  Tensor types ............................ 29
   3.2  Matrix views ............................ 30
        3.2.1  TransposedMatrixView ............. 30
        3.2.2  Block views ...................... 31
        3.2.3  Functional matrix views .......... 32
   3.3  Iterating over tensors .................. 32
        3.3.1  Row-wise or column-wise iterations ... 32
        3.3.2  Iterating over a vector .......... 34
   3.4  Assignments ............................. 34
        3.4.1  Assignment operators ':=' and '::=' ... 34
        3.4.2  "Magic" Scala assignment '=' vs. ':=' ... 35
        3.4.3  Associativity of assignment operators ... 36
        3.4.4  Using functional in-place assignments ... 37
   3.5  Avoiding forming intermediate results ... 37
   3.6  Matrix flavors .......................... 39
   3.7  Parallelizing in-core algorithms ........ 40
        3.7.1  Example: parallel colMeans, rowMeans using Scala parallel collections ... 42
        3.7.2  Example: parallel matrix multiplication ... 42
   3.8  Pitfalls ................................ 46
        3.8.1  Implicit side effects of shared reference ... 46
        3.8.2  The caret operator ('^') ......... 47
   3.9  Example: Designing BFGS ................. 48
        3.9.1  Overview ......................... 48
        3.9.2  Quick background on BFGS ......... 48
        3.9.3  Implementation of BFGS iteration ... 51

4  Distributed Algebra ......................... 57
   4.1  Synergy of the in-core and distributed operations ... 58
   4.2  Distributed Row Matrix .................. 58
        4.2.1  Row key contracts ................ 59
        4.2.2  DRM persistence and serialization ... 60
        4.2.3  In-memory DRM representation ..... 60
        4.2.4  Integral type row keys ........... 62
   4.3  Initializing Mahout distributed context running on Spark ... 62
   4.4  The distributed optimizer architecture ... 63
   4.5  Logical Expressions and Checkpoints ..... 65
        4.5.1  Optimizer actions and checkpoints ... 65
        4.5.2  Checkpoint caching ............... 68
        4.5.3  Implicit checkpointing ........... 69
        4.5.4  Checkpoint masking ............... 70
        4.5.5  Computational actions ............ 70
   4.6  Custom pipelines on matrix blocks ....... 71
        4.6.1  The mapBlock() operator .......... 71
        4.6.2  The allreduceBlock() operator .... 73
        4.6.3  Effect of block operators on missing rows ... 74
   4.7  Attaching non-algebraic pipelines ....... 74
   4.8  Broadcasting vectors and matrices to the backend closures ... 75
   4.9  Parallelizing distributed algorithms .... 77
   4.10 The hidden powers of distributed transposition ... 80
   4.11 Untangling a1⊤, A1, and A11⊤ ............ 82
        4.11.1  Untangling the A1 and A⊤1 ....... 83
        4.11.2  Untangling a1⊤ .................. 83
        4.11.3  Untangling A11⊤ ................. 84
   4.12 Example: distance and distance-based kernel matrices ... 85
        4.12.1  Computing a distance matrix ..... 85
        4.12.2  Computing a kernel matrix ....... 87
   4.13 Example: scaling a dataset .............. 88
   4.14 Cache side effects of the mapBlock() operator ... 91
   4.15 Example: testing regression slopes ...... 91

III  Approximating Distributed Problems

5  Stochastic SVD ............................. 105
   5.1  The SSVD Problem ....................... 106
   5.2  Quick background behind the Stochastic SVD ... 106
        5.2.1  Johnson-Lindenstrauss lemma ..... 106
        5.2.2  Capturing the subspace with largest variances ... 107
        5.2.3  The power iterations ............ 108
        5.2.4  The formulation of the Mahout SSVD algorithm ... 108
        5.2.5  Coding the in-core SSVD version ... 108
   5.3  Cholesky QR and its application for thin QR ... 112
        5.3.1  The formulation of the Cholesky QR decomposition ... 112
        5.3.2  A distributed thin QR decomposition ... 112
   5.4  Distributed SSVD ....................... 114
   5.5  Folding in new values and incremental SVD ... 114
   5.6  LSA and its term/document spaces ....... 117
   5.7  Why all the trouble .................... 118
   5.8  Where to next .......................... 118

6  Stochastic PCA ............................. 119
   6.1  The PCA problem ........................ 120
   6.2  The "standard" solution ................ 121
   6.3  Stochastic PCA ......................... 123
        6.3.1  The problem of "big data" PCA ... 123
        6.3.2  The Stochastic PCA algorithm .... 124

7  Data Sketching with Bahmani sketch ......... 131
   7.1  The sketch problem ..................... 132
   7.2  The Bahmani sketch formulation ......... 132
   7.3  Coding the Bahmani Sketch with Samsara ... 133