
Distributed Inference Algorithms for Machine Learning Alexander J. Smola and David G. Andersen PDF

23 Pages·2012·0.34 MB·English


RI:Small:Collaborative Research: Distributed Inference Algorithms for Machine Learning
Alexander J. Smola and David G. Andersen

Our goal is to design, analyze, and implement novel inference algorithms that take advantage of emerging hardware paradigms to enable learning on and mining of massive datasets. In particular, we commit to: a) developing new optimization algorithms for machine learning; b) analyzing their convergence properties; and c) releasing open-source code for all our algorithms.

We base this research upon four likely shifts in the designs of datacenters of the future:

1. Lightweight CPUs for power efficiency will become common in server farms to take advantage of the up to 10x advantage in power efficiency yielded by CPUs such as those found in mobile devices.
2. Heterogeneous architectures: homogeneous multi-core CPUs are already being replaced by combinations of special purpose units such as CPUs + GPGPUs (graphics processors).
3. Solid state storage: SSDs will become more prevalent in server centers, increasing the rate of random read operations by 1,000 to 10,000 times relative to hard drives.
4. High bisection bandwidth networks will increasingly replace traditional aggregation trees, enabled by merchant silicon and software-defined networking.

Intellectual Merit: Many learning algorithms fail to take advantage of these changes. Moreover, many excellent single-machine codes exist that would greatly benefit from parallelization, yet re-engineering each of them is infeasible. The main thrust of our proposed work is to develop a toolkit for designing (and retrofitting) the next generation of systems-aware, efficient, scalable inference algorithms. To keep the research relevant to practitioners, we will leverage collaborations with Intel, Calxeda, and Google to test our ideas on reference hardware and real-life datasets. Our deliverables will lay the foundation for applying machine learning in server centers of the future.
Towards this end, we will design and analyze parameter distribution algorithms, fault tolerant replication schemes, and asynchronous optimization algorithms. While some of these problems are understood in a piecemeal fashion in domains such as systems, networking, and databases, they are novel in the context of statistical data analysis. Our proposal, which is a collaborative effort between PIs with complementary expertise, takes a holistic approach to the problem and draws on this confluence of systems research, optimization and machine learning.

Broader Impact: Different researchers and research groups in machine learning are developing their own seemingly disparate models specialized for their particular application area, often with single machine or single threaded codes. By tackling distributed inference and parameter distribution we can potentially benefit many of these applications. Having an abstract glue layer via a parameter server will enable retrofitting many existing efficient single-machine algorithms for parallel inference. We will actively strive to build a vibrant community in academia and industry around the released code by adopting a public, open development model. Moreover, our open source platform will serve as a teaching tool and we will organize summer schools and workshops focused on scalable machine learning. We also plan to develop a new Scalable Data Analysis course which will be disseminated online.

Keywords: Machine learning, Optimization, Mobile Processors, SSDs, Fault Tolerance.

1 Introduction

Our goal is to design, analyze, and implement novel inference algorithms that take advantage of emerging hardware paradigms to enable analysis and mining of massive datasets.

What are the limitations of existing machine learning algorithms?
Existing algorithms assume that data either can be loaded into main memory or can be scanned repeatedly [25, 54, 55, 90–92]. Others lack fault tolerance on large scale deployments [54, 55], or their fault-tolerant deployments are often not very efficient [29]. Furthermore, many popular inference codes are still sequential, restricted to a single machine [28, 41, 49]. Yet others are parallel but heavily optimized for specific problems [76]. We aim to build a toolkit that is distributed, fault tolerant, and efficient, and that allows for re-use of legacy code.

Can advances in systems and hardware help? Computer hardware and networks are undergoing dramatic changes, and we believe that server centers of the future will be fundamentally different. This requires redesigned algorithms.

Power: many server farms will be made up of large numbers of lightweight CPUs, like the ones found in mobile devices, due to a 10x advantage in power efficiency. The past 5 years have seen a revolution in energy efficiency in mobile devices, motivated by severely limited battery resources. This selective pressure has led to highly efficient micro-architectures such as the ARM A9 and A15 designs. Several companies, such as Calxeda,¹ Marvell,² and AMD³ are developing server processors based on them.

Architecture: homogeneous multicore processors are already being replaced by heterogeneous combinations of special purpose units such as CPUs (central processing units) + GPGPUs (general purpose graphics processing units). For instance, by now most major chip manufacturers sell processors with integrated graphics cores⁴ for laptops and desktops. This progress is fueled by the need for rich media content and games on power efficient consumer grade hardware, and will eventually trickle down to the server farms.

Storage: solid state drives (SSDs) will become more prevalent in server centers, and this will allow us to increase the rate of random read operations by 1,000 to 10,000 times relative to hard drives.
These rates are likely to improve further since SSDs benefit from the common advances in microchip process technology.

Networks: traditional hierarchical aggregation trees in server centers will be replaced by homogeneous high bandwidth configurations using software defined networking. This is a necessity since the number of machines is increasing dramatically while the fan-in of conventional network switches has essentially stagnated over the past decades. Progress in network design has led to clusters that are practically flat and fully connected [35].

¹ http://www.calxeda.com/technology/products/processors/
² http://www.marvell.com/embedded-processors/armada-xp/
³ http://www.amd.com/us/aboutamd/newsroom/Pages/fact-sheet-2012oct29.aspx
⁴ http://www.intel.com/content/www/us/en/architecture-and-technology/hd-graphics/

Unfortunately, existing learning algorithms are not well suited to take advantage of these changes. We propose to develop the next generation of systems-aware, efficient, scalable optimization algorithms which can be applied to machine learning tasks.

Why the focus on optimization? Optimization is at the heart of many machine learning algorithms. While different machine learning algorithms can appear dissimilar on the surface, the underlying optimization problems are often strikingly similar. By understanding these similarities, we can focus on the core problems that will lead to good scaling of most machine learning methods. The problems faced by distributed optimization often closely resemble those in distributed sampling or related inference methods. In other words, we will use optimization as a stand-in testbed to develop distributed inference techniques, with the understanding that many of our techniques can also be carried over to methods like sampling.

Why do we need specialized optimization algorithms? Off-the-shelf optimization routines are typically designed for dealing with general objective functions and constraints.
That is, they typically fail to take advantage of the special properties inherent in many machine learning problems (e.g., streaming observations from disk, prior knowledge in problem partitioning). This means that while they excel for prototyping small applications, their runtime on massive datasets with billions of variables and observations is typically considerable. However, the objective functions encountered in machine learning usually have a well-defined structural form. Because the underlying optimization problem presents a computational bottleneck, significant performance gains can be realized by developing specialized algorithms that exploit the structure in the objective function.

Why do we need yet another parallel toolkit? Our proposal describes a modular architecture that allows us to parallelize a large family of existing algorithms without dramatic modification of the single-machine component. This is achieved by means of a parameter server that decouples computation from synchronization in a fully asynchronous fashion. A key benefit is that this often leads to efficient integration of legacy codes. No existing toolkit provides this functionality at present. Many recent parallel processing frameworks either lack higher order update semantics [69] or they do not allow random writes in a large and complex state space [90]. Nonetheless a toolkit is desirable since improvements on the synchronization side will immediately translate into improvements for all the algorithms using it.

In summary, the proposal addresses a need manifest in machine learning both in industry and in academia: the need for flexible and scalable optimization algorithms that are robust, take advantage of modern hardware, provide fault tolerance, are easy to deploy, and that allow users to upgrade legacy codes without a complete redesign. In other words, we aim to commoditize the difficult part of parallelization for machine learning by providing a distributed synchronization layer.
We commit to the following deliverables:

1. Optimization algorithms for machine learning that exploit recent advances in hardware. Our proposed algorithms are distributed, asynchronous, and fault tolerant.
2. Theoretical analysis and convergence bounds for the algorithms. This will extend current results in variable decomposition methods to obtain faster rates beyond dual decomposition.
3. An open-source platform for the code developed in this project. In particular, it will provide a fault tolerant parameter server and distributed inference controller that enables rapid deployment of our new algorithms as well as parallelization of existing legacy codes.

2 Background

Many machine learning problems can be interpreted as penalized risk minimization; consequently optimization provides a useful unifying thread. For instance, regularized risk bounds and maximum-a-posteriori estimates [44, 56, 61, 63, 74, 82] are a staple of statistical estimation. Furthermore, variational bounds extend this reasoning to cases where one would otherwise more commonly employ sampling [16]. A benefit is that it endows the problems with a well-defined objective function that makes comparison of algorithms fairly straightforward.

This is clearly not the only way to approach statistical inference. For instance, sampling is a formidable tool in statistics [31]. However, convergence analysis tends to be more easily achievable for optimization than for finding mixing guarantees for sampling algorithms. Moreover, the problems encountered in optimization are also prevalent in sampling (e.g., synchronization of sufficient statistics). Hence we limit ourselves to discussing optimization algorithms for machine learning.

2.1 Machine Learning

To illustrate the problems incurred in large scale inference we give a cartoon view of machine learning. It is understood that the reality is considerably more varied and complex.
A large class of estimation problems can be viewed as regularized risk minimization:

\[ \underset{w}{\text{minimize}}\; R(w) \quad\text{where}\quad \underbrace{R(w)}_{\text{Regularized risk}} := \lambda \underbrace{\Omega(w)}_{\text{Regularizer}} + \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, w)}_{\text{Empirical risk}}. \tag{1} \]

Here $w$ is the parameter of the model, $x_i \in X \subseteq \mathbb{R}^d$ are the training instances, $y_i \in Y$ are the corresponding labels, and $l$ is the loss function. The trade-off between the regularizer and the risk is determined by the constant $\lambda > 0$. Four aspects are worth considering:

• There is no guarantee that $l$ or $\Omega$ are convex. However, quite often $\Omega$ is convex and $l$ is convex in a subset of variables provided that the remainder is fixed. Examples include collaborative filtering problems where the objective function is convex in the inner product between subsets of the parameter vectors [12] or in topic models [17] where the objective is convex in the document and topic specific parameters respectively.
• $\Omega$ may have a nontrivial structure in terms of subsets of variables interacting. This typically can be represented by a sparse graph [11, 57, 64]. Usually $\Omega$ is convex, but not always differentiable, e.g., in the case of $\ell_1$-based regularization [15].
• The subset of coefficients interacting in any given term in $l(x_i, y_i, w)$ can be small compared to the dimensionality of $w$. This occurs, e.g., in text analysis where only a small number of words out of a much larger dictionary occurs in any given document [40].
• The set of coefficients can often be decomposed into subgroups with different access and locality characteristics [89]. This occurs, for example, when mixing global with local personalized models for mail filtering.
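To make the objective in (1) concrete, the following is a minimal sketch rather than anything taken from the proposal. It assumes a logistic loss for $l$ and $\Omega(w) = \frac{1}{2}\|w\|^2$ as the regularizer; the names `logistic_loss`, `squared_norm`, `regularized_risk` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def logistic_loss(x, y, w):
    """Assumed loss l(x, y, w) for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * (x @ w)))


def squared_norm(w):
    """Assumed regularizer Omega(w) = 0.5 * ||w||^2."""
    return 0.5 * (w @ w)


def regularized_risk(w, X, Y, lam, loss=logistic_loss, omega=squared_norm):
    """R(w) = lam * Omega(w) + (1/m) * sum_i l(x_i, y_i, w), as in (1)."""
    m = X.shape[0]
    empirical_risk = sum(loss(X[i], Y[i], w) for i in range(m)) / m
    return lam * omega(w) + empirical_risk


# Toy data: three instances in R^2 with labels in {-1, +1}.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, -1.0, 1.0])

# At w = 0 the regularizer vanishes and every loss term equals log(2).
risk_at_zero = regularized_risk(np.zeros(2), X, Y, lam=0.1)
```

Because the empirical risk is a plain average over instances, the same function could be evaluated on disjoint shards of `X` and `Y` and the partial sums combined afterwards.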
This means that in many cases we can rewrite the regularized risk functional as a sum over functions involving subsets of variables, i.e.,

\[ R(w) = \sum_{C \in \mathcal{C}} R_C(w_C). \tag{2} \]

Here $C$ are maximal cliques of coordinates in $w$ and $\mathcal{C}$ denotes the set of maximal cliques associated with $R(w)$. Moreover, $w_C$ is the restriction of $w$ to the clique $C$. This means that the problem can often be decomposed such that each part only involves subsets of variables. Notably, however, any variable may occur in multiple cliques: the cliques are not disjoint. Such decompositions are also typical in graphical models [67]. The Hammersley-Clifford decomposition [14, 37] provides further examples of this situation. That is, (2) occurs in both directed and undirected graphical models. Many efficient algorithms in this context exploit the generalized distributive law [5].

Equations (1) and (2) characterize the tension in solving optimization problems in machine learning: The first form (1) can be optimized for efficient data access, e.g., minimization by stochastic gradient descent when streaming data sequentially from disk [49, 72]. The second form (2) instead allows for local computation that operates only on a subset of variables [54]. The resulting subproblems may be considerably easier to solve in this case, at the expense of nonuniform data access. Furthermore, if we distribute subproblems over several machines, we need to keep overlapping variables synchronized [76].

2.2 Partitioning

To accelerate computation relative to a single machine, we must partition the problem into parts that can be processed on many machines. There are four common strategies for handling this:

1. Naive observation partitioning. Each machine updates its local copy of all parameters based upon results from its own partition of the observations. It provides high bandwidth to read the observation data, and can store large amounts of this data spread across many machines.
Unfortunately, storing the full parameter vector $w$ on each machine may require more memory than is available, particularly when harnessing energy-efficient nodes.
2. Parameter-limiting partitioning. This improves upon the naive approach by distributing data judiciously such that the number of parameters per machine is limited. This works, e.g., in content personalization problems where many parameters can be kept machine local [45, 89].
3. Clique decomposition. We may decompose $R_C(w_C)$ directly and partition by the set of maximal cliques. A possible side-effect is that data distribution may have overlapping parts.
4. Joint partitioning. This partitions data and interactions into subgroups. That is, we compute updates on the variables in several machines separately and aggregate within each block.

The three improved strategies amount to a row, column and biclustering-type partitioning of the model.

Parameter-limiting partitioning. The risk in (1) is computed by averaging the loss over the training instances. This implies that gradients, too, can be computed by a distribute & average operation, e.g., via MapReduce [22]. We divide the training instances into as many subsets as there are processors, give each processor its share of the data to compute the loss and its gradient (the map operation), and average the results at the end (the reduce operation). This is computationally attractive only if we have convex optimization problems and the data naturally decomposes into a shared and a local part, where the size of the local parameter space dwarfs any shared terms (for simple problems). This applies, e.g., to personalized spam filtering [89].

In general, however, data partitioning is challenging since not all data cleanly decomposes into sets of bounded range of interaction. For instance, when annotating user activities and preferences we have implicit interactions between users that frequent the same locations.
Similarly, topic models implicitly create interactions between documents via the topics they contain.

Clique Decomposition. A second way to solve the regularized risk minimization problem is to exploit the decomposition of $R(w)$ in (2) and employ many machines to solve the sub-problems. Denote by $\mathcal{C}_i$ a subset of cliques such that

\[ \bigcup_i \mathcal{C}_i = \mathcal{C} \quad\text{and let}\quad R_i(w) := \sum_{C \in \mathcal{C}_i} R_C(w_C). \tag{3} \]

That is, we decompose $R$ into $R = \sum_i R_i$ where each $R_i$ depends only on the union of all coordinates occurring in $\mathcal{C}_i$. Depending on the problem type this set may be considerably smaller than the full set of nonzero variables. For instance, for movie recommendation, we could partition the set of observations by users. Here each subset of data containing only parameters of a small set of users is amenable to separate optimization. However, the set of movie-related parameters is largely shared between all partitions. Minimizing partitioning cost can be formalized as follows:

\[ \underset{\{\mathcal{C}_i\}}{\text{minimize}}\; f(\{C_i\}) \quad\text{where}\quad C_i = \bigcup_{C \in \mathcal{C}_i} C \quad\text{subject to}\quad \bigcup_i \mathcal{C}_i = \mathcal{C}. \tag{4} \]

Here $C_i$ denotes the subset of coordinates that are required in partition $i$, as arising from the clique partitioning into subsets $\mathcal{C}_i$, and $f$ is a monotonic function in $|C_i|$. The following special cases illustrate how one may want to choose $f$.

• Choosing $f(\{C_i\}) = \max_i |C_i|$ means that we are trying to minimize the maximum number of variables in any given partition. This is related to minimizing the vertex cut of a graph, where one attempts to find (balanced) cuts such that the number of neighboring vertices per partition is small. In our setting this goal is desirable since the amount of memory required for each partition is linear in $|C_i|$.
• Choosing $f(\{C_i\}) = \sum_{i \neq j} |C_i \cap C_j|$ minimizes the number of edges between partitions. This is related to finding a (balanced) edge cut of a graph.
In the context of optimization this is related to the amount of network traffic a naive partitioning and synchronization requires.

Several such partitioning systems have been used in the past. For instance, [88, 94] partition the data according to users. This means that the overlap in $C_i$ only contains movie parameters. [3, 76] partition by documents or users. Due to the increased sparsity in the data this restricts the overlap further to only frequently used items. [34] discuss both greedy and random partitioning for natural graphs. Finally, [30] discuss a blocked decomposition of variables with intermittent synchronization steps. None of the aforementioned heuristics comes with tangible optimality guarantees. It is therefore desirable to tackle the partitioning problem systematically. A promising strategy is to view (4) as a submodular load-balancing problem [78].

Block Partitioning. Finally, one may combine both strategies to partition the set of influence variables into subsets such that parts of the data are kept locally while also only keeping parts of the variables locally. Recent work by [66] shows that this can be efficient in terms of dealing with functions involving many variables. Moreover, the implementation of [47] deals with practical issues of messaging when using block decomposition on disk. Note that none of these strategies are particularly efficient in terms of automatic graph partitioning. There is an opportunity to obtain more efficient algorithms accordingly.

2.3 Distributed Systems

Problem partitioning is only the first part of the challenge in solving large-scale problems. We need to decide how to solve the individual subproblems efficiently, and how to distribute the storage and update of the shared parameters (e.g., the overlapping nodes among the cliques in $\mathcal{C}$).

Of particular importance is fault tolerance: our proposal aims to generate solutions that are practically relevant in the large scale setting.
However, large numbers of computers almost inevitably exhibit failures, or more likely, jobs may be preempted or even terminated by the cluster manager (e.g., on Amazon spot instances). Hence fault tolerance is not an optional but a necessary attribute. Key to fault tolerance is the ability to recover state upon the failure of a node. This recovery may take place by redundancy (replicating the state on multiple machines) or by re-execution. There are three key questions that underlie the design of a fault-tolerant system:

• Is it possible to recompute lost data incrementally, i.e., without restarting from scratch?
• If so, what is the cost to recompute the data?
• What consistency is required between the primary copy of the data and any replicas?

Two extremes help illustrate the designs in this space. First is MapReduce, in which computations are forced to be both individually idempotent and only synchronized at the end of a full iteration of Map and Reduce. As a result, work issued to a single worker node can be re-executed upon failure without re-computing any other data. Second is traditional high-performance computing applications, e.g., finite element simulation, which execute computations in lockstep at a fine granularity with a completely consistent view of the shared state between machines. Fault tolerance in HPC applications is typically achieved by pausing the entire system synchronously, taking a consistent snapshot to stable storage, and resuming execution. This adds very high overhead.

Attractively, this design question is directly related to the strategies of Section 3.1, in which our goal is to find solutions for optimization that are either partitionable or that can operate asynchronously. In general, strategies for accelerating distributed computation also ease the requirements for fault tolerance.
For partitionable problems (sub-problems can be solved independently and then combined at the end) it suffices to provide fault tolerance for each sub-problem independently. A failure requires re-executing or recovering a single sub-problem, but not the global instance. For a problem that can be executed asynchronously, we can typically relax the consistency required of replication, taking advantage of more “sloppy” asynchronous replication strategies. This is useful because optimization algorithms already have a built-in tolerance to inaccurate partial solutions.

The replication strategies we plan to bring to bear (as appropriate given the algorithmic requirements) are, from “strongest” (but most expensive) to “weakest” (and therefore cheapest): globally consistent snapshotting; individual node consistent replication using high-performance Paxos variants; causally consistent replication of updates; and “eventually” consistent replication (i.e., almost no consistency guarantees at all). Of these, the causally consistent option bears further mention: It is the strongest consistency model that can be provided without requiring synchronous communication between the replicas, but it can still (in many cases) maintain consistency invariants relating to the order and grouping of updates. For example, under causal consistency, it is possible to ensure, for two updates $u_1$ and $u_2$ generated by the same node, “do not apply update $u_2$ to $w$ until update $u_1$ has been applied.”⁵

⁵ We plan to harness for this PI Andersen’s prior expertise in both strongly consistent key-value stores [8] and scalable, high-performance causally consistent key-value stores [53].

2.4 Hardware

As outlined in Section 1, a key driver for the current research proposal is the dramatic changes that hardware is currently undergoing in terms of power, architecture, storage and networks. In practice this means that our algorithms must exhibit strong scaling behavior, i.e., the ability to operate on large numbers of resource constrained processors.
Such lightweight systems are relatively constrained in terms of RAM. However, this problem is mitigated by the fact that SSDs (solid state drives) offer high speed out-of-core storage. This is advantageous whenever the data has nonuniform access properties, as is common on natural data [26]. For instance, we may only want to keep the most frequently used keys in memory and prefetch the remainder on demand to hide latency. This requires algorithms that are able to re-arrange data.

For power-law distributions, frequencies of occurrence follow $O(x^{-a})$ for some exponent $a$ (e.g., $a > 2$ for English). In this case, the probability of seeing any item of rank $x$ or higher is $O(x^{1-a})$. With suitable constants this means that caching the $10^5$ most frequent items would lead to a miss rate of $10^{-4}$. Given a rate of $10^5$ IOPS on modern SSDs, this means that we can handle at least $10^9$ operations per second when combining RAM and SSDs. While quite obvious, few machine learning algorithms currently take advantage of these basic properties.

Secondly, it is a commonly stated misperception that machine learning algorithms are disk and IO-bound. Indeed, when designing algorithms geared towards loading data, processing it once and then discarding the observations, the network and disk interfaces are considerably slower than what the main memory interface offers. To make things more explicit we list a range of systems below:

            Infiniband  Ethernet  Disk      SSD       RAM      Cache
Capacity    n.a.        n.a.      3 TB      512 GB    16 GB    16 MB
Bandwidth   1 GB/s      100 MB/s  200 MB/s  500 MB/s  30 GB/s  100 GB/s
IOPS        10^6        10^4      10^2      10^5      10^8     10^9

This means that any algorithm processing data only once will almost inevitably be data-starved since memory and processors are significantly faster. A simple means to address this is to reuse instances that have been loaded into memory more than once before evicting them. This instantly rebalances data access and processing. Preliminary results in [59] indicate that exploiting this property can yield significant savings in runtime.
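The caching arithmetic in Section 2.4 can be checked with a short sketch. It is illustrative only: `tail_mass` assumes an exact discrete power law over a finite, hypothetical vocabulary, and `effective_ops_per_second` assumes that every cache miss costs exactly one SSD read and that hits are free. Neither function name comes from the proposal.

```python
def tail_mass(cache_size, exponent, n_items):
    """Miss rate when the `cache_size` most frequent items are held in RAM.

    Assumes access frequencies p(x) proportional to x**(-exponent) for
    ranks x = 1..n_items (a hypothetical finite vocabulary).
    """
    weights = [rank ** (-exponent) for rank in range(1, n_items + 1)]
    return sum(weights[cache_size:]) / sum(weights)


def effective_ops_per_second(ssd_iops, miss_rate):
    """Operations per second if only misses touch the SSD (assumption)."""
    return ssd_iops / miss_rate


# Figures quoted in the text: 1e5 IOPS and a miss rate of 1e-4.
throughput = effective_ops_per_second(1e5, 1e-4)

# Caching more of the head of the distribution lowers the miss rate.
small_cache_miss = tail_mass(cache_size=10, exponent=2.0, n_items=10_000)
large_cache_miss = tail_mass(cache_size=100, exponent=2.0, n_items=10_000)
```

With the quoted figures, `throughput` comes out at about $10^9$ operations per second, which is the combined RAM and SSD rate stated in the text.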
3 Proposed Work

We propose to advance the state of the art in the following aspects:

Infrastructure: We aim to build an open-source distributed (key, vector) storage system for optimization between multiple machines. This constitutes the systems component of a parallel inference toolkit. Fault tolerance, replication, key distribution, and self repair will be studied.

Optimization: We will use the above architecture to solve distributed optimization problems. Our general strategy is to extend dual decomposition methods to incorporate second order asynchronous updates in the dual outer loop.

Partitioning: The above problems require good partitioning algorithms. At present there are no good established techniques for scalable graph partitioning. That is, while there exist plenty of tools that minimize the number of edges cut in partitioning [10, 32, 43], this is not entirely the case for decompositions that are efficient for memory and computation.

Integration: A key point of our design is to allow for easy integration of third party implementations. Codes such as Vowpal Wabbit (VW) (http://hunch.net/~vw) or LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) are highly effective single machine solvers. Individually engineering them to allow for parallelization is nontrivial. That said, variable decomposition allows for efficient problem distribution.

Applications: These serve as a testbed for large scale inference. Due to the targeted problem size, we mainly aim to address data-descriptive and unsupervised problems in terms of large amounts of data, such as topic modeling, factorization, and recommendation problems.

3.1 Optimization

To perform distributed inference we have essentially two alternatives:

Partition and Merge: We partition the optimization problem judiciously into subsets and solve them separately. Once sub-solutions are obtained, we combine them into a joint solution.
Continuous Synchronization: While solving the optimization problem we continuously keep partial solutions adjusted and consistent, possibly with delay.

Partition and Merge is quite popular for convex problems. [58, 95] show that it is possible to solve subproblems for distinct partitions and to perform a final merge operation to obtain results that are comparable to global convex solutions. This holds since we know that a) averaging local solutions can only improve estimates and b) near optimality, averaging solutions amounts to variance reduction.

Unfortunately, for nonconvex problems such as clustering, matters are not quite so straightforward. For instance, the inherent invariance in clustering immediately leads to the problem that identical clusters will have different identifiers on different machines. Generally, for such nonconvex settings a great deal of care is required in the Partition and Merge scenario. For instance, [65] discuss solving intermediate linear assignment problems and [50] use staged optimization.

This leaves us with the second approach, Continuous Synchronization, as the only viable alternative in the general nonconvex case. It is well known that this accelerates convergence. For instance, the stochastic EM algorithm [20] for clustering and topic models [60], sequential Monte-Carlo methods for topic models [4, 19], and collapsed samplers [36] all bear witness to this fact, albeit on single processors.

Our approach is to extend dual decomposition algorithms to asynchronous and second order strategies. This addresses two problems with ADMM-style algorithms [18]. The problem

\[ \underset{w}{\text{minimize}} \sum_i f_i(w) \quad\text{is equivalent to}\quad \underset{w_i, z}{\text{minimize}} \sum_i f_i(w_i) \quad\text{subject to}\quad w_i = z. \tag{5} \]

Subsequently we may specify a matching Lagrange function via

\[ L(\{w_i\}, z, \{\mu_i\}, \{\lambda_i\}) = \sum_i \left[ f_i(w_i) + \frac{\mu_i}{2} \|w_i - z\|^2 + \langle \lambda_i, w_i - z \rangle \right]. \tag{6} \]

Hence the problem decomposes into subproblems pertaining to $w_i$, a joint problem in terms of $z$ that is straightforward to solve, and dual problems in terms of $\mu$ and $\lambda$. The strategy proposed by [18] is to perform dual ascent in $\mu$ and $\lambda$. This is problematic since gradient descent algorithms often suffer from slow convergence [13] (also confirmed by preliminary experiments). We aim to address this by employing a second order method in the dual.

Figure 1: Parameter distribution and replication in the parameter server. Data is exchanged between clients and a distributed pool of servers. Fault tolerance is achieved by synchronizing machine state with a consensus server implementing the Paxos protocol [48] and a database backend.

Secondly, the optimization in (6) is carried out synchronously. This slows down convergence between machines and also makes the algorithm sensitive to slow and faulty machines. With 1000s of machines the probability of experiencing at least one malfunction is very high [77].

Proposed Strategy. An alternative is to perform asynchronous gradient ascent. The second order stochastic gradient descent method with exponential rates, as described by [73], can likely be modified to address the scenario where gradients are updated asynchronously. The key difference is that while [73] use the update strategy on instantaneous loss gradients, we use it on subsets of variables in the dual. We plan on drawing on previous experience with distributed second order algorithms [79] in our work. Lastly, we plan on using the memory hierarchy for efficient caching.
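As a point of reference for (5) and (6), the following is a minimal sketch of the synchronous, first-order baseline in the style of [18], not of the asynchronous second order method proposed here. It assumes least-squares terms $f_i(w) = \frac{1}{2}\|A_i w - b_i\|^2$ and a single fixed penalty `mu` shared by all workers rather than a dual variable $\mu_i$ per worker; the function name `consensus_admm` and the synthetic data are hypothetical.

```python
import numpy as np


def consensus_admm(A_parts, b_parts, mu=10.0, iters=300):
    """Synchronous consensus ADMM for sum_i 0.5 * ||A_i w - b_i||^2.

    Assumptions: quadratic f_i and a fixed penalty mu for every worker.
    """
    n = len(A_parts)
    d = A_parts[0].shape[1]
    z = np.zeros(d)
    w = [np.zeros(d) for _ in range(n)]
    lam = [np.zeros(d) for _ in range(n)]
    for _ in range(iters):
        # Local subproblems in w_i: minimize (6) over w_i with z, lam_i fixed.
        for i in range(n):
            lhs = A_parts[i].T @ A_parts[i] + mu * np.eye(d)
            rhs = A_parts[i].T @ b_parts[i] + mu * z - lam[i]
            w[i] = np.linalg.solve(lhs, rhs)
        # Joint problem in z: closed-form minimizer of (6) over z.
        z = np.mean(w, axis=0) + np.mean(lam, axis=0) / mu
        # First-order dual ascent in lam_i on the constraint w_i = z.
        for i in range(n):
            lam[i] = lam[i] + mu * (w[i] - z)
    return z, w, lam


# Synthetic data split across four hypothetical workers.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
A_parts = [rng.normal(size=(30, 3)) for _ in range(4)]
b_parts = [A @ w_true + 0.1 * rng.normal(size=30) for A in A_parts]

z, w, lam = consensus_admm(A_parts, b_parts)

# Reference solution of the unsplit problem for comparison.
w_star = np.linalg.lstsq(np.vstack(A_parts), np.concatenate(b_parts), rcond=None)[0]
```

Every pass of the outer loop waits for all workers before `z` is updated, which is the synchronization barrier the text identifies as a weakness of this baseline.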
Preliminary experiments by [59] show that dual parameter update yields dramatic acceleration relative to algorithms that use data only once at each pass. The challenge is that it is infeasible to communicate the values of associated instance-based Lagrange multipliers as is the default in [59]. However, by virtue of saddle point conditions we believe that it is possible to synchronize only a much smaller set of the parameter space while still retaining near optimality.

3.2 Parameter Server

Distributed optimization requires an efficient mechanism for collecting, updating and rebroadcasting parameters. This is what the Parameter Server is meant to address.

1. We need a parameter distribution mechanism which can be executed without the need for a central directory manager to decide where parts of the model are stored.
2. We need a mechanism for replication and persistent storage of parameters in case some of the machines die. This is needed for unreliable resources such as Amazon’s spot instances.
3. We need to decompose optimization problems such that optimization on individual client machines can proceed without much modification while the parameter server acts as glue between them. What level of ‘intelligence’ does the parameter server need?
4. We need scheduling algorithms to decide which parameters to synchronize more aggressively than others. [54] has demonstrated significant gains from effective prioritization.

As outlined in Section 3.3, many of these questions have been resolved satisfactorily in the context of distributed (key, value) pair storage [23], block replication for distributed file systems [86, 87], and consistent hashing [21, 42]. Our design is an extension of the fully connected bipartite replication mechanism introduced in [3, 76].
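Requirement 1 above, placing parameters without a central directory, is the kind of problem consistent hashing [21, 42] addresses. The following is a generic sketch of that technique, not the design of the proposed parameter server. The class name `HashRing`, the server names, the use of SHA-256, and the number of virtual nodes per server are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring mapping parameter keys to servers."""

    def __init__(self, servers, virtual_nodes=64):
        # Each server owns several points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{server}#{v}"), server)
            for server in servers
            for v in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

    def lookup(self, key, copies=1):
        """Return up to `copies` distinct servers: the primary, then successors."""
        start = bisect.bisect(self._points, self._hash(key))
        owners = []
        for step in range(len(self._ring)):
            server = self._ring[(start + step) % len(self._ring)][1]
            if server not in owners:
                owners.append(server)
            if len(owners) == copies:
                break
        return owners


# Every client can compute the placement locally from the server list alone.
ring = HashRing(["server-0", "server-1", "server-2"])
primary_and_replica = ring.lookup("w:42", copies=2)
```

A useful property of this construction is that when a server leaves, only the keys it owned move to other servers, which keeps re-replication traffic proportional to the lost share of the parameters.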
