Distributed Low Rank Approximation of Implicit Functions of a Matrix

David P. Woodruff, IBM Almaden Research Center, San Jose, CA 95120, USA. Email: [email protected]
Peilin Zhong, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China. Email: [email protected]

Abstract—We study distributed low rank approximation in which the matrix to be approximated is only implicitly represented across the different servers. For example, each of s servers may have an n x d matrix A^t, and we may be interested in computing a low rank approximation to A = f(Sum_{t=1}^s A^t), where f is a function which is applied entrywise to the matrix Sum_{t=1}^s A^t. We show for a wide class of functions f it is possible to efficiently compute a d x d rank-k projection matrix P for which ||A - AP||_F^2 <= ||A - [A]_k||_F^2 + eps*||A||_F^2, where AP denotes the projection of A onto the row span of P, and [A]_k denotes the best rank-k approximation to A given by the singular value decomposition. The communication cost of our protocols is d*(sk/eps)^{O(1)}, and they succeed with high probability. Our framework allows us to efficiently compute a low rank approximation to an entry-wise softmax, to a Gaussian kernel expansion, and to M-estimators applied entrywise (i.e., forms of robust low rank approximation). We also show that our additive error approximation is best possible, in the sense that any protocol achieving relative error for these problems requires significantly more communication. Finally, we experimentally validate our algorithms on real datasets.

I. INTRODUCTION

In many situations the input data to large-scale data mining, pattern recognition, and information retrieval tasks is partitioned across multiple servers. This has motivated the distributed model as a popular research model for computing on such data. Communication is often a major bottleneck, and minimizing communication cost is crucial to the success of useful protocols. Principal Component Analysis (PCA) is a useful tool for analyzing large amounts of distributed data. The goal of PCA is to find a low dimensional subspace which captures the variance of a set of data as well as possible. Furthermore, it can be used in various feature extraction tasks such as [1], [2], and it serves as a preprocessing step in other dimensionality reduction methods such as Linear Discriminant Analysis (LDA) [3], [4].

PCA in the distributed model has been studied in a number of previous works, including [5], [6], [7], [8], [9]. Several communication-efficient algorithms for approximate PCA were provided by these works. In the setting of [8], [9], the information of each data point is completely held by a unique server. In [7] a stronger model called the "arbitrary partition model" is studied: each point is retrieved by applying a linear operation across the data on each server. Despite this stronger model, which can be used in several different scenarios, it is still not powerful enough to deal with certain cases. For instance, if each data point is partitioned across several servers and one wants to apply a non-linear operation to the points, it does not apply. For example, the target may be to analyze important components of Gaussian random Fourier features [10] of the data. Another example could be that each person is characterized by a set of health indicators. The value of an indicator of a person may differ across the records distributed in the hospitals holding data for that person. Because the probability that a person has a problem associated with a health indicator increases when a hospital detects the problem, the real value of an indicator should be close to the maximum among all records. This motivates taking the maximum value of an entry shared across the servers, rather than a sum. These examples cannot be captured by any previous model. Therefore, we focus on a stronger model and present several results in it.

The model (generalized partition model). In the generalized partition model, there are s servers labeled 1 to s. Server t has a local matrix A^t in R^{n x d} (n >> d), each row of which is called a data point. Each of the s servers can communicate with server 1, which we call the Central Processor (CP). Such a model simulates arbitrary point-to-point communication up to a multiplicative factor of 2 in the number of messages and an additive factor of log s per message. Indeed, if server i would like to send a message to server j, it can send the message to server 1 instead, together with the identity j, and server 1 can forward the message to server j. The global data matrix A in R^{n x d} can be computed given {A^t}_{t=1}^s. That is, the (i,j)th entry is A_{i,j} = f(Sum_{t=1}^s A^t_{i,j}), where f : R -> R is a specific function known to all servers. Local computation in polynomial time and linear space is allowed.
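The model above is easy to simulate in a few lines. The sketch below is our illustration, not code from the paper: each of s simulated servers holds a local share A^t, and the global matrix is formed entrywise as A = f(Sum_t A^t), with f = |.| chosen purely as an example.

```python
import numpy as np

# Toy single-process simulation of the generalized partition model: each of
# s servers holds a local matrix A^t, and the global data matrix is
# A[i, j] = f(sum_t A^t[i, j]) for an entrywise f known to all servers.
rng = np.random.default_rng(0)
s, n, d = 4, 100, 8
local = [rng.standard_normal((n, d)) for _ in range(s)]  # the local matrices A^t

f = np.abs  # an example entrywise function; the paper allows a wide class

A = f(sum(local))  # what the CP could materialize, given all local shares
assert A.shape == (n, d)
```

The point of the model is that A is never sent around explicitly; the protocols below only touch a small sample of its rows.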
Approximate PCA and low-rank matrix approximation. Given A in R^{n x d}, k in N_+, and eps in R_+, a low-rank approximation to A is AP, where P is a d x d projection matrix with rank at most k satisfying

||A - AP||_F^2 <= (1+eps) * min_{X: rank(X)<=k} ||A - X||_F^2

or

||A - AP||_F^2 <= min_{X: rank(X)<=k} ||A - X||_F^2 + eps*||A||_F^2,

where the Frobenius norm ||A||_F^2 is defined as Sum_{i,j} A_{i,j}^2. If P has the former property, we say AP is a low-rank approximation to A with relative error. Otherwise, AP is an approximation with additive error. Here, P projects the rows of A onto a low-dimensional space; thus it satisfies the form of approximate PCA. The target is to compute such a P. In Section VI, we will see that the Gaussian random Fourier features and hospital examples above can be easily captured by our model.

Our contributions. Our results can be roughly divided into two parts: 1. An algorithmic framework for additive error approximate PCA for general f(.) and several applications. 2. Lower bounds for relative error approximate PCA for some classes of f(.). Our lower bounds thus motivate the error guarantee of our upper bounds, as they show that achieving relative error cannot be done with low communication.

Algorithmic framework: For a specific function f(.), suppose there is a distributed sampler which can sample rows from the global data matrix A with probability proportional to the square of their l2 norm. Let Q_j be the probability that row j is chosen. If the sampler can precisely report Q_j for a sampled row j, a sampling based algorithm is implicit in the work of Frieze, Kannan, and Vempala [11]: the servers together run the sampler to sample a sufficient number of rows; each server sends its own data corresponding to the sampled rows to server 1; server 1 computes the sampled rows of A, scales them using the set {Q_j | row j is sampled}, and does PCA on the scaled rows. This provides a low-rank approximation to A with additive error. Unfortunately, in some cases it is not easy to design an efficient sampler which can report the sampling probabilities without some error. Nevertheless, we prove that if the sampler can report the sampling probabilities approximately, then the protocol still works. Suppose the total communication of running the sampler requires C words. Then the communication of our protocol is O(s*k^2*d/eps^2 + C) words.

Applications: One only needs to give a proper sampler to apply the framework for a specific function f. The softmax (generalized mean) function is important in multiple instance learning (MIL) [12] and some feature extraction tasks such as [13]. Combining our framework with the sampling method in [14], [15], a sampler for the softmax (GM) function is provided. Here the idea is for each server to locally raise (the absolute value of) each entry of its matrix to the p-th power for a large value of p > 1, and then to apply the l_{2/p}-sampling technique of [14], [15] on the sum (across the servers) of the resulting matrices. This approximates sampling from the matrix which is the entry-wise maximum of the matrices across the servers, and seems to be a new application of l_{2/p}-sampling algorithms.

Several works [10], [16] provide approximate Gaussian RBF kernel expansions. Because the rows of a Gaussian kernel expansion have (nearly) the same length, simple uniform sampling works in these scenarios.

For some psi-functions of M-estimators [17], such as the L1-L2, "fair", and Huber functions, we develop a new sampler which may also be of independent interest. This is related to the fact that such functions have at most quadratic growth, and we can generalize the samplers in [14], [15] from p-th powers to more general functions of at most quadratic growth, similar to the generalization of Braverman and Ostrovsky [18] of the work of Indyk and Woodruff [19] in the context of norm estimation.

Lower bounds: We also obtain several communication lower bounds for computing relative error approximate PCA. These results show hardness in designing relative error algorithms. When f(x) = Omega(x^p) (p > 1), the lower bound is Omega~((1+eps)^{-2/p} n^{1-1/p} d^{1-4/p}) bits. Since n could be very large, this result implies hardness when f(x) grows quickly. When f(x) = x^p (or |x|^p) (p != 0), Omega(1/eps^2) bits of communication are needed. The result also improves the Omega~(skd) lower bound shown in [7] to Omega~(max(skd, 1/eps^2)). When f(.) is max(.), we show an Omega~(nd) bit lower bound, which motivates our use of additive error algorithms as well as of the softmax (GM) function.

Related work. Sampling-based additive error low-rank approximation algorithms were developed by [11]. Those algorithms cannot be implemented in a distributed setting directly, as they assume that the sampler can report probabilities perfectly, which is sometimes hard in a distributed model. In the row partition model, relative error distributed algorithms are provided by [8], [9], and can also be achieved by [20]. Recently, [7] provided a relative error algorithm in the linear, aforementioned arbitrary partition model. We stress that none of the algorithms above can be applied to our generalized partition model.

II. PRELIMINARIES

We define [n] to be the set {1,...,n}. Similarly, [n]-1 denotes the set {0,...,n-1}. A_{i,:}, A_{:,j}, and A_{i,j} denote the ith row, the jth column, and the (i,j)th entry of A, respectively. A^t is the data matrix held by server t. |.|_p, ||.||_F, and ||.|| denote the p-norm, the Frobenius norm, and the spectral norm. [A]_k represents the best rank-k approximation to A. Unless otherwise specified, all vectors are column vectors.

If the row space of A is orthogonal to the row space of B, then ||A||_F^2 + ||B||_F^2 = ||A+B||_F^2. This is the matrix Pythagorean theorem. Thus ||A - AP||_F^2 = ||A||_F^2 - ||AP||_F^2 holds for any projection matrix P. The goal of rank-k approximation to A can therefore also be interpreted as finding a rank-k projection matrix P which maximizes ||AP||_F^2.

For a vector v in R^m and a set of coordinates S, we define v(S) to be the vector in R^m satisfying v(S)_j = v_j if j in S, and v(S)_j = 0 if j not in S.

III. SAMPLING BASED ALGORITHM

We start by reviewing several results shown in [11]. Suppose A in R^{n x d} and there is a sampler s which samples row i of A with probability Q_i satisfying Q_i >= c*|A_i|_2^2/||A||_F^2 for a constant c in (0,1]. Let s independently sample r times, and let the set of samples be {A_{i_1},...,A_{i_r}}. We construct a matrix B in R^{r x d} such that for all i' in [r], B_{i'} = A_{i_{i'}} / sqrt(r * Q_{i_{i'}}). In the following, we show that we only need to compute the best low rank approximation to B to acquire a good approximation to A.

As shown in [11], for all theta > 0, the condition ||A^T A - B^T B||_F <= theta*||A||_F^2 is violated with probability at most O(1/(theta^2 * r)). Thus, if the number of samples r is sufficiently large, ||A^T A - B^T B||_F is small with high probability.

Lemma 1. If ||A^T A - B^T B||_F <= theta*||A||_F^2, then for all projection matrices P' satisfying rank(P') <= k,

| ||AP'||_F^2 - ||BP'||_F^2 | <= theta*k*||A||_F^2.
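The rescaled-row construction above is easy to check numerically. The sketch below is our illustration (variable names are not from the paper): it samples rows with exact length-squared probabilities, builds B, and verifies that B^T B concentrates around A^T A as the O(1/(theta^2 r)) bound predicts.

```python
import numpy as np

# Length-squared row sampling as reviewed from [11]: sample row i with
# probability Q_i = |A_i|_2^2 / ||A||_F^2 and rescale by 1/sqrt(r * Q_i).
rng = np.random.default_rng(1)
n, d, r = 2000, 10, 400
A = rng.standard_normal((n, d))

Q = np.square(A).sum(axis=1) / np.square(A).sum()  # exact sampling probabilities
idx = rng.choice(n, size=r, p=Q)                   # r independent samples
B = A[idx] / np.sqrt(r * Q[idx])[:, None]          # rescaled sampled rows

# ||A^T A - B^T B||_F relative to ||A||_F^2 should be small for r = 400.
rel_err = np.linalg.norm(A.T @ A - B.T @ B) / np.square(A).sum()
assert rel_err < 0.5
```

For this instance `rel_err` typically lands below 0.1, consistent with a Theta(1/sqrt(r)) decay.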
Proof: Without loss of generality, we suppose rank(P') = k. Let {U_{:,1},...,U_{:,k}} be an orthonormal basis of the row space of P'. We have

| ||AP'||_F^2 - ||BP'||_F^2 |
= | ||A U U^T||_F^2 - ||B U U^T||_F^2 |
= | ||AU||_F^2 - ||BU||_F^2 |
= | Sum_{j=1}^k (U_{:,j})^T A^T A U_{:,j} - Sum_{j=1}^k (U_{:,j})^T B^T B U_{:,j} |
<= | Sum_{j=1}^k (U_{:,j})^T (A^T A - B^T B) U_{:,j} |
<= Sum_{j=1}^k |U_{:,j}|_2 * ||A^T A - B^T B|| * |U_{:,j}|_2
<= k*theta*||A||_F^2.

The last inequality follows from ||A^T A - B^T B|| <= ||A^T A - B^T B||_F <= theta*||A||_F^2.

Lemma 2. If for all P' satisfying rank(P') = k we have | ||AP'||_F^2 - ||BP'||_F^2 | <= eps*||A||_F^2, then the rank-k projection matrix P satisfying ||BP||_F^2 = ||[B]_k||_F^2 also provides a good rank-k approximation to A:

||A - AP||_F^2 <= ||A - [A]_k||_F^2 + 2*eps*||A||_F^2.

Proof: Suppose P* provides the best rank-k approximation to A. We have

||AP||_F^2 >= ||BP||_F^2 - eps*||A||_F^2 >= ||BP*||_F^2 - eps*||A||_F^2 >= ||AP*||_F^2 - 2*eps*||A||_F^2 = ||[A]_k||_F^2 - 2*eps*||A||_F^2.

By the matrix Pythagorean theorem, ||A - AP||_F^2 = ||A||_F^2 - ||AP||_F^2 <= ||A||_F^2 - ||[A]_k||_F^2 + 2*eps*||A||_F^2 = ||A - [A]_k||_F^2 + 2*eps*||A||_F^2.

Thus, combining Lemma 1 and Lemma 2, if the matrix B has r = Theta(k^2/eps^2) rows, the projection matrix P which provides [B]_k also provides an O(eps) additive error rank-k approximation to A with constant probability. So, we can just compute the projection matrix which provides the best rank-k approximation to B.

IV. FRAMEWORK FOR DISTRIBUTED PCA

Our protocol implements the previous sampling based algorithm in a distributed setting. Server t in [s] holds a local matrix A^t in R^{n x d}, and the (i,j)th entry of the global data matrix A is computed as A_{i,j} = f(Sum_{t=1}^s A^t_{i,j}). A framework for computing a projection matrix P such that AP is a low-rank approximation to A is presented in Algorithm 1. Here, s is the same sampler as that defined in the previous section, but in the distributed model. Notice that s may be different when f(.) is different.

Algorithm 1 Compute P
1: Input: {A^t in R^{n x d}}_{t=1}^s; k in [d]; eps > 0.
2: Output: rank-k projection matrix P.
3: Set parameters r = Theta(k^2/eps^2), gamma = O(sqrt(1/r)).
4: for i' := 1 -> r do
5:   s samples row i_{i'} from A and reports Q^_{i_{i'}} satisfying (1-gamma)*Q_{i_{i'}} <= Q^_{i_{i'}} <= (1+gamma)*Q_{i_{i'}}.
6: end for
7: For all i' in [r], server t sends A^t_{i_{i'}} to server 1, and server 1 computes B in R^{r x d} satisfying B_{i'} = A_{i_{i'}} / sqrt(r * Q^_{i_{i'}}).
8: Server 1 computes the top k right singular vectors V_{:,1},...,V_{:,k} of B and outputs P = V V^T.

In Algorithm 1, each row i' of B is scaled via the reported Q^_{i_{i'}} and not the true Q_{i_{i'}}. However, this is not an issue.

Lemma 3. For A, B in Algorithm 1, ||A^T A - B^T B||_F <= O(eps/k)*||A||_F^2 holds with constant probability.

Proof: We set r = ceil(1440*k^2/(eps^2*c)), where c is the constant satisfying the condition: for all i in [n], Q_i >= c*|A_i|_2^2/||A||_F^2. Since gamma <= 0.5, for every sampled row, 1-2*gamma <= Q/Q^ <= 1+2*gamma. Then

| E((B^T B)_{i,j}) - (A^T A)_{i,j} |
= | Sum_{t=1}^r Sum_{m=1}^n (Q_m * A_{m,i} * A_{m,j}) / (r * Q^_m) - (A^T A)_{i,j} |
= | Sum_{m=1}^n A_{m,i} * A_{m,j} * (Q_m/Q^_m - 1) |
<= 2*gamma * Sum_{m=1}^n |A_{m,i} * A_{m,j}|
<= 2*gamma * |A_{:,i}|_2 * |A_{:,j}|_2.

Due to x^2 <= 2*(x-y)^2 + 2*y^2 for any x, y in R,

E(((B^T B)_{i,j} - (A^T A)_{i,j})^2)
<= 2*E(((B^T B)_{i,j} - E((B^T B)_{i,j}))^2) + 8*gamma^2 * |A_{:,i}|_2^2 * |A_{:,j}|_2^2
<= 2*Sum_{t=1}^r E((B_{t,i} * B_{t,j})^2) + 8*gamma^2 * |A_{:,i}|_2^2 * |A_{:,j}|_2^2
<= 2*(1+2*gamma)^2 * (||A||_F^2/(c*r)) * Sum_{m=1}^n (A_{m,i}^2 * A_{m,j}^2 / |A_m|_2^2) + 8*gamma^2 * |A_{:,i}|_2^2 * |A_{:,j}|_2^2.

Summing over all entries,

E(||B^T B - A^T A||_F^2)
= Sum_{i,j=1}^d E(((B^T B)_{i,j} - (A^T A)_{i,j})^2)
<= 2*(1+2*gamma)^2 * (||A||_F^2/(c*r)) * Sum_{m=1}^n (1/|A_m|_2^2) * Sum_{i,j=1}^d A_{m,i}^2 * A_{m,j}^2 + (8/(c*r)) * ||A||_F^4
<= (16/(c*r)) * ||A||_F^4 <= (eps^2/(90*k^2)) * ||A||_F^4.

Due to Markov's inequality,

Pr( ||A^T A - B^T B||_F >= (eps/(3k)) * ||A||_F^2 ) <= 1/10.

Without the sampling stage, the only communication involved is to collect the Theta(k^2/eps^2) sampled rows from the s servers.
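As a sanity check of Algorithm 1's core step, the sketch below (a single-process illustration; the perturbation of the reported probabilities is a hypothetical noise model) builds B from rows sampled under perturbed probabilities Q^, takes the top-k right singular vectors of B, and verifies the additive-error guarantee.

```python
import numpy as np

# Non-distributed core of Algorithm 1: sample rows, rescale by the *reported*
# (slightly wrong) probabilities Q_hat, SVD the small matrix B, and output the
# rank-k projection P = V V^T onto B's top right singular vectors.
rng = np.random.default_rng(2)
n, d, k, eps = 2000, 12, 3, 0.5
A = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated cols

Q = np.square(A).sum(axis=1) / np.square(A).sum()
r = int(np.ceil(k**2 / eps**2)) * 40
gamma = 1.0 / np.sqrt(r)
Q_hat = Q * (1 + gamma * rng.uniform(-1, 1, size=n))  # noisy reported probs

idx = rng.choice(n, size=r, p=Q)
B = A[idx] / np.sqrt(r * Q_hat[idx])[:, None]

_, _, Vt = np.linalg.svd(B, full_matrices=False)
P = Vt[:k].T @ Vt[:k]                                 # rank-k projection matrix

opt = np.square(np.linalg.svd(A, compute_uv=False)[k:]).sum()  # ||A-[A]_k||_F^2
err = np.square(A - A @ P).sum()
assert err <= opt + 2 * eps * np.square(A).sum()      # additive-error bound
```

Only the r x d matrix B is ever "communicated" here, which is what makes the protocol cheap when r = Theta(k^2/eps^2) is much smaller than n.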
Theorem 1. With constant probability, Algorithm 1 outputs P for which AP is an O(eps) additive error rank-k approximation to A. Moreover, if the communication of sampling and reporting {Q^_{i_1},...,Q^_{i_r}} uses C words, the communication of the overall protocol needs O(s*k^2*d/eps^2 + C) words.

In addition, to boost the success probability to 1-delta, we can just run Algorithm 1 O(log(1/delta)) times and output the matrix P with maximum ||BP||_F^2. The communication is the same as before up to an O(log(1/delta)) factor.

V. GENERALIZED SAMPLER

To apply Algorithm 1 for a specific f(.), we should implement the distributed sampler s which can sample rows of the global data A such that row i is sampled with probability approximately proportional to |A_i|_2^2. An observation is that |A_i|_2^2 = Sum_{j=1}^d A_{i,j}^2. Thus, the row sampling task can be converted into an entry sampling task: if an entry is sampled, then we choose the entire row as the sample. In the entry-sampling task, we want a sampler which samples entries of A such that A_{i,j} is chosen with probability proportional to A_{i,j}^2 = f(Sum_{t=1}^s A^t_{i,j})^2. Another observation is that in Algorithm 1 we only need to guarantee that the probability that row i is chosen is at least c*|A_i|_2^2/||A||_F^2 for a constant c. Thus, if there exists a function z(x) and a constant c >= 1 such that z(x)/c <= f(x)^2 <= c*z(x), then the sampler which samples A_{i,j} with probability proportional to z(A_{i,j}) can be used in Algorithm 1 when computing f.

We consider an abstract continuous function z(x) : R -> R which satisfies the property P: for all x_1, x_2 in R, if |x_1| >= |x_2|, then x_1^2/z(x_1) >= x_2^2/z(x_2) and z(x_1) >= z(x_2); and z(0) = 0. Suppose a^t in R^l is located on server t. Let a = Sum_{t=1}^s a^t and Z(a) = Sum_{i=1}^l z(a_i).

Theorem 2. There is a protocol which outputs i in [l] with probability (1 +- eps)*z(a_i)/Z(a) + O(l^{-C}) and reports a (1 +- eps) approximation to Z(a) with probability at least 1 - O(l^{-C}), where C is a non-negative constant. The communication is s*poly(log(l)/eps) bits.

Notice that the additive error of O(l^{-C}) can effectively be treated as relative error, since if O(l^{-C}) is an eps-fraction of z(a_j)/Z(a), then the probability that j is sampled at least once is negligible. Therefore, this sampler can be applied in Algorithm 1. In the following, we show how to sample coordinates of a with the desired distribution.

A. Overview

We give an overview of our sampler. We define S_i to be the interval [(1+eps)^i, (1+eps)^{i+1}). Similar to [15], the coordinates of a are conceptually divided into several classes

S_i(a) = { j | z(a_j) in S_i }.

Here, i in Z. Although the number of classes is infinite, there are at most l non-empty classes, which are the ones we care about. To sample a coordinate, our procedure can be roughly divided into two steps:

1) Sample a class S_i(a) with probability approximately proportional to Sum_{j in S_i(a)} z(a_j) / Z(a).
2) Uniformly output a coordinate in S_i(a).

For convenience, we call Sum_{j in S_i(a)} z(a_j) the contribution of S_i(a). To do the first step, we want to estimate Z(a) and the contribution of each class. We can achieve this by estimating the size of each class. Intuitively, if the contribution of S_i(a) is "non-negligible", there are two possible situations: 1. every j in S_i(a) is "heavy", i.e., z(a_j)/Z(a) is large; 2. the size of S_i(a) is large. For the first case, we just find all the "heavy" coordinates to get the size of S_i(a). For the second case, if we randomly drop some coordinates, some coordinates in S_i(a) will survive, and all of those coordinates will be "heavy" among the survivors; thus we can find all the "heavy" survivors to estimate the size of S_i(a). The second step is easier; we show it in V-D.

In V-B, we propose a protocol which can pick out all the "heavy" coordinates. In V-C, we demonstrate the estimation of the size of each class and of Z(a). In V-D, we complete our sampler.

B. Z-HeavyHitters

A streaming algorithm which can find frequent elements whose squared frequency is large compared to the second moment (the "F2-value" of a stream) is described in [21]. Because it provides a linear sketch, it can be easily converted into a distributed protocol in our setting. We call the protocol HeavyHitters. The usage of HeavyHitters is as follows. v^t in R^m is located on server t. Let v = Sum_{t=1}^s v^t. With success probability at least 1-delta, HeavyHitters(v, B, delta) reports all the coordinates j such that v_j^2 >= |v|_2^2 / B. Server 1 gets the reports from HeavyHitters. The communication of this protocol needs O(s*B*poly(log(m))*log(1/delta)) words.

By using HeavyHitters(v, B, delta), we develop a protocol called Z-HeavyHitters, shown in Algorithm 2.

Algorithm 2 Z-HeavyHitters
1: Input: v in R^m; B > 0; delta > 0.
2: Output: D
3: For all t in [ceil(20*log(1/delta))], server 1 samples hash_t : [m] -> [ceil(4*B^2)] from a pairwise independent family of hash functions. Server 1 broadcasts the random seeds; thus for all t, all servers know hash_t.
4: Server 1 sets D = {}.
5: for t := 1 -> ceil(20*log(1/delta)) do
6:   for e := 1 -> ceil(4*B^2) do
7:     H_{t,e} := { j | hash_t(j) = e }
8:     D := D union HeavyHitters(v(H_{t,e}), B, 1/(16*B^2))
9:   end for
10: end for
11: report D

Lemma 4. With probability at least 1-delta, every j in [m] satisfying z(v_j) >= Z(v)/B is reported by Z-HeavyHitters(v, B, delta). The communication cost is O(s*poly(B*log(m))*log(1/delta)).

The intuition is that any two coordinates which are both heavy in Z(v) do not collide in the same bucket with constant probability. Due to property P of z(.), the "heavy" coordinate will then be reported by HeavyHitters.
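A toy version of Algorithm 2's bucketing idea can be written as follows. This is our illustration only: the sketch-based HeavyHitters of [21] is replaced by an exact per-bucket scan, so the code shows why hashing isolates heavy coordinates, not how the sketch itself works.

```python
import numpy as np

# Toy, non-streaming stand-in for Z-HeavyHitters (Algorithm 2): hash the m
# coordinates into 4*B^2 buckets; inside each bucket, report every j with
# v_j^2 >= |v(bucket)|_2^2 / B (computed exactly here, in place of the
# sketch-based HeavyHitters). D may contain extra, locally heavy coordinates;
# the Lemma 4 guarantee is only that globally heavy ones land in D.
rng = np.random.default_rng(3)
m, B = 10_000, 10
v = rng.uniform(-1, 1, size=m)
v[7] = 50.0                      # one coordinate heavy for z(x) = x^2

reported = set()
for t in range(20):              # repetitions boost the success probability
    buckets = rng.integers(0, 4 * B * B, size=m)   # stand-in for hash_t
    for e in np.unique(buckets):
        members = np.flatnonzero(buckets == e)
        f2 = np.square(v[members]).sum()           # the bucket's F2-value
        reported.update(members[np.square(v[members]) >= f2 / B].tolist())
assert 7 in reported
```

The heavy coordinate 7 shares its bucket with only about m/(4B^2) light coordinates, so it dominates the bucket's F2-value and is always reported.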
Proof: Because hash_t is drawn from a pairwise independent family, for fixed t in [ceil(20*log(1/delta))] and j, j' in [m] with j != j', the probability that hash_t(j) = hash_t(j') is (4*B^2)^{-1}. By a union bound, for a fixed t, the probability that no two different coordinates j, j' in [m] with z(v_j), z(v_{j'}) >= Z(v)/B collide is at least 3/4.

Claim: For fixed t, e, suppose H_{t,e} has at most one coordinate j such that z(v_j) >= Z(v)/B. Then j is reported by HeavyHitters(v(H_{t,e}), B, 1/(16*B^2)) with probability at least 1 - 1/(16*B^2).

If j is the unique coordinate in the set H_{t,e} such that z(v_j) >= Z(v)/B, then we have (v_j)^2 / z(v_j) >= |v(H_{t,e})|_2^2 / Z(v(H_{t,e})) due to property P of the function z(.). Because Z(v) >= Z(v(H_{t,e})), we have (v_j)^2 / |v(H_{t,e})|_2^2 >= z(v_j)/Z(v) >= 1/B. When HeavyHitters(v(H_{t,e}), B, 1/(16*B^2)) succeeds, j is reported. Thus the claim is true. By a union bound, for a fixed t, all invocations of HeavyHitters succeed and all the coordinates j with z(v_j) >= Z(v)/B are added to D with probability at least 1/2. Due to the Chernoff bound, the probability that there is a round t in [ceil(20*log(1/delta))] in which all the coordinates j with z(v_j) >= Z(v)/B are added to D is at least 1 - delta.

C. Estimation of |S_i(a)| and Z(a)

Set T = ceil(log(l)/eps) + 1. Here, C is the constant defined in Theorem 2. If |S_i(a)|*(1+eps)^i >= eps*Z(a)/(40T), we say S_i(a) contributes. Define S_i(a) to be considering when (1+eps)^{i+1} in [Z(a)/l^C, (1+eps)*Z(a)]. Due to T/eps = o(l), for a sufficiently large constant C, a contributing S_i(a) must be considering. Define Z^{NC}(a) to be the sum of all z(a_j) such that j belongs to a non-contributing S_i(a). Z^{NC}(a) is bounded as:

Z^{NC}(a) < l * Z(a)/l^C + Sum_{considering, non-contributing S_i(a)} |S_i(a)|*(1+eps)^{i+1}.

Because l^{1-C} < eps/2 and the number of considering classes is at most T, Z^{NC}(a) < eps*Z(a). This implies that if |S_i(a)| can be estimated for all contributing S_i(a), we can get a (1 + Theta(eps)) approximation to Z(a).

In Algorithm 3, we present a protocol Z-estimator which outputs Z^ as an estimate of Z(a) and s^_i as an estimate of |S_i(a)|. The outputs List and g(.) will be used in Algorithm 4 to output the final coordinate. For convenience in analyzing the algorithm, we define U(a) = Union over non-considering S_i(a).

Algorithm 3 Z-estimator
1: Input: {a^t in R^l}_{t=1}^s; eps > 0.
2: Output: Z^, s^_{-inf}, ..., s^_{+inf}, List, g(.)
3: B = 40*eps^{-4}*T^3*log(l), W = (5120*C^2*T^2*eps^{-3}*log(l))^2
4: Initialize: Z^ = 0, s^_{-inf} = 0, ..., s^_{+inf} = 0
5: D := Z-HeavyHitters(a, B, l^{-20C})
6: For all j in D, server 1 asks for a^t_j for t in [s] to compute a_j, and sets s^_i = |S_i(a) intersect D|.
7: Server 1 samples g : [ceil(C*eps^{-1}*l)] -> [ceil(C*eps^{-1}*l)] from a (20*C*log(eps^{-1}*l))-wise independent family of hash functions. For all j in [ceil(log(C*eps^{-1}*l))], e in [ceil(C*log(l))], h_{j,e} : [l] -> [W] is independently sampled from a family of pairwise independent hash functions. Server 1 broadcasts the random seeds.
8: for j in [ceil(log(C*eps^{-1}*l))], e in [ceil(C*log(l))], w in [W] do
9:   S_j := { i | i in [l], g(i) <= 2^{-j}*ceil(C*eps^{-1}*l) }
10:  S_{j,e,w} := { i | i in S_j, h_{j,e}(i) = w }, D_{j,e,w} := Z-HeavyHitters(a(S_{j,e,w}), B, l^{-20C})
11:  D_j := Union_{e,w} D_{j,e,w}. Server 1 communicates with the other servers to compute a_p for all p in D_j.
12:  For all i in Z, if |S_i(a) intersect D_j| >= 4*C^2*eps^{-2}*log(l), set s^_i := max(s^_i, 2^j*|S_i(a) intersect D_j|).
13: end for
14: Report Z^ := Sum_i s^_i*(1+eps)^i, s^_{-inf}, ..., s^_{+inf}, List := D union (Union_j D_j), and g(.).

Lemma 5 (Proposition 3.1 in [15]). With probability at least 1 - l^{-Theta(C)}, the following event E happens:
1) For all R in {S_i(a) | S_i(a) is considering} union {U(a)} and all j in [ceil(log(C*eps^{-1}*l))], |R intersect S_j| <= 2^{-j}*|R| + C^2*log(l).
2) If E(|R intersect S_j|) >= C^2*eps^{-2}*log(l), then 2^j*|R intersect S_j| = (1 +- eps)*|R|.
3) All the invocations of Z-HeavyHitters succeed.

Lemma 6. When E happens, if S_i(a) contributes, then with probability 1 - l^{-Theta(C)} the following happens:
1) s^_i >= (1-eps)*|S_i(a)|.
2) S_i(a) is a subset of D, or there exists j in [ceil(log(C*eps^{-1}*l))] such that S_i(a) intersect D_j != {}.

The intuition of Lemma 6 is that if S_i(a) contributes, there are two cases: if (1+eps)^i is very large, S_i(a) will be reported in D. Otherwise, there will be a j* such that Sum_{i'>=i} |S_{j*} intersect S_{i'}(a)| is bounded. Therefore, for a fixed e, all the elements belonging to Union_{i'>=i} S_{i'}(a) intersect S_{j*} can be hashed into different buckets. Meanwhile, (1+eps)^i is heavy in Sum_{i'<i} |S_{j*} intersect S_{i'}(a)|*(1+eps)^{i'+1}. Thus, all elements in S_i(a) intersect S_{j*} will be reported in D_{j*}, and |S_i(a)| can thus be estimated.

Proof: Assume E happens, and suppose S_i(a) contributes. If |S_i(a)| < 16*C^2*eps^{-2}*log(l), then due to |S_i(a)|*(1+eps)^i >= eps*Z(a)/(40T),

(1+eps)^i >= Z(a)/(640*C^2*eps^{-3}*T*log(l)).

Because B = 40*eps^{-4}*T^3*log(l) > 640*C^2*eps^{-3}*T*log(l), all the elements in S_i(a) will be in D when line 5 is executed, so

s^_i = |S_i(a) intersect D| = |S_i(a)|.

If |S_i(a)| >= 16*C^2*eps^{-2}*log(l), there exists a unique j* such that 8*C^2*eps^{-2}*log(l) <= 2^{-j*}*|S_i(a)| < 16*C^2*eps^{-2}*log(l). For i' >= i,

|S_i(a)|*(1+eps)^i >= eps*Z(a)/(40T) >= eps*|S_{i'}(a)|*(1+eps)^{i'}/(40T).

So,

2^{-j*}*|S_{i'}(a)| <= 40*T*eps^{-1}*2^{-j*}*|S_i(a)| <= 40*T*eps^{-1}*16*C^2*eps^{-2}*log(l) = 640*C^2*eps^{-3}*T*log(l).

Due to E, |S_{i'}(a) intersect S_{j*}| <= 1280*C^2*eps^{-3}*T*log(l) + C^2*log(l) <= 2560*C^2*eps^{-3}*T*log(l). Because the number of considering classes is bounded by T,

Sum_{i'>=i} |S_{i'}(a) intersect S_{j*}| <= 2560*C^2*eps^{-3}*T^2*log(l).

For a fixed e, define the event F: for all p, q in Union_{i'>=i} S_{i'}(a) intersect S_{j*} with p != q, h_{j*,e}(p) != h_{j*,e}(q). Because h_{j*,e} is pairwise independent, Pr(not F) is bounded by (2560*C^2*eps^{-3}*T^2*log(l))^2 / W = 1/4 by a union bound.

For i' < i, because S_i(a) contributes, we have

|S_i(a)|*(1+eps)^i >= eps*(1+eps)^{i'}*|S_{i'}(a)|/(40T),

so |S_{i'}(a)| <= 40*T*eps^{-1}*|S_i(a)|*(1+eps)^{i-i'}. Thus,

Sum_{i'<i} |S_{i'}(a) intersect S_{j*}|*(1+eps)^{i'}
<= 2*40*T*eps^{-1}*2^{-j*}*|S_i(a)|*(1+eps)^i + Sum_{i'<i} C^2*log(l)*(1+eps)^{i'}
<= 4*40*T*eps^{-1}*16*C^2*eps^{-2}*log(l)*(1+eps)^i
= 2560*C^2*eps^{-3}*T*log(l)*(1+eps)^i.

Therefore, conditioned on E and F, for a fixed e and every p in S_i(a) intersect S_{j*,e,w}, we have

Z(a(S_{j*,e,w})) <= 2560*C^2*eps^{-3}*T*log(l)*(1+eps)^i + z(a_p) <= 5120*C^2*eps^{-3}*T^2*log(l)*(1+eps)^i.

Because z(a_p) >= Z(a(S_{j*,e,w}))/B, we get S_i(a) intersect S_{j*} = S_i(a) intersect Union_w D_{j*,e,w}. Since e is iterated from 1 to ceil(C*log(l)) and D_{j*} = Union_{e,w} D_{j*,e,w}, with probability 1 - l^{-Theta(C)}, S_i(a) intersect S_{j*} = S_i(a) intersect D_{j*}. Notice that |S_i(a) intersect S_{j*}| >= 8*C^2*eps^{-2}*log(l) >= 4*C^2*eps^{-2}*log(l). Thus, when j = j*, s^_i will be set to (1 +- eps)*|S_i(a)| with probability at least 1 - l^{-Theta(C)}.

Theorem 3. With probability at least 1 - l^{-Theta(C)}, Algorithm 3 outputs Z^ = (1 +- Theta(eps))*Z(a), and for all i in Z, s^_i <= (1+eps)*|S_i(a)|.

Proof: Suppose E happens. If s^_i is assigned by line 6, then s^_i = |D intersect S_i(a)| <= |S_i(a)|. Otherwise, s^_i is assigned by line 12, and due to E, s^_i = 2^j*|S_i(a) intersect D_j| <= 2^j*|S_i(a) intersect S_j| <= (1+eps)*|S_i(a)|. Due to Lemma 6 and the bound on Z^{NC}(a), Z^ = (1 +- Theta(eps))*Z(a).

Actually, Algorithm 3 can be implemented with two rounds of communication. In the first round, all the servers together compute List. In the second round, server 1 checks each element in List and estimates s^_i.

D. The sampler

Since Z^(a) is a (1 +- eps) approximation to Z(a), the coordinate injection technique of [15] can be applied. Define a considering S_i(a) to be growing when (1+eps)^i <= Z^(a)/(5*eps^{-4}*T^3*log(l)). If S_i(a) is growing, then server 1 appends ceil(eps*Z^(a)/(5*T*(1+eps)^i)) coordinates with value z^{-1}((1+eps)^i) to its vector a^1, and the other servers append a consistent number of 0s to their own vectors. Since z(.) is an increasing function, if z^{-1}((1+eps)^i) does not exist, S_i(a) must be empty and we can ignore this class. Thus server t obtains a vector a'^t, and the global vector is a' = Sum_{t=1}^s a'^t. Similar to Lemmas 3.2, 3.3, and 3.4 of [15], Z(a) <= Z(a') <= (1+eps)*Z(a), and for every growing S_i(a), S_i(a') contributes. Furthermore, the dimension of a' is l' = poly(l). We get the final sampler in Algorithm 4.

Algorithm 4 Z-sampler
1: Input: {a'^t in R^{l'}}_{t=1}^s; eps > 0
2: Output: a coordinate p of a'
3: Z^, s^_{-inf}, ..., s^_{+inf}, List, g := Z-estimator(a', eps)
4: Choose i* in Z with probability s^_{i*}*(1+eps)^{i*}/Z^.
5: Choose p satisfying g(p) = min_{q in List intersect S_{i*}(a')} g(q). If p is a non-injected coordinate, output p. Otherwise, output FAIL.

Lemma 7. With probability 1 - l^{-Theta(C)}, the following happens for all considering S_i(a'): s^_i = (1 +- eps)*|S_i(a')|, and if p_i satisfies g(p_i) = min_{q in S_i(a')} g(q), then p_i belongs to List.

Proof: Suppose Z-estimator succeeds. If S_i(a) is growing, then S_i(a') contributes. Thus, due to Lemma 6 and Theorem 3, s^_i = (1 +- eps)*|S_i(a')|. Point 2 of Lemma 6 also implies that p_i in List. If S_i(a) is not growing, then (1+eps)^i > Z^(a)/(5*eps^{-4}*T^3*log(l)) > Z(a')/B, so D intersect S_i(a') = S_i(a'). Therefore, p_i in List and s^_i = |S_i(a')|.

Due to Lemma 7, in line 5 of Algorithm 4, p satisfies g(p) = min_{q in S_{i*}(a')} g(q).

Lemma 8 (Theorem 3.3 in [15]).

Pr( g(p) = min_{q in S_i(a')} g(q) ) = (1 +- Theta(eps)) / |S_i(a')|.

Since s^_{i*} = (1 +- eps)*|S_{i*}(a')|, according to Theorem 3, Lemma 7, and Lemma 8, Z-sampler samples coordinate p with probability (1 +- Theta(eps))*z(a'_p)/Z(a') +- l^{-Theta(C)}. Since the injected coordinates contribute at most a Theta(eps) fraction of Z(a'), the probability that Z-sampler outputs FAIL is at most Theta(eps). Thus, we can repeat Z-sampler O(C*log(l)) times and choose the first non-injected coordinate; the probability that Z-sampler outputs a non-injected coordinate at least once is 1 - l^{-Theta(C)}. Because Z(a') = (1 +- Theta(eps))*Z(a), each non-injected coordinate is sampled with probability (1 +- Theta(eps))*z(a_i)/Z(a) +- l^{-Theta(C)}. By adjusting C and eps by a constant factor, Theorem 2 is shown.
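The two-step sampler can be illustrated end to end if the Z-estimator's approximate class sizes are replaced by exact ones, which we do below purely for illustration: step (1) picks a class with probability proportional to its contribution, and step (2) outputs the class member with minimal g-value, which is uniform over the class when g is a random hash (the content of Lemma 8).

```python
import numpy as np

# Toy version of Algorithms 3-4 with exact class contributions: the sampled
# coordinate p should appear with probability close to z(a_p)/Z(a), up to the
# (1+eps) width of its class.
rng = np.random.default_rng(4)
eps, l, draws = 0.5, 1000, 20_000
a = rng.uniform(0.1, 10.0, size=l)
z = np.square(a)                                   # z(x) = x^2; Z(a) = z.sum()

cls = np.floor(np.log(z) / np.log(1 + eps)).astype(int)   # class index i
levels = np.unique(cls)
members = [np.flatnonzero(cls == i) for i in levels]
contrib = np.array([z[m].sum() for m in members])  # exact class contributions
probs = contrib / contrib.sum()

hits = np.zeros(l)
for _ in range(draws):
    g = rng.random(l)                              # fresh random hash g
    c = rng.choice(len(levels), p=probs)           # step (1): sample a class
    m = members[c]
    hits[m[np.argmin(g[m])]] += 1                  # step (2): min-g member

top = np.argsort(z)[-100:]                         # heaviest 10% of coordinates
share, target = hits[top].sum() / draws, z[top].sum() / z.sum()
assert 0.5 * target < share < 2 * target
```

The empirical mass landing on the heaviest coordinates tracks their z-mass up to the class-width distortion, which is exactly the (1 +- Theta(eps)) slack in Theorem 2.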
VI. APPLICATIONS

A. Gaussian random Fourier features

The Gaussian RBF kernel [22] on two d-dimensional vectors x, y is defined as

K(x, y) = phi(x)^T phi(y) = e^{-|x-y|_2^2 / 2}.

According to the Fourier transformation of K(x, y),

K(x, y) = Integral over R^d of p(z) * e^{i z^T (x-y)} dz,

where p(z) = (2*pi)^{-d/2} * e^{-|z|_2^2 / 2}. Suppose z is a d-dimensional random vector with each entry drawn from N(0,1). As shown in [10], [16], estimating E_z(e^{i z^T x} * e^{-i z^T y}) by sampling such vectors z provides a good approximation to K(x, y). According to [10], if the samples are z_1, ..., z_l, then phi(x) can be approximated by

sqrt(2) * (cos(z_1^T x + b_1), ..., cos(z_l^T x + b_l))^T,

where b_1, ..., b_l in R are i.i.d. samples drawn from the uniform distribution on [0, 2*pi].

In the distributed setting, a matrix M^t in R^{n x m} is stored on server t in [s]. The global matrix is computed by M = Sum_{t=1}^s M^t. Let each entry of Z in R^{m x d} be an i.i.d. sample drawn from N(0,1), and let each entry of b in R^d be an i.i.d. sample drawn from the uniform distribution on [0, 2*pi]. An approximate Gaussian RBF kernel expansion of M is the matrix A for which

A_{i,j} = sqrt(2) * cos((M_i Z)_j + b_j).

For fixed i, j, one observation is

E(A_{i,j}^2) = E(2*cos^2((M_i Z)_j + b_j)) = 1.

Due to Hoeffding's inequality, when d = Theta(log(n)), with high probability |A_i|_2^2 = Theta(d) for all i in [n]. Thus server 1 can obtain O(k^2/eps^2) rows of M via uniform sampling and generate Z, b with d = O(log(n)) to compute approximate PCA on these random Fourier features of M.
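The kernel approximation above is easy to verify numerically. The sketch below is our illustration; it uses the sqrt(2/d)-normalized feature map (the paper's sqrt(2)-scaled features with the 1/d averaging folded in) and checks the inner product against the exact Gaussian kernel.

```python
import numpy as np

# Random Fourier features for the Gaussian RBF kernel: with Z ~ N(0,1)^{m x d}
# and b ~ U[0, 2*pi]^d, phi(x).phi(y) concentrates around exp(-|x-y|^2 / 2).
rng = np.random.default_rng(5)
m, d = 6, 20_000
x = rng.standard_normal(m) * 0.3
y = rng.standard_normal(m) * 0.3

Z = rng.standard_normal((m, d))
b = rng.uniform(0, 2 * np.pi, size=d)
phi = lambda v: np.sqrt(2.0 / d) * np.cos(v @ Z + b)

exact = np.exp(-np.square(x - y).sum() / 2)
approx = phi(x) @ phi(y)
assert abs(exact - approx) < 0.1
```

Here d is taken large only to make the check tight; the section's point is that d = Theta(log(n)) already suffices to make all feature rows have norm Theta(d).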
B. Softmax (GM)

The generalized mean function GM(.) with a parameter p, applied to n positive reals x_1, ..., x_n, is defined as:

GM(x_1, ..., x_n) = ( (1/n) * Sum_{i=1}^n x_i^p )^{1/p}.

When p is very large, GM(.) behaves like max(.). If p = 1, GM(.) is just mean(.). We discuss the following situation: each server i in [s] holds a local matrix M^i in R^{n x d}, and the global data A is formed by

A_{i,j} = GM(|M^1_{i,j}|, ..., |M^s_{i,j}|).

Since server t can locally compute A^t such that

A^t_{i,j} = (1/s) * |M^t_{i,j}|^p,

the setting meets the generalized partition model with f(x) = x^{1/p}. So, we can apply the sampling algorithms in [14], [15], and our generalized sampler also works in this scenario. Furthermore, the communication cost of our algorithm does not depend on p. Therefore, if we set p = log(nd), the word size of A^t_{i,j} is the same as the word size of M^t_{i,j} up to a factor of log(nd). But for an arbitrary constant c' in (0,1), when n, d are sufficiently large,

GM(|M^1_{i,j}|, ..., |M^s_{i,j}|) > c' * max(|M^1_{i,j}|, ..., |M^s_{i,j}|)

holds. As shown in our lower bound results, computing relative error approximate PCA for max(.) is very hard.

C. M-estimators

In applications, it is possible that some parts of the raw data are contaminated by a large amount of noise. Because traditional PCA is sensitive to outliers, certain forms of robust PCA have been developed. By applying a proper function f(.), we can find a low rank approximation to a matrix that has had a threshold applied to each of its entries, that is, a matrix A whose (i,j)th entry is bounded by a threshold T. This is useful if one wants to preserve the magnitude of entries except for those that have been damaged by noise, which one wants to cap off. psi-functions of some M-estimators can be used to achieve this purpose; some of them are listed in Table I. Suppose psi(x) is one of those functions and the global data A satisfies A_{i,j} = psi(Sum_{t=1}^s A^t_{i,j}). Because psi(x)^2 satisfies the property P, combining it with our framework and generalized sampler works for computing approximate PCA for such A. However, computing relative error approximate PCA for A may be very hard: our result shows that at least Omega~(nd) bits are needed to get relative error when f(.) is the psi-function of Huber.

TABLE I. psi-FUNCTIONS OF SEVERAL M-ESTIMATORS

  Huber:   psi(x) = k*sgn(x) if |x| > k;  psi(x) = x if |x| <= k
  L1-L2:   psi(x) = x / (1 + x^2/2)^{1/2}
  "Fair":  psi(x) = x / (1 + |x|/c)

VII. LOWER BOUNDS FOR RELATIVE ERROR ALGORITHMS

In this section, several lower bounds for relative error approximate PCA are obtained. Our results indicate that it is hard to compute relative error approximate PCA in many cases.

A. Notations

e^n_i denotes a 1 x n binary vector whose unique non-zero entry is on coordinate i. e-bar^n_i is a 1 x n binary vector whose unique zero entry is on coordinate i. 1^n denotes a 1 x n vector of n "1"s. I_n is the identity matrix of size n.
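Before the lower bounds, the observation in VI-B that GM with a large parameter p tracks max(.) — which is what connects the softmax application to the max(.) hardness below — can be checked directly. The sketch is a toy illustration of ours, not an experiment from the paper.

```python
import numpy as np

# GM_p(x_1,...,x_n) = (mean(x_i^p))^(1/p): for fixed positive inputs, the
# generalized mean increases toward max as p grows, so GM_p > c' * max holds
# for any fixed c' in (0,1) once p is large enough.
def gm(xs, p):
    xs = np.asarray(xs, dtype=float)
    return np.mean(xs ** p) ** (1.0 / p)

xs = [0.2, 1.3, 3.0, 7.5]
for p, c in [(1, 0.3), (8, 0.7), (64, 0.95)]:
    assert gm(xs, p) > c * max(xs)
assert abs(gm(xs, 200) - max(xs)) < 0.1   # GM_200 is already within ~1% of max
```

For n inputs, GM_p >= max * n^{-1/p}, so p = log(nd) suffices to make the constant c' as close to 1 as desired — the choice made in VI-B.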
VII. LOWER BOUNDS FOR RELATIVE ERROR ALGORITHMS

In this section, several lower bounds for relative error approximate PCA are obtained. Our results indicate that it is hard to compute a relative error approximate PCA in many cases.

A. Notations

e^n_i denotes the 1×n binary vector whose unique non-zero entry is on coordinate i, and ē^n_i denotes the 1×n binary vector whose unique zero entry is on coordinate i. 1_n denotes the 1×n all-ones vector, and I_n is the identity matrix of size n. Below, block matrices are written with block rows separated by semicolons.

B. Lower bounds for p ≥ 1

Theorem 4. Alice has A^1 ∈ R^{n×d} and Bob has A^2 ∈ R^{n×d}. Let A_{i,j} = f(A^1_{i,j} + A^2_{i,j}). If f(x) = Ω(|x|^p) with p > 1, then computing a rank-k projection matrix P, with constant probability greater than 1/2, such that

||A − AP||_F^2 ≤ (1+ε)||A − [A]_k||_F^2

requires at least Ω̃((1+ε)^{−2/p} n^{1−1/p} d^{1−4/p}) bits of communication.

We prove Theorem 4 by a reduction from the L_∞ problem [23].

Theorem 5 (Theorem 8.1 of [23]). There are two players. Alice gets x and Bob gets y, where x and y are length-m vectors with x_i, y_i ∈ {0,1,...,B} for all i. There is a promise on the input (x,y): either |x_i − y_i| ≤ 1 for all i ∈ [m], or there exists i with |x_i − y_i| = B while |x_j − y_j| ≤ 1 for all j ≠ i. The goal is to determine which case the input is in. If the players want to output the right answer with constant probability greater than 3/4, then Ω(m/B^2) bits of communication are required.

Proof of Theorem 4: Without loss of generality, assume f(x) ≡ |x|^p; we argue by contradiction. If Alice or Bob could compute the projection matrix P with õ((1+ε)^{−2/p} n^{1−1/p} d^{1−4/p}) bits of communication and success probability 2/3, then they could solve the L_∞ problem with o(m/B^2) bits for m = n·d and B = ⌈(2(1+ε)^2 n d^4)^{1/(2p)}⌉.

Assume Alice and Bob hold two m-dimensional vectors x and y respectively, where m = n·d, and (x,y) satisfies the promise of Theorem 5. The goal of the players is to decide whether there is a coordinate i with |x_i − y_i| ≥ ⌈(2(1+ε)^2 n d^4)^{1/(2p)}⌉ = B. They agree on the following protocol:

1) Initially, set round r := 0 and n_0 := n.
2) Alice arranges x into an n×d matrix A′^1_{(0)}, and Bob arranges −y into A′^2_{(0)} in the same way.
3) Alice forms A^1_{(r)} = [ A′^1_{(r)}  0 ; 0  B·I_{k−1} ], and Bob forms A^2_{(r)} = [ A′^2_{(r)}  0 ; 0  0 ].
4) Alice computes the projection matrix P satisfying that, when the protocol succeeds, A_{(r)}P is the rank-k approximation to A_{(r)}, where (A_{(r)})_{i,j} = |(A^1_{(r)})_{i,j} + (A^2_{(r)})_{i,j}|^p.
5) Alice sorts e^{d+k−1}_1, ..., e^{d+k−1}_{d+k−1} into e^{d+k−1}_{i_1}, ..., e^{d+k−1}_{i_{d+k−1}} such that |e^{d+k−1}_{i_1}P|_2 ≥ ... ≥ |e^{d+k−1}_{i_{d+k−1}}P|_2. She finds i_l ∈ [d+k−1] which satisfies i_l ≤ d and l ≤ k.
6) Alice repeats steps 4-5 ⌈100 ln(log_d n) + 1⌉ times and sets c to be the most frequent i_l.
7) Alice rearranges the entries of column c of A′^1_{(r)} into d columns (if n_r < d, she fills d − n_r columns with zeros). Thus Alice gets A′^1_{(r+1)} with ⌈n_r/d⌉ rows. She sends c to Bob.
8) Bob receives c and rearranges the entries of column c of A′^2_{(r)} into d columns in the same way, obtaining A′^2_{(r+1)} with ⌈n_r/d⌉ rows. If A′^2_{(r+1)} has a unique non-zero entry (A′^2_{(r+1)})_{1,j}, he sends j to Alice, and Alice checks whether |(A′^1_{(r+1)})_{1,j} + (A′^2_{(r+1)})_{1,j}| = B. Otherwise, let r := r+1 and n_{r+1} := ⌈n_r/d⌉, and repeat steps 3-8.

Claim: If there exist i,j with |(A^1_{(r)})_{i,j} + (A^2_{(r)})_{i,j}| ≥ B and j ≤ d, and Alice successfully computes P in step 4, then i_l equals j in step 5.

To see this, assume Alice successfully computes P and notice that Σ_{t=1}^{d+k−1} |e^{d+k−1}_t P|_2^2 = ||P||_F^2 = k. For any j ≤ d, if i_l ≠ j then |e^{d+k−1}_j P|_2^2 ≤ k/(k+1). So if |(A^1_{(r)})_{i,j} + (A^2_{(r)})_{i,j}| ≥ B, we would have

||A_{(r)} − A_{(r)}P||_F^2 ≥ ((A_{(r)})_{i,j} − (A_{(r)}P)_{i,j})^2
≥ (B^p − (k/(k+1))^{1/2}(B^p + d))^2
> (2(1+ε)√(n_r) d − d)^2
≥ (1+ε)^2 n_r d^2 > (1+ε)^2 n_r d
≥ (1+ε)^2 min_{X: rank(X)≤k} ||A_{(r)} − X||_F^2
> ||A_{(r)} − A_{(r)}P||_F^2,

a contradiction. This proves the claim.

Because the probability that Alice successfully computes P in step 4 is at least 2/3, by a Chernoff bound the probability that more than half of the repetitions in step 6 successfully compute a rank-k approximation is at least 1 − 1/(8 log_d n). By the claim, if there exist i,j with |(A^1_{(r)})_{i,j} + (A^2_{(r)})_{i,j}| ≥ B and j ≤ d, then c will equal j with probability at least 1 − 1/(8 log_d n). Applying a union bound over the at most log_d n rounds, if there is a coordinate with |x_i − y_i| ≥ B, then the probability that this entry survives until the final check is at least 7/8. Therefore the players can output the right answer for the L_∞ problem with probability at least 7/8.

We now analyze the communication cost of the above protocol. There are at most log_d n rounds; each round involves ⌈100 ln(log_d n) + 1⌉ repetitions in step 6 and the cost of sending a column index. Combining with the final check, the total cost is õ((1+ε)^{−2/p} n^{1−1/p} d^{1−4/p}) bits, whereas by Theorem 5 it must be Ω̃((1+ε)^{−2/p} n^{1−1/p} d^{1−4/p}) bits. This is a contradiction. ∎

In the above reduction, the main observation is that B^p is large enough that Alice and Bob can distinguish a large column. Now consider a function with f(x) = Ω(|x|^p) and let B = O((2(1+ε)^2 n d^4)^{1/(2p)}); the above reduction still works. Theorem 4 thus implies that it is hard to design an efficient relative error approximate PCA protocol for an f(·) which grows very fast.

Theorem 6. Alice and Bob have matrices A^1, A^2 ∈ R^{n×d} respectively. Let A_{i,j} = max(A^1_{i,j}, A^2_{i,j}) (or ψ(A^1_{i,j} + A^2_{i,j}), where ψ(x) is the ψ-function of Huber). For k > 1, computing a rank-k projection matrix P with probability 2/3 such that

||A − AP||_F^2 ≤ (1+ε)||A − [A]_k||_F^2

needs Ω̃(nd) bits of communication.

The reduction is from the 2-DISJ problem [24]. This result motivates our focus on additive error algorithms when f(·) is GM or the ψ-function of an M-estimator.

Theorem 7. [24] Alice and Bob are given two n-bit binary vectors. Either there exists a unique entry which is 1 in both vectors, or for every entry at least one of the vectors is 0. If they want to distinguish these two cases with probability 2/3, the communication cost is Ω(n) bits.

Proof of Theorem 6: Specifically, for the Huber ψ-function we assume ψ(0) = 0, ψ(1) = 1 and ψ(2) = 1. Alice and Bob are given (n·d)-bit binary vectors x and y respectively, satisfying the promise of Theorem 7. If they can compute a projection matrix P as in Theorem 6 with probability 2/3, then they can agree on the following protocol to solve 2-DISJ:

1) Initialize round r := 0 and n_0 := n.
2) The players flip all the bits of x and y and arrange the flipped x and y into n×d matrices A′^1_{(0)} and A′^2_{(0)} respectively.
3) Alice forms A^1_{(r)} = [ A′^1_{(r)}  0 ; 1_d  0 ; 0  I_{k−2} ], and Bob forms A^2_{(r)} = [ A′^2_{(r)}  0 ; 0  0 ; 0  0 ].
4) Let (A_{(r)})_{i,j} = max((A^1_{(r)})_{i,j}, (A^2_{(r)})_{i,j}) (or ψ((A^1_{(r)})_{i,j} + (A^2_{(r)})_{i,j})). Alice computes P as in the statement of Theorem 6; if the protocol succeeds, A_{(r)}P is the rank-k approximation to A_{(r)}.
5) Alice finds l ∈ [d] such that (ē^d_l 0)P = (ē^d_l 0).
6) Alice repeats steps 4-5 ⌈100 ln(log_d n) + 1⌉ times and sets c to be the most frequent l.
7) Alice rearranges the entries of column c of A′^1_{(r)} into A′^1_{(r+1)} of size ⌈n_r/d⌉ × d. She sends c to Bob.
8) Bob rearranges the entries of column c into d columns in the same way, obtaining A′^2_{(r+1)}. If A′^2_{(r+1)} has a unique zero entry (A′^2_{(r+1)})_{1,j}, he sends j to Alice and Alice checks whether (A′^1_{(r+1)})_{1,j} = 0. Otherwise, let r := r+1 and n_{r+1} := ⌈n_r/d⌉, and repeat steps 3-8.

An observation is that for every r there is at most one pair (i,j) such that max((A^1_{(r)})_{i,j}, (A^2_{(r)})_{i,j}) (or ψ((A^1_{(r)})_{i,j} + (A^2_{(r)})_{i,j})) = 0. Therefore the rank of A_{(r)} is at most k. If Alice successfully computes P, then

||A_{(r)} − A_{(r)}P||_F^2 ≤ (1+ε)||A_{(r)} − [A_{(r)}]_k||_F^2 = 0,

so the row space of P equals the row space of A_{(r)}. Notice that if (ē^d_i 0) is in the row space of A_{(r)} then, for j ≠ i, (ē^d_j 0) cannot be in the row space of A_{(r)}. Hence if Alice succeeds in step 4, she finds at most one l in step 5. Furthermore, if there exist i,j with (A_{(r)})_{i,j} = 0 and j ≤ d, then (ē^d_j 0) must be in the row space of A_{(r)}, and l will equal j in step 5. As in the proof of Theorem 4, by a Chernoff bound and a union bound, if there is a joint coordinate then Alice and Bob find it with probability at least 7/8.

The total communication consists of at most ⌈100 ln(log_d n) + 1⌉ × log_d n computations of a rank-k approximation plus the indices sent in steps 7-8. By Theorem 7, the total communication is Ω(nd) bits; thus computing P needs Ω̃(nd) bits. ∎
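The row-space argument in the proof of Theorem 6 can be checked numerically. The sketch below (our own illustration; n, d, k are arbitrary small choices) builds A_{(r)} for the max-version with a unique joint coordinate and verifies that the exact rank-k row-space projector fixes (ē^d_l 0) for exactly one column l, namely the column of the joint coordinate:

```python
import numpy as np

n, d, k = 6, 4, 3
rng = np.random.default_rng(1)

# 2-DISJ instance: x and y share exactly one 1, at flat index t.
x = rng.integers(0, 2, size=n * d)
y = rng.integers(0, 2, size=n * d) * (1 - x)   # disjoint supports so far
t = 5
x[t] = y[t] = 1                                 # the unique joint coordinate
j = t % d                                       # its column after reshaping

# Steps 2-3 of the reduction: flip bits, pad with the 1_d row and I_{k-2}.
FX, FY = (1 - x).reshape(n, d), (1 - y).reshape(n, d)
A1 = np.zeros((n + 1 + (k - 2), d + (k - 2)))
A2 = np.zeros_like(A1)
A1[:n, :d], A2[:n, :d] = FX, FY
A1[n, :d] = 1.0                 # the 1_d row
A1[n + 1:, d:] = np.eye(k - 2)  # the I_{k-2} block
A = np.maximum(A1, A2)          # entrywise max of the two shares

# A has rank exactly k, so the exact rank-k projector P spans its row
# space; (e-bar_l, 0) is fixed by P only for the joint column l = j.
U, sv, Vt = np.linalg.svd(A)
assert np.sum(sv > 1e-9) == k
P = Vt[:k].T @ Vt[:k]
fixed = []
for l in range(d):
    v = np.ones(d + (k - 2))
    v[l] = 0.0
    v[d:] = 0.0                 # the vector (e-bar_l, 0)
    if np.linalg.norm(v - v @ P) < 1e-9:
        fixed.append(l)
print(fixed)                    # the single column of the joint coordinate
```

After flipping, the max-block is all-ones except for the single joint coordinate, so the row space is spanned by 1_d, 1_d − e_j and the identity rows, matching the rank-k count in the proof.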
Finally, the following theorem reveals the relation between lower bounds for relative error algorithms and ε.

Theorem 8. Alice and Bob have A^1, A^2 ∈ R^{n×d} respectively. Let A_{i,j} = f(A^1_{i,j} + A^2_{i,j}), where f(x) = cx^p (or c|x|^p) with p ≠ 0, and suppose 1/ε^2 ≤ 2n. Computing a rank-k projection matrix P with probability 2/3 such that

||A − AP||_F^2 ≤ (1+ε)||A − [A]_k||_F^2

needs Ω(1/ε^2) bits of communication.

To prove this, we show that if a relative error approximate PCA could be computed with o(1/ε^2) bits of communication, then the GHD (gap Hamming distance) problem of [25] could be solved efficiently.

Theorem 9. [25] Each of Alice and Bob has an n-bit vector. There is a promise: the inputs are at Hamming distance either less than n/2 − c√n or greater than n/2 + c√n. If they want to distinguish these two cases with probability 2/3, the communication required is Ω(n) bits.

Proof of Theorem 8: Without loss of generality, suppose f(x) ≡ x. Alice and Bob have x, y ∈ {−1,1}^{1/ε^2} respectively, and ⟨x,y⟩ denotes the inner product of x and y. There is a promise that either ⟨x,y⟩ < −2/ε or ⟨x,y⟩ > 2/ε holds. Alice and Bob agree on the following protocol to distinguish the two cases:

1) Alice constructs a (1/ε^2 + k) × (k+1) matrix A^1 and Bob constructs A^2. The first 1/ε^2 rows of A^1 are (x_i ε, 0, ..., 0) for i = 1, ..., 1/ε^2; row 1/ε^2 + 1 is (ε/2, √2, 0, ..., 0); and the remaining k−1 rows place √(2(1+ε))/ε on the diagonal of columns 3, ..., k+1. Similarly, the first 1/ε^2 rows of A^2 are (y_i ε, 0, ..., 0); row 1/ε^2 + 1 is (−ε/2, 0, ..., 0); and the remaining rows are zero.
2) Let A = A^1 + A^2. Alice computes the rank-k projection matrix P such that

||A − AP||_F^2 ≤ (1+ε)||A − [A]_k||_F^2.

3) Let u be the first row of (I_{k+1} − P), and let v = u/|u|_2.
4) Alice checks whether v_1^2 < (1+ε)/2; if yes, she returns the case ⟨x,y⟩ > 2/ε, and otherwise the case ⟨x,y⟩ < −2/ε.

Assume Alice successfully computes P. Because the rank of A is at most k+1 and I_{k+1} − P is a rank-one projection,

||A − AP||_F^2 = ||A(I_{k+1} − P)||_F^2 = |Av|_2^2 = v^T A^T A v.

The columns of A are orthogonal: the ε/2 and −ε/2 entries of A^1 and A^2 cancel, so

A^T A = diag(|x+y|_2^2 ε^2, 2, 2(1+ε)/ε^2, ..., 2(1+ε)/ε^2).

We have ||A − [A]_k||_F^2 ≤ 2. Then

(2(1+ε)/ε^2)(1 − v_1^2 − v_2^2) = (2(1+ε)/ε^2) Σ_{i=3}^{k+1} v_i^2 ≤ v^T A^T A v ≤ (1+ε)||A − [A]_k||_F^2 ≤ 2(1+ε),

so v_1^2 + v_2^2 > 1 − ε^2.

When ⟨x,y⟩ > 2/ε, we have |x+y|_2^2 ε^2 = 2 + 2ε^2⟨x,y⟩ ≥ 2 + 4ε and ||A − [A]_k||_F^2 = 2, so

2(1+ε) = (1+ε)||A − [A]_k||_F^2 ≥ v^T A^T A v > v_1^2(2+4ε) + 2v_2^2 > 2 − 2ε^2 + 4εv_1^2,

hence v_1^2 < (1+ε)/2.

When ⟨x,y⟩ < −2/ε, we have |x+y|_2^2 ε^2 ≤ 2 − 4ε and ||A − [A]_k||_F^2 = |x+y|_2^2 ε^2, so

|x+y|_2^2 ε^2 (1+ε) = (1+ε)||A − [A]_k||_F^2 ≥ v^T A^T A v > v_1^2 |x+y|_2^2 ε^2 + 2v_2^2 = 2(v_1^2 + v_2^2) − (2 − |x+y|_2^2 ε^2)v_1^2 > 2(1−ε^2) − (2 − |x+y|_2^2 ε^2)v_1^2,

hence

v_1^2 > (2(1−ε^2) − |x+y|_2^2 ε^2 (1+ε)) / (2 − |x+y|_2^2 ε^2) ≥ (1+ε)/2.

Therefore Alice can distinguish the two cases. The only communication cost is that of computing the projection matrix P; if this were o(1/ε^2) bits, it would contradict the Ω(1/ε^2) lower bound for the gap Hamming distance problem. ∎
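The Theorem 8 construction can be sanity-checked numerically. The sketch below (our own illustration; ε and k are arbitrary choices, and we take v directly as the bottom right singular vector of A, which spans the same one-dimensional complement as the normalized first row of I_{k+1} − P whenever that row is non-zero) verifies the diagonal form of A^T A and that v_1^2 falls on opposite sides of (1+ε)/2 in the two promise cases:

```python
import numpy as np

eps, k = 0.25, 2
m = int(1 / eps ** 2)                   # x, y in {-1, +1}^m with m = 16

def build(x, y):
    """A = A^1 + A^2 from the Theorem 8 construction ((m + k) x (k + 1))."""
    A = np.zeros((m + k, k + 1))
    A[:m, 0] = (x + y) * eps            # rows (x_i*eps) plus (y_i*eps)
    A[m, 1] = np.sqrt(2.0)              # the eps/2 and -eps/2 entries cancel
    for i in range(k - 1):              # the sqrt(2(1+eps))/eps diagonal
        A[m + 1 + i, 2 + i] = np.sqrt(2.0 * (1.0 + eps)) / eps
    return A

def v1_squared(A):
    """v spans the 1-dimensional complement of the best rank-k row space."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1, 0] ** 2

x = np.ones(m)
# <x, x> = 16 > 2/eps = 8: the protocol should see v_1^2 < (1+eps)/2.
print(v1_squared(build(x, x)))
# <x, -x> = -16 < -2/eps: the protocol should see v_1^2 > (1+eps)/2.
print(v1_squared(build(x, -x)))
# The columns of A are orthogonal, e.g. for y = x:
print(np.round(build(x, x).T @ build(x, x), 6))
```

With these numbers, A^T A is diag(4, 2, 40), so the small eigenvalue sits on the second coordinate in the first case and on the first coordinate in the second case, exactly as the proof requires.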
Notice that in step 1 of the protocol, if we replace x_i ε, y_i ε, √2 and √(2(1+ε))/ε by (1/2)x_i(2ε)^{1/p}, (1/2)y_i(2ε)^{1/p}, 2^{1/(2p)} and (√(2(1+ε))/ε)^{1/p} respectively, the above reduction also holds more generally.

VIII. EXPERIMENTS AND EVALUATIONS

We ran several experiments to test the performance of our algorithm in different applications, including Gaussian random Fourier features [10], P-norm pooling [13] (square-root pooling [26]) and robust PCA. We measured the actual error given a bound on the total communication.

Setup: We implemented our algorithms in C++, using multiple processes to simulate multiple servers. For each experiment, we impose a bound on the total communication. This means we adjust several parameters, including

1) the number r of sampled rows in Algorithm 1;
2) the number t of repetitions in Algorithm 2 and the number of hash buckets;
3) the parameters B and W and the number e of repetitions in Algorithm 3,

to guarantee that the ratio of the total communication to the sum of the local data sizes is bounded. We measure the actual additive error

(||A − AP||_F^2 − ||A − [A]_k||_F^2) / ||A||_F^2

for various limits on this ratio, where P is the rank-k projection matrix output by our algorithm.

Datasets: We ran experiments on datasets from [27], [10], [13]. We chose Forest Cover (522000 × 5000 Fourier features) and KDDCUP99 (4898431 × 50 Fourier features), both mentioned in [10], to examine low rank approximation of Fourier features. Caltech-101 (9145 × 256 features) and Scenes (4485 × 256 features), mentioned in [13], were chosen to evaluate approximate PCA on image features after generalized mean pooling. Finally, we chose isolet (1559 × 617), also used by [28], to evaluate robust PCA with the ψ-function of the Huber M-estimator.

Methodologies: For Gaussian random Fourier features, we randomly distributed the original data to different servers. For Caltech-101 and Scenes, we generated matrices similarly to [13]: densely extract SIFT descriptors of 64 × 64-pixel patches every 32 pixels; use k-means to generate a codebook of size 256; and generate a 1-of-256 code for each patch. We distributed the 1-of-256 binary codes to different servers; each server locally pooled the binary codes of the same image, so the global matrix is obtained by pooling across servers. For pooling, we focused on average pooling (P=1), square-root pooling (P=2), and P-norm pooling with P=5 and P=20 to simulate max pooling. Finally, we evaluated robust PCA: to simulate noise, we randomly changed the values of 50 entries of the isolet feature matrix to be extremely large, and then arbitrarily partitioned the matrix across servers. Since the partition is arbitrary, a server may not know whether an entry is abnormally large. We used the Huber ψ-function to filter out the large entries.

We ran each experiment 5 times and measured the average error. The number of servers is 10 for Forest Cover, Scenes and isolet, and 50 for KDDCUP99 and Caltech-101. We compared the experimental results with our theoretical predictions: if we sample r rows, we predict the additive error to be k^2/r. We also compared the accuracy under different bounds on the ratio of the total communication to the sum of the local data sizes. The results are shown in Figure 1 and Figure 2; in practice, our algorithm performed better than its theoretical prediction.

[Figures 1 and 2: predicted and actual additive error at several communication ratios; visible panels include Forest Cover, KDDCUP99, and Caltech-101 at P = 1, 2, 5 and 20.]

IX. CONCLUSIONS

To the best of our knowledge, we proposed here the first non-trivial distributed protocol for the problem of computing a low rank approximation of a general function of a matrix. Our empirical results on real datasets imply that the algorithm