Submitted to the Annals of Statistics
arXiv: arXiv:0000.0000

A STATISTICALLY AND NUMERICALLY EFFICIENT INDEPENDENCE TEST BASED ON RANDOM PROJECTIONS AND DISTANCE COVARIANCE

BY CHENG HUANG†,‡ AND XIAOMING HUO†,‡

Georgia Institute of Technology‡

Test of independence plays a fundamental role in many statistical techniques. Among the nonparametric approaches, the distance-based methods (such as the distance-correlation-based hypothesis testing for independence) have numerous advantages, compared with many other alternatives. A known limitation of the distance-based methods is that their computational complexity can be high. In general, when the sample size is $n$, the order of computational complexity of a distance-based method, which typically requires computing all pairwise distances, can be $O(n^2)$. Recent advances have discovered that in the univariate cases, a fast method with $O(n\log n)$ computational complexity and $O(n)$ memory requirement exists. In this paper, we introduce a test of independence based on random projections and distance correlation, which achieves nearly the same power as the state-of-the-art distance-based approach, works in the multivariate cases, and enjoys $O(Kn\log n)$ computational complexity and $O(\max\{n,K\})$ memory requirement, where $K$ is the number of random projections. Note that a saving is achieved when $K < n/\log n$. We name our method the Randomly Projected Distance Covariance (RPDC). The statistical theoretical analysis takes advantage of some techniques on random projections that are rooted in contemporary machine learning. Numerical experiments demonstrate the efficiency of the proposed method, relative to several competitors.

1. Introduction. Test of independence is a fundamental problem in statistics, with much existing work, including the maximal information coefficient (MIC) [21], the copula-based measures [23, 26], the kernel-based criterion [8], and the distance correlation [29, 27], which motivated our current work.
Note that the above works, as well as ours, focus on the detection of the presence of dependence, which can be formulated as statistical hypothesis testing problems. On the other hand, interesting developments (e.g., [20]) aim at a more general framework for interpretable statistical dependence, which is not the goal of this paper.

Distance correlation, proposed by [29], is an indispensable method in the test of independence. The direct implementation of distance correlation takes $O(n^2)$ time, where $n$ is the sample size. The time cost of distance correlation can be substantial when the sample size is just a few thousand. When the random variables are univariate, there exist efficient numerical algorithms of time complexity $O(n\log n)$ [11]. However, for multivariate random variables, we have not found any efficient algorithms in existing papers after an extensive literature survey.

Independence tests of multivariate random variables can have a wide range of applications. In many problem settings, as mentioned in [30], each experimental unit will be measured multiple times, resulting in multivariate data. Researchers are often interested in exploring potential relationships among subsets of these measurements. For example, some measurements may represent attributes of physical characteristics while others represent attributes of psychological characteristics. It may be of interest to determine whether there exists a relationship between the physical and the psychological characteristics. A test of independence between pairs of vectors, where the vectors may have different dimensions and scales, becomes crucial. Moreover, the number of experimental units, or equivalently, the sample size, can be massive, which requires the test to be computationally efficient. This work will meet the demand for numerically efficient independence tests of multivariate random variables.
The newly proposed test of independence between two (potentially multivariate) random variables $X$ and $Y$ works as follows. First, both $X$ and $Y$ are randomly projected onto one-dimensional spaces. Then the fast computing method for the distance covariance between a pair of univariate random variables is adopted to compute a surrogate distance covariance. The above two steps are repeated numerous times. The final estimate of the distance covariance is the average of all the aforementioned surrogate distance covariances.

For numerical efficiency, we will show (in Theorem 3.1) that the newly proposed algorithm enjoys $O(Kn\log n)$ computational complexity and $O(\max\{n,K\})$ memory requirement, where $K$ is the number of random projections and $n$ is the sample size. On the statistical efficiency, we will show (in Theorem 4.19) that the asymptotic power of the test of independence utilizing the newly proposed statistic is as efficient as its original multivariate counterpart, which achieves the state-of-the-art rates.

The rest of this paper is organized as follows. In Section 2, we review the definition of distance covariance, its fast algorithm in the univariate cases, and related distance-based independence tests. Section 3 gives the detailed algorithm for the distance covariance of random vectors and the corresponding independence tests. In Section 4, we present some theoretical properties of distance covariance and the asymptotic distribution of the proposed estimator. In Section 5, we conduct numerical examples to compare our method against others in the existing literature. Some discussions are presented in Section 6. We conclude in Section 7. All technical proofs, as well as formal presentations of the algorithms, are relegated to the appendix when appropriate.

Throughout this paper, we adopt the following notation. We denote $c_p = \frac{\pi^{(p+1)/2}}{\Gamma((p+1)/2)}$ and $c_q = \frac{\pi^{(q+1)/2}}{\Gamma((q+1)/2)}$ as two constants, where $\Gamma(\cdot)$ denotes the Gamma function.
We will also need the following constants: $C_p = \frac{c_1 c_{p-1}}{c_p} = \frac{\sqrt{\pi}\,\Gamma((p+1)/2)}{\Gamma(p/2)}$ and $C_q = \frac{c_1 c_{q-1}}{c_q} = \frac{\sqrt{\pi}\,\Gamma((q+1)/2)}{\Gamma(q/2)}$. For any vector $v$, let $v^t$ denote its transpose.

2. Review of Distance Covariance: Definition, Fast Algorithm, and Related Independence Tests. In this section, we review some related existing works. In Section 2.1, we recall the concept of distance variances and correlations, as well as some of their properties. In Section 2.2, we discuss the estimators of distance covariances and correlations, as well as their computation. We present their applications in testing of independence in Section 2.3.

2.1. Definition of Distance Covariances. Measuring and testing the dependency between two random variables is a fundamental problem in statistics. The classical Pearson's correlation coefficient can be inaccurate and even misleading when nonlinear dependency exists. Paper [29] proposes a novel measure, the distance correlation, which is exactly zero if and only if two random variables are independent. A limitation is that if the distance correlation is implemented based on its original definition, the corresponding computational complexity can be as high as $O(n^2)$, which is not desirable when $n$ is large.

We review the definition of the distance correlation in [29]. Let us consider two random variables $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$, $p \ge 1$, $q \ge 1$. Let the complex-valued functions $\phi_{X,Y}(\cdot,\cdot)$, $\phi_X(\cdot)$, and $\phi_Y(\cdot)$ be the characteristic functions of the joint density of $X$ and $Y$, the density of $X$, and the density of $Y$, respectively. For any function $\phi$, we denote $|\phi|^2 = \phi\bar{\phi}$, where $\bar{\phi}$ is the conjugate of $\phi$; in words, $|\phi|$ is the magnitude of $\phi$ at a particular point. For vectors, we use $|\cdot|$ to denote the Euclidean norm. In [29], the definition of the distance covariance between random variables $X$ and $Y$ is

(2.1)  $\mathcal{V}^2(X,Y) = \int_{\mathbb{R}^{p+q}} \frac{|\phi_{X,Y}(t,s) - \phi_X(t)\phi_Y(s)|^2}{c_p c_q\, |t|^{p+1} |s|^{q+1}} \, dt\, ds,$

where the two constants $c_p$ and $c_q$ have been defined at the end of Section 1.
The distance correlation is defined as

$\mathcal{R}^2(X,Y) = \frac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X,X)\,\mathcal{V}^2(Y,Y)}}.$

The following property has been established in the aforementioned paper.

THEOREM 2.1. Suppose $X \in \mathbb{R}^p$, $p \ge 1$, and $Y \in \mathbb{R}^q$, $q \ge 1$, are two random variables. The following statements are equivalent:
(1) $X$ is independent of $Y$;
(2) $\phi_{X,Y}(t,s) = \phi_X(t)\phi_Y(s)$, for any $t \in \mathbb{R}^p$ and $s \in \mathbb{R}^q$;
(3) $\mathcal{V}^2(X,Y) = 0$;
(4) $\mathcal{R}^2(X,Y) = 0$.

Given a sample $(X_1,Y_1),\ldots,(X_n,Y_n)$, we can estimate the distance covariance by replacing the population characteristic functions with the sample characteristic functions: for $i = \sqrt{-1}$, $t \in \mathbb{R}^p$, $s \in \mathbb{R}^q$, we define

$\hat\phi_X(t) = \frac{1}{n}\sum_{j=1}^n e^{i X_j^t t}, \quad \hat\phi_Y(s) = \frac{1}{n}\sum_{j=1}^n e^{i Y_j^t s}, \quad \text{and} \quad \hat\phi_{X,Y}(t,s) = \frac{1}{n}\sum_{j=1}^n e^{i X_j^t t + i Y_j^t s}.$

Consequently, one can have the following estimator for $\mathcal{V}^2(X,Y)$:

(2.2)  $\mathcal{V}_n^2(X,Y) = \int_{\mathbb{R}^{p+q}} \frac{|\hat\phi_{X,Y}(t,s) - \hat\phi_X(t)\hat\phi_Y(s)|^2}{c_p c_q\, |t|^{p+1} |s|^{q+1}} \, dt\, ds.$

Note that the above formula is convenient for defining a quantity; however, it is not convenient for computation, due to the integration on the right-hand side. In the literature, other estimates have been introduced and will be presented in the following.

2.2. Fast Algorithm in the Univariate Cases. The paper [16] gives an equivalent definition for the distance covariance between random variables $X$ and $Y$:

(2.3)  $\mathcal{V}^2(X,Y) = E[d(X,X')d(Y,Y')] = E[|X-X'|\,|Y-Y'|] - 2E[|X-X'|\,|Y-Y''|] + E[|X-X'|]\,E[|Y-Y'|],$

where the double-centered distance $d(\cdot,\cdot)$ is defined as

$d(X,X') = |X-X'| - E_X[|X-X'|] - E_{X'}[|X-X'|] + E[|X-X'|],$

where $E_X$, $E_{X'}$, and $E$ are expectations over $X$, $X'$, and $(X,X')$, respectively. Motivated by the above definition, one can give an unbiased estimator for $\mathcal{V}^2(X,Y)$. The following notation will be utilized: for $1 \le i,j \le n$,

(2.4)  $a_{ij} = |X_i - X_j|, \quad b_{ij} = |Y_i - Y_j|, \quad a_{i\cdot} = \sum_{l=1}^n a_{il}, \quad b_{i\cdot} = \sum_{l=1}^n b_{il}, \quad a_{\cdot\cdot} = \sum_{k,l=1}^n a_{kl}, \quad \text{and} \quad b_{\cdot\cdot} = \sum_{k,l=1}^n b_{kl}.$
It has been proven [28, 11] that

(2.5)  $\Omega_n(X,Y) = \frac{1}{n(n-3)} \sum_{i \ne j} a_{ij} b_{ij} - \frac{2}{n(n-2)(n-3)} \sum_{i=1}^n a_{i\cdot} b_{i\cdot} + \frac{a_{\cdot\cdot}\, b_{\cdot\cdot}}{n(n-1)(n-2)(n-3)}$

is an unbiased estimator of $\mathcal{V}^2(X,Y)$. In addition, a fast algorithm has been proposed [11] for the aforementioned sample distance covariance in the univariate cases, with complexity order $O(n\log n)$ and storage $O(n)$. We list the result below for reference purposes.

THEOREM 2.2 (Theorem 3.2 & Corollary 4.1 in [11]). Suppose $X_1,\ldots,X_n$ and $Y_1,\ldots,Y_n \in \mathbb{R}$. The unbiased estimator $\Omega_n$ defined in (2.5) can be computed by an $O(n\log n)$ algorithm.

In addition, as a byproduct, the following result is established in the same paper.

COROLLARY 2.3. The quantity

$\frac{a_{\cdot\cdot}\, b_{\cdot\cdot}}{n(n-1)(n-2)(n-3)} = \frac{\sum_{k,l=1}^n a_{kl} \sum_{k,l=1}^n b_{kl}}{n(n-1)(n-2)(n-3)}$

can be computed by an $O(n\log n)$ algorithm.

We will use the above result in our test of independence. However, as far as we know, in the multivariate cases, there does not exist any work on fast algorithms of the order of complexity $O(n\log n)$. This paper will fill in this gap by introducing an order $O(Kn\log n)$ complexity algorithm in the multivariate cases.

2.3. Distance Based Independence Tests. In [29], an independence test is proposed using the distance covariance. We summarize it below as a theorem, which serves as a benchmark. Our test will be aligned with the following one, except that we introduce a new test statistic, which can be more efficiently computed, and it has comparable asymptotic properties with the test statistic that is used below.

THEOREM 2.4 ([29], Theorem 6). For potentially multivariate random variables $X$ and $Y$, a prescribed level $\alpha_s$, and sample size $n$, one rejects the independence if and only if

$\frac{n\,\mathcal{V}_n^2(X,Y)}{S_2} > \left(\Phi^{-1}(1-\alpha_s/2)\right)^2,$

where $\mathcal{V}_n^2(X,Y)$ has been defined in (2.2), $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution, and

$S_2 = \frac{1}{n^4} \sum_{i,j=1}^n |X_i - X_j| \sum_{i,j=1}^n |Y_i - Y_j|.$
Moreover, let $\alpha(X,Y,n)$ denote the achieved significance level of the above test. If $E[|X| + |Y|] < \infty$, then for all $0 < \alpha_s < 0.215$, one can show the following:

$\lim_{n\to\infty} \alpha(X,Y,n) \le \alpha_s, \quad \text{and} \quad \sup_{X,Y} \left\{ \lim_{n\to\infty} \alpha(X,Y,n) : \mathcal{V}(X,Y) = 0 \right\} = \alpha_s.$

Note that the quantity $\mathcal{V}_n^2(X,Y)$ that is used above, as in [29], differs from the one that will be used in our proposed method. As mentioned, we use the above as an illustration for distance-based tests of independence, as well as of the theoretical (or asymptotic) properties that such a test can achieve.

3. Numerically Efficient Method for Random Vectors. This section is made of two components. We present a random-projection-based distance covariance estimator, which will be proven to be unbiased, with a computational complexity that is $O(Kn\log n)$, in Section 3.1. In Section 3.2, we describe how the test of independence can be done by utilizing the above estimator. For users' convenience, stand-alone algorithms are furnished in the appendix.

3.1. Random Projection Based Methods for Approximating Distance Covariance. We consider how to use a fast algorithm for univariate random variables to compute or approximate the sample distance covariance of random vectors. The main idea works as follows: first, project the multivariate observations onto some random directions; then, use the fast algorithm to compute the distance covariance of the projections; finally, average the distance covariances from the different projecting directions.

More specifically, our estimator can be computed as follows. For potentially multivariate $X_1,\ldots,X_n \in \mathbb{R}^p$, $p \ge 1$, and $Y_1,\ldots,Y_n \in \mathbb{R}^q$, $q \ge 1$, let $K$ be a predetermined number of iterations. We do the following:

(1) For each $k$ ($1 \le k \le K$), randomly generate $u_k$ and $v_k$ from Uniform($\mathcal{S}^{p-1}$) and Uniform($\mathcal{S}^{q-1}$), respectively. Here $\mathcal{S}^{p-1}$ and $\mathcal{S}^{q-1}$ are the unit spheres in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively, and Uniform($\mathcal{S}^{p-1}$) is the uniform measure (or distribution) on $\mathcal{S}^{p-1}$.
(2) Let $u_k^t X$ and $v_k^t Y$ denote the projections of $X$ and $Y$ onto the spaces that are spanned by the vectors $u_k$ and $v_k$, respectively. That is, we have

$u_k^t X = (u_k^t X_1, \ldots, u_k^t X_n), \quad \text{and} \quad v_k^t Y = (v_k^t Y_1, \ldots, v_k^t Y_n).$

Note that the samples $u_k^t X$ and $v_k^t Y$ are now univariate.

(3) Utilize the fast (i.e., order $O(n\log n)$) algorithm that was mentioned in Theorem 2.2 to compute the unbiased estimator in (2.5) with respect to $u_k^t X$ and $v_k^t Y$. Formally, we denote

$\Omega_n^{(k)} = C_p C_q\, \Omega_n(u_k^t X, v_k^t Y),$

where $C_p$ and $C_q$ have been defined at the end of Section 1.

(4) The above three steps are repeated $K$ times. The final estimator is the average:

(3.1)  $\Omega_n = \frac{1}{K} \sum_{k=1}^K \Omega_n^{(k)}.$

To emphasize the dependency of the above quantity on $K$, we sometimes use the notation $\Omega_{n,K} = \Omega_n$.

See Algorithm 1 in the appendix for a stand-alone presentation of the above method. In the light of Theorem 2.2, we can handily declare the following.

THEOREM 3.1. For potentially multivariate $X_1,\ldots,X_n \in \mathbb{R}^p$ and $Y_1,\ldots,Y_n \in \mathbb{R}^q$, the order of computational complexity of computing the aforementioned $\Omega_n$ is $O(Kn\log n)$, with storage $O(\max\{n,K\})$, where $K$ is the number of random projections.

The proof of the above theorem is omitted, because it is straightforward from Theorem 2.2. The statistical properties of the proposed estimator $\Omega_n$ will be studied in the subsequent section (specifically in Section 4.4).

3.2. Test of Independence. By a later result (cf. Theorem 4.19), we can apply $\Omega_n$ in independence testing. The corresponding asymptotic distribution of the test statistic $\Omega_n$ can be approximated by a Gamma($\alpha$, $\beta$) distribution, with $\alpha$ and $\beta$ given in (4.6). We can compute the significance level of the test statistic by permutation and conduct the independence test accordingly. Recall that we have potentially multivariate $X_1,\ldots,X_n \in \mathbb{R}^p$ and $Y_1,\ldots,Y_n \in \mathbb{R}^q$. Recall that $K$ denotes the number of Monte Carlo iterations in our previous algorithm. Let $\alpha_s$ denote the prescribed significance level of the independence test. Let $L$ denote the number of random permutations that we will adopt.
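Before describing the permutation procedure, here is a minimal Python sketch of the estimator in (3.1) that the test is built on. It is an illustration rather than the authors' implementation: the univariate statistic $\Omega_n$ of (2.5) is evaluated by its direct $O(n^2)$ formula instead of the $O(n\log n)$ algorithm of Theorem 2.2, uniform directions on $\mathcal{S}^{p-1}$ are generated by normalizing standard Gaussian vectors, and the helper names (`omega_n`, `C`, `rpdc`) are ours; NumPy is assumed.

```python
import numpy as np
from math import pi, sqrt, gamma

def omega_n(x, y):
    """Unbiased distance-covariance estimator (2.5) for univariate samples x, y.

    Direct O(n^2) evaluation; Theorem 2.2 gives an O(n log n) algorithm
    computing the same quantity."""
    n = len(x)
    a = np.abs(x[:, None] - x[None, :])   # a_ij = |x_i - x_j|
    b = np.abs(y[:, None] - y[None, :])   # b_ij = |y_i - y_j|
    # i != j is automatic in the first sum, since the diagonal of a*b is zero
    term1 = (a * b).sum() / (n * (n - 3))
    term2 = 2 * (a.sum(axis=1) @ b.sum(axis=1)) / (n * (n - 2) * (n - 3))
    term3 = a.sum() * b.sum() / (n * (n - 1) * (n - 2) * (n - 3))
    return term1 - term2 + term3

def C(p):
    """C_p = sqrt(pi) * Gamma((p+1)/2) / Gamma(p/2), end of Section 1."""
    return sqrt(pi) * gamma((p + 1) / 2) / gamma(p / 2)

def rpdc(X, Y, K, rng=None):
    """Randomly Projected Distance Covariance: the average (3.1) of
    C_p * C_q * Omega_n(u_k' X, v_k' Y) over K random projections."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    _, q = Y.shape
    total = 0.0
    for _ in range(K):
        u = rng.standard_normal(p); u /= np.linalg.norm(u)  # u ~ Uniform(S^{p-1})
        v = rng.standard_normal(q); v /= np.linalg.norm(v)  # v ~ Uniform(S^{q-1})
        total += C(p) * C(q) * omega_n(X @ u, Y @ v)
    return total / K
```

As a sanity check on the constants, $C_1 = 1$ (in the univariate case the projection is $\pm X$, so no correction is needed) and $C_2 = \pi/2$.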
We would like to test the null hypothesis $\mathcal{H}_0$: $X$ and $Y$ are independent, against its alternative. Recall that $\Omega_n$ is our proposed estimator in (3.1). The following algorithm describes an independence test which applies permutations to generate a threshold.

(1) For each $\ell$, $1 \le \ell \le L$, one generates a random permutation of $Y$: $Y^{\star,\ell} = (Y_1^\star, \ldots, Y_n^\star)$.
(2) Using the algorithm in Section 3.1, one can compute the estimator $\Omega_n$ as in (3.1) for $X$ and $Y^{\star,\ell}$; denote the outcome by $V_\ell = \Omega_n(X, Y^{\star,\ell})$. Note that under the random permutations, $X$ and $Y^{\star,\ell}$ are independent.
(3) The above two steps are executed for all $\ell = 1,\ldots,L$. One rejects $\mathcal{H}_0$ if and only if we have

$\frac{1 + \sum_{\ell=1}^L I(\Omega_n > V_\ell)}{1 + L} > 1 - \alpha_s.$

See Algorithm 2 in the appendix for a stand-alone description.

One can also use the information of an approximate asymptotic distribution to estimate a threshold in the aforementioned independence test. The following describes such an approach. Recall that we have random vectors $X_1,\ldots,X_n \in \mathbb{R}^p$, $p \ge 1$, and $Y_1,\ldots,Y_n \in \mathbb{R}^q$, $q \ge 1$, the number of random projections $K$, and a prescribed significance level $\alpha_s$ that has been mentioned earlier.

(1) For each $k$ ($1 \le k \le K$), randomly generate $u_k$ and $v_k$ from Uniform($\mathcal{S}^{p-1}$) and Uniform($\mathcal{S}^{q-1}$), respectively.
(2) Use the fast algorithm in Theorem 2.2 to compute the following quantities:

$\Omega_n^{(k)} = C_p C_q\, \Omega_n(u_k^t X, v_k^t Y),$
$S_{n,1}^{(k)} = C_p^2 C_q^2\, \Omega_n(u_k^t X, u_k^t X)\, \Omega_n(v_k^t Y, v_k^t Y),$
$S_{n,2}^{(k)} = C_p \frac{a_{\cdot\cdot}^{u_k}}{n(n-1)}, \quad S_{n,3}^{(k)} = C_q \frac{b_{\cdot\cdot}^{v_k}}{n(n-1)},$

where $C_p$ and $C_q$ have been defined at the end of Section 1, and in the last equation, $a_{\cdot\cdot}^{u_k}$ and $b_{\cdot\cdot}^{v_k}$ are defined as follows:

$a_{ij}^{u_k} = |u_k^t(X_i - X_j)|, \quad b_{ij}^{v_k} = |v_k^t(Y_i - Y_j)|, \quad a_{\cdot\cdot}^{u_k} = \sum_{k,l=1}^n a_{kl}^{u_k}, \quad b_{\cdot\cdot}^{v_k} = \sum_{k,l=1}^n b_{kl}^{v_k}.$

(3) For the aforementioned $k$, one randomly generates $u_k'$ and $v_k'$ from Uniform($\mathcal{S}^{p-1}$) and Uniform($\mathcal{S}^{q-1}$), respectively. Use the fast algorithm that is mentioned in Theorem 2.2 to compute the following:

$\Omega_{n,X}^{(k)} = C_p^2\, \Omega_n(u_k^t X, u_k'^t X), \quad \Omega_{n,Y}^{(k)} = C_q^2\, \Omega_n(v_k^t Y, v_k'^t Y),$

where $C_p$ and $C_q$ have been defined at the end of Section 1.
(4) Repeat the previous steps for all $k = 1,\ldots,K$. Then we compute the following quantities:

$\Omega_n = \frac{1}{K}\sum_{k=1}^K \Omega_n^{(k)}, \quad \bar S_{n,1} = \frac{1}{K}\sum_{k=1}^K S_{n,1}^{(k)}, \quad \bar S_{n,2} = \frac{1}{K}\sum_{k=1}^K S_{n,2}^{(k)},$
$\bar S_{n,3} = \frac{1}{K}\sum_{k=1}^K S_{n,3}^{(k)}, \quad \Omega_{n,X} = \frac{1}{K}\sum_{k=1}^K \Omega_{n,X}^{(k)}, \quad \Omega_{n,Y} = \frac{1}{K}\sum_{k=1}^K \Omega_{n,Y}^{(k)},$

(3.2)  $\alpha = \frac{1}{2} \cdot \frac{\bar S_{n,2}^2\, \bar S_{n,3}^2}{\frac{K-1}{K}\Omega_{n,X}\Omega_{n,Y} + \frac{1}{K}\bar S_{n,1}},$

(3.3)  $\beta = \frac{1}{2} \cdot \frac{\bar S_{n,2}\, \bar S_{n,3}}{\frac{K-1}{K}\Omega_{n,X}\Omega_{n,Y} + \frac{1}{K}\bar S_{n,1}}.$

(5) Reject $\mathcal{H}_0$ if $n\Omega_n + \bar S_{n,2}\bar S_{n,3} > \mathrm{Gamma}(\alpha,\beta; 1-\alpha_s)$; otherwise, accept it. Here $\mathrm{Gamma}(\alpha,\beta; 1-\alpha_s)$ is the $1-\alpha_s$ quantile of the distribution $\mathrm{Gamma}(\alpha,\beta)$.

The above procedure is motivated by the observation that the asymptotic distribution of the test statistic $n\Omega_n$ can be approximated by a Gamma distribution, whose parameters can be estimated by (3.2) and (3.3). A stand-alone description of the above procedure can be found in Algorithm 3 in the appendix.

4. Theoretical Properties. In this section, we establish the theoretical foundation of the proposed method. In Section 4.1, we study some properties of the random projections and the subsequent average estimator. These properties will be needed in studying the properties of the proposed estimator. We study the properties of the proposed distance covariance estimator ($\Omega_n$) in Section 4.2, taking advantage of the fact that $\Omega_n$ is a U-statistic. It turns out that the properties of the eigenvalues of a particular operator play an important role. We present the relevant results in Section 4.3. The main properties of the proposed estimator ($\Omega_n$) are presented in Section 4.4.

4.1. Using Random Projections in Distance-Based Methods. In this section, we will study some properties of distance covariances of randomly projected random vectors. We begin with a necessary and sufficient condition of independence.

LEMMA 4.1. Suppose $u$ and $v$ are points on the hyper-spheres: $u \in \mathcal{S}^{p-1} = \{u \in \mathbb{R}^p : |u| = 1\}$ and $v \in \mathcal{S}^{q-1}$. We have that random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ are independent if and only if $\mathcal{V}^2(u^t X, v^t Y) = 0$ for any $u \in \mathcal{S}^{p-1}$, $v \in \mathcal{S}^{q-1}$.
The proof is relatively straightforward. We relegate a formal proof to the appendix. This lemma indicates that the independence is somewhat preserved under projections. The main contribution of the above result is to motivate us to think of using random projections to reduce the multivariate random vectors into univariate random variables. As mentioned earlier, there exist fast algorithms of distance-based methods for univariate random variables.

The following result allows us to regard the distance covariance of random vectors of any dimension as an integral of the distance covariances of univariate random variables, which are the projections of the aforementioned random vectors. The formulas in the following lemma provide the foundation for our proposed method: the distance covariances in the multivariate cases can be written as integrals of distance covariances in the univariate cases. Our proposed method essentially adopts the principle of Monte Carlo to approximate such integrals. We again relegate the proof to the appendix.

LEMMA 4.2. Suppose $u$ and $v$ are points on the unit hyper-spheres: $u \in \mathcal{S}^{p-1} = \{u \in \mathbb{R}^p : |u| = 1\}$ and $v \in \mathcal{S}^{q-1}$. Let $\mu$ and $\nu$ denote the uniform probability measures on $\mathcal{S}^{p-1}$ and $\mathcal{S}^{q-1}$, respectively. Then, we have for random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$,

$\mathcal{V}^2(X,Y) = C_p C_q \int_{\mathcal{S}^{p-1} \times \mathcal{S}^{q-1}} \mathcal{V}^2(u^t X, v^t Y) \, d\mu(u)\, d\nu(v),$

where $C_p$ and $C_q$ are the two constants that are defined at the end of Section 1. Moreover, a similar result holds for the sample distance covariance:

$\mathcal{V}_n^2(X,Y) = C_p C_q \int_{\mathcal{S}^{p-1} \times \mathcal{S}^{q-1}} \mathcal{V}_n^2(u^t X, v^t Y) \, d\mu(u)\, d\nu(v).$
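Lemma 4.2 can be checked numerically by Monte Carlo, which is exactly the principle the proposed method relies on. The sketch below is illustrative and not part of the paper's algorithms: it computes the sample distance covariance $\mathcal{V}_n^2$ by the usual double-centering formula, then approximates the integral above by averaging $C_p C_q\, \mathcal{V}_n^2(u_k^t X, v_k^t Y)$ over $K$ random directions; the helper names (`Cconst`, `dcov2`) are ours and NumPy is assumed.

```python
import numpy as np
from math import pi, sqrt, gamma

def Cconst(p):
    # C_p = sqrt(pi) * Gamma((p+1)/2) / Gamma(p/2), end of Section 1
    return sqrt(pi) * gamma((p + 1) / 2) / gamma(p / 2)

def dcov2(X, Y):
    """Biased sample distance covariance V_n^2 via double centering.

    X, Y: arrays of shape (n, p) and (n, q)."""
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(0)
n, p, q, K = 64, 3, 2, 2000
X = rng.standard_normal((n, p))
Y = X[:, :q] + 0.5 * rng.standard_normal((n, q))   # a dependent pair

direct = dcov2(X, Y)                               # multivariate V_n^2(X, Y)
proj = 0.0
for _ in range(K):
    u = rng.standard_normal(p); u /= np.linalg.norm(u)   # u ~ Uniform(S^{p-1})
    v = rng.standard_normal(q); v /= np.linalg.norm(v)   # v ~ Uniform(S^{q-1})
    proj += dcov2((X @ u)[:, None], (Y @ v)[:, None])
approx = Cconst(p) * Cconst(q) * proj / K
# direct and approx agree up to Monte Carlo error of order K^{-1/2}
```

The agreement is exact in expectation because the double-centering operation is linear in the pairwise distances, and each distance satisfies $|x| = C_p \int_{\mathcal{S}^{p-1}} |u^t x| \, d\mu(u)$.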