Draftversion February2,2008 PreprinttypesetusingLATEXstyleemulateapjv.10/10/03 ESTIMATING N-POINT CORRELATION FUNCTIONS FROM PIXELIZED SKY MAPS H. K. Eriksen1 and P. B. Lilje1 Institute ofTheoretical Astrophysics,UniversityofOslo,P.O.Box1029Blindern, N-0315Oslo,Norway A. J. Banday Max-Planck-Institutfu¨rAstrophysik,Karl-Schwarzschild-Str.1,Postfach1317, D-85741GarchingbeiMu¨nchen,Germany and 4 K. M. Go´rski2 0 JPL,M/S169/327, 4800OakGroveDrive,Pasadena, CA91109 0 Draft versionFebruary 2, 2008 2 ABSTRACT n a We develop,implementandtesta setofalgorithmsforestimating N-pointcorrelationfunctions from J pixelizedskymaps. Thesealgorithmsareslow,inthesensethattheydonotbreaktheO(NN )barrier, pix 3 andyet, they arefast enoughfor efficientanalysis ofdata sets up to severalhundredthousandpixels. 2 The typical applicationof these methods is Monte Carlo analysis using severalthousand realizations, andtherefore we organizeour programsso that the initialization costis paid only once. The effective 2 costis then reducedto a few additions per pixel multiplet (pair, triplet etc.). Further, the algorithms v waste no CPU time on computing undesired geometric configurations, and, finally, the computations 1 are naturally divided into independent parts, allowing for trivial (i.e., optimal) parallelization. 3 Subject headings: methods: statistical — cosmic microwavebackground 8 0 1 1. INTRODUCTION validwhenthedistancebetweentheaggregatedparticles 3 is much smaller than the scale of interest. It is therefore 0 Since the introduction of correlation functions into of limited use if the primary interest lies in small scales, / modern cosmology by e.g., Totsuji & Kihara (1969) and h which often is the case for CMB data. Peebles (1973), such functions have proved to be useful p The main motivation behind this work is analysis of in a large variety of situations. A few typical applica- - new high-resolution CMB maps. In this case, as in sev- o tions are the study of the distribution of galaxies and eral other applications, the data are not consisting of r clusters of galaxies in the universe (e.g., Connolly et al. t point sets (e.g., positions of galaxies) but in values of a s 2002;Maddox,Efstathiou & Sutherland 1996;Dalton et a al. 1994; Croft, Dalton, & Efstathiou 1999; Lilje & Efs- function (like temperature, polarization, or shear) given : for all pixel positions on a map covering part of or the v tathiou1988;Frieman&Gaztan˜aga1999;Jing&B¨orner full sky, where some fixed pixelization scheme has been i 1998),theanalysisofthegravitationallensingshear(e.g., X used to divide the area of concern into pixels. Further, Bernardeau, van Waerbeke, & Mellier 2003; Takada & when analyzing such maps, the main interest is usually r Jain 2003), or measurements of non-Gaussianity in the a incomparingastatistic(e.g.,acorrelationfunction)esti- cosmic microwave background (e.g., Eriksen, Banday, & matedonamapoftherealskywithalargeMonteCarlo G´orski 2002; Kogut et al. 1996). set of the same statistic estimated on several thousand Unfortunately, higher order N-point correlation func- random realizations based on some theory, e.g., Gaus- tions are also notorious for being computationally ex- sian fluctuations with a given power spectrum. Hence, pensive. In general the evaluation of an N-point cor- relation function scales as O(NN ), and therefore the one must estimate the correlation function for a set of pix many thousand similar maps with the same pixelization required CPU time soon becomes too large to handle. to do the analysis of interest. To remedy this situation several “fast” algorithms have Thekd-treeapproach(e.g.,Mooreetal.2001)isbased been proposed, in particular for analyzing discrete par- on a similar idea to the idea of aggregating particles on ticle data sets (such as galaxy catalogs). One commonly smallscales. Itorganizesnearbyparticles(orpixels)into used method is to aggregate particles on small scales, a hierarchy of bounding boxes, and uses this hierarchy and treat each of these aggregations as a single particle for rapidly discarding uninteresting pixel sets. However, (e.g.,Davisetal.1985;Kaiser1986). This methodeffec- it is not obviousthat the kd-tree approachis well suited tivelycorrespondstosmoothingthedataset,andisonly forestimatingcorrelationfunctionsfromapixelizedmap (which completely fills the k-dimensional space), unless Electronicaddress: [email protected] Electronicaddress: [email protected] bin widths much greater than the pixel size is desired. 1 AlsoatCentreofMathematicsforApplications,Universityof (Normally one sets the bin width roughly equal to the Oslo,P.O.Box1053Blindern,N-0316Oslo,Norway pixel size in order to obtain optimal resolution.) Electronicaddress: [email protected] 2 AlsoatWarsawUniversityObservatory,AlejeUjazdowskie4, A third approach relies on the Fourier-transform, and 00-478Warszawa, Poland thisapproachdrasticallyspeedsupthealgorithmswhen- Electronicaddress: [email protected] 2 Eriksen et al. everthe FastFourierTransform(FFT) is applicable(for ration). The examplesgivenatthe end ofthis paper are a two-point application, see, e.g., Szapudi, Prunet, & in fact chosen to correspond with that analysis. Colombi [2001a]). However, this method is not very at- tractive for higher-order correlation functions. In the 2. OVERVIEW three-pointcase one replacesan O(N3 ) algorithmwith AnN-pointcorrelationfunctionisdefinedastheaver- pix averysmall prefactor(multiplythreenumberstogether) age product of N temperatures5, as measured in a fixed with anO(l6 )algorithmwith anextremelylarge pref- relative configuration on the sky, max actor (compute several Wigner 3-j symbols; see, e.g., Ganguietal.[1994]). Four-pointfunctions areobviously C (θ ,...,θ )= ∆T(nˆ )···∆T(nˆ ) . (1) N 1 k 1 N (cid:28) (cid:29) even more expensive. Thus, the advantage of Fourier- methods for estimating high-order correlation functions Herenˆ ,...,nˆ spananN-pointpolygonwithgeometric 1 N is so far unclear. parameters,θ ,...,θ ,onthesky,and∆T(nˆ)isthefield 1 k InthisPaperwedescribeanewsetofalgorithmswhich value in the direction given by the unit vector nˆ. allows computation of any (small) subset of the general In all common applications we assume isotropy, and N-pointcorrelationfunctionsfromapixelizedmapofthe then the N-point correlation functions are only depen- whole celestial sphere, or a part of the celestial sphere, dent on the shape and size of the N-point polygon, and with up to severalhundredthousandpixels. These algo- notits particularpositionororientationonthesky. The rithmsareextremelywellsuitedforMonteCarlostudies, smallest number of parameters which uniquely deter- sinceweorganizetheprocessessothatthe verysubstan- mines such a polygon is k =2N −3. tialamountofinitializationCPUtimeisspentonlyonce On a pixelized map, the N-point correlationfunctions for the whole ensemble of sky maps, and not for each areestimatedassimpleaveragesoverallpixelmultiplets individual map. For ensembles of more than about one (pairs,triplets,quadruplesetc.) satisfyingthegeometric thousandrealizations,mostoftheCPUtime istherefore constraints of the particular configuration, spent on adding pixel values together, not on geometric wi ···wi ·Ti···Ti coTmhpeutgaetnioenrsa.l idea behind these algorithms is to re- CN(θ1,...,θ2N−3)= Pi 1 wiN···w1i N. (2) i 1 N place CPU-intensive inverse cosine operations by much The pixel weights, w , may bePadjusted to account for, less CPU-intensive if-tests,at the cost ofincreasing the i e.g.,noiseorboundaryeffects. Notealsothatwealways, memory requirements. First we compute the distances but often implicitly, bin our correlationfunctions with a between any two pixels in the map and store this infor- givenbinwidth,∆θ. Apixelpairofangularseparationθ mationinasetoftablesoptimizedforfastsearches(and isthendefinedto belongthe then’thbinif(n−1)∆θ ≤ compression,ifdesirable). Thesetablescanbecompared θ <n∆θ. Asseenfromequation(2),theevaluationofan withasetofcompasses,inthesensethateachcolumnof N-point correlation function is in principle very simple. atabledrawsoutacircleonthesphereofagivenradius, However,before we can multiply and add the individual centered on a given pixel. Then, by searching through pixel values, we need to determine all the correct pixel two different columns for equal entries, we generate tri- multiplets. That is less than trivial, and constitutes the angleswith the desiredsize and shape,and, if necessary, main theme of this Paper. addtwo or more suchtrianglesto produceN-pointmul- Our starting point is the brute force method for esti- tiplets. In fact, our method is equivalent to ruler and matingN-pointcorrelationfunctions. Ifwewanttoesti- compass construction of triangles. mate the three-point correlationfunction for only a sub- A similar algorithm for estimating N-point correla- set of all possible configurations, e.g., for all equilateral tion functions for point sets (e.g., galaxies) in three- triangles,thebruteforcemethodistogothroughtheset dimensional space has recently been developed by Bar- ofallpixeltripletsinthemap(resultinginanO[N3 ]al- riga & Gaztan˜aga (2002). However, there are several pix gorithm),andcompute thethree distancesbetweeneach important differences between the two methods. First, pixel pair. Each of the three distances is computed by our methods are designed particularly for Monte Carlo taking the inverse cosine of the the dot product of the simulations, in that we organize the information so that two pixel unit vectors. Knowing that one inverse cosine initialization costs are paid only once. Second, we use operationtakesabout40times asmanyCPUcyclesasa the “compasses” to determine all three edges in the tri- multiplication, we see that not only does the brute force angle, while Barriga & Gaztan˜aga (2002) only use them algorithmscale as O(N3 ), but its prefactoris also very for constraining two of the edges. Finally, our meth- pix high. However,themosttimeconsumingshortcomingof ods are particularly designed to take advantage of the HEALPix3 nestedpixelizationscheme(Go´rski,Hivon,& the brute-force approach is that we have to go through everysinglepixeltripletexistinginthemap,eventhough Wandelt 1999), although they can (at a considerable ef- we may only be interested in a very small subset of all ficiency cost) be generalized to any pixelization. configurations. Thus, most of the CPU time is wasted A first application of a preliminary version of these on computing uninteresting information. algorithmsto estimate the three- andfour-pointcorrela- tion functions of the COBE-DMR maps was presented Ourfirsttaskistoproduce“compasses”forpickingout therightpixels. Havingdecidedonanumberofexternal by Eriksen et al. (2002), and a thorough application of thealgorithmstotheWMAP4 (Bennettetal.2003)data parameters, such as pixel resolution (parameterized by will shortly be published (Eriksen et al. 2004, in prepa- 5 Since our own applications are to CMB data, especially with the goal of studying non-Gaussianities, we here concentrate on 3 Availablefromhttp://www.eso.org/science/healpix CMB temperature maps, but ∆T can of course be replaced by 4 WilkinsonMicrowaveAnisotropyProbe anyscalarquantity. Estimating N-Point CorrelationFunctions from Pixelized Sky Maps 3 i Table α j γ −table β−table i k j α T Co T γ β able mmon element ableγ β k β−table k γ −table Geometric view Algorithmic view Fig. 1.—Trianglesarefoundbysearchingthroughcolumnsofthetwo-pointtables pairwise: Ifwewanttofindatrianglewithedges of lengthα,β andγ,welookupapixelpair,iandj,inthetablecorrespondingtotheangleα,andletthesetwopixelsdefinethetriangle’s baseline. Next,welookupthei’thcolumnoftheβ-table,andthej’thcolumnoftheγ-table. Anycommonelementsinthecolumnswill correspond to an acceptable third vertex. Obviously, scanning through any such column corresponds to drawing a circle on the sky, and themethodisthereforeequivalenttoconstructingtrianglesusingapairofcompasses. N of HEALPix), separationbin width, ∆θ =π/N , triangleofthedesiredsizeandshape. Thissearchisfully side bin and total number of pixels in the data set, N (usually equivalent to drawing out circles on the sky, and look- pix by introducing a mask), we make “compasses” by going ing for intersections. The correspondence between the through the set of all pixel pairs, compute the distance geometric and algorithmic views is illustrated in Figure betweenthetwopixelsineachpair,anddeterminewhich 1. separation bin the pair belongs to. We then insert the Oncethisprocesshasbeenperformedforallbasepairs pair into a special table (in the following called a two- in the α-table, we have also found all triangles of the point table) for that bin, a table which is optimized for correct shape and size. At this point we may therefore fast searches; the i’th column of the k’th table contains forget all about expensive geometric computations, and allpixelswithadistanceintheinterval[(k−1)∆θ,k∆θ) concentrate on multiplying and adding the pixel triplets frompixelnumberi,sortedaccordingtoincreasingpixel together as fast as possible. number. Inotherwords,thei’thcolumnofthek’thtable Let us now go through each step of the underlying is a “compass” that draws out a circle on the HEALPix algorithms in detail, paying particular attention to real- sphere with radius (k−1/2)∆θ and width ∆θ, centered worldproblems,suchashowtodealwithlimitedphysical onpixelnumber i. These two-pointtables arecomputed RAM and CPU-time. once and for all, and stored in individual files for later 3. CONSTRUCTINGTHE“COMPASSES” use. Alreadyatthispointwecanestimatethetwo-point correlationfunctionwithunprecedentedefficiency,essen- The first step in our N-point correlation algorithm is tially performing just one addition for each pixel pair. to construct a set of two-point distance tables. Each of The particular algorithm for this is described in §5. these tables corresponds to a set of “compasses” with However,ourmaingoalis to estimate higher-orderN- fixed radius. point correlation functions, and then we have to find all For our idea to be of practical use, the two-point ta- triangles, quadrilaterals etc. on the sky of some given bles must meet two requirements: first of all, each table geometric configuration. Here it is worth noting that must be small enough to comfortably fit into the phys- once we are able to construct triangles, all higher-order ical memory of the computer. Second, the tables must structures may be generatedsimply by putting triangles facilitate fast searches for “neighboring” pixels (defined together. Therefore we here focus on how to construct as pixels within a ring of a given radius and width) for triangles. any given center pixel. As noted previously, construction of triangles is per- Since we always bin our correlation functions with a formedbytheuseof“compasses”: supposethatwewant givenbinwidth∆θ,thefirstrequirementismetbysplit- to find all triangles with edge lengths of α, β and γ. We ting the full Npix ×Npix distance matrix up according then pick out one pair of pixels, i and j, from the two- to these bins, simply by letting the bin width be small pointα-table,andlookupthei’thcolumnintheβ-table enough. Thus, all pixel pairs belonging to the same bin andthej’thcolumnintheγ-table. Anycommonentries are collected into one table, which later is stored as a in these two tables will have the correct distances from separate file. thetwobasepixels,andtogetherthethreepointsspana Thenextquestionishowtoorganizethesetables. The important point is to be able to efficiently draw out a 4 Eriksen et al. Two-point table nr. 1 have a map with 100000 pixels, and we want to gener- 0 1 2 0 1 2 ateacompletesetoftwo-pointtables. Doing itstraight- 0 0 1 2 forwardly(i.e.,bykeepingallinformationstoredinmem- 6 8 6 oryatonce),wewouldneedatleast1000002×4bytes≈ 1 1 0 1 8 Row numbers of 0's 10GB of memory, and few current computers have that 2 2 1 0 much physical RAM available. On the other hand, this 3 1 2 2 number is not prohibitive in terms of disk storage. The 4 3 2 1 Two-point table nr. 2 question is therefore how to generate the two-point ta- bles. Once this operation is completed, we can store 5 1 3 3 0 1 2 them on a hard drive, compress them if desirable, and Row numbers of 1's 6 0 1 0 1 2 1 then readonly a few at a time when actually estimating 7 2 2 3 3 6 4 the correlationfunctions. Weutilize theHEALPixpixelizationwithnested pixel 8 0 0 1 5 8 ordering. Thispixelizationhasthehighlydesirableprop- erty that any low-resolution pixel may be divided into Binned distance matrix four high-resolution pixels, where the pixel numbers of Fig. 2.— The two-point tables are constructed by first com- thehigh-resolutionpixelsaregivenby4i+k,ibeing the putingthedistancesbetween anytwopixels,andthenstoringthe pixelnumberofthelow-resolutionpixelandk =0,...,3. corresponding bin numbers in a large Npix×Npix array. Then In short the generation process can be described as a thetwo-pointtablesareextractedfromthismainarraybycollect- ingallentrieswiththesamebinneddistanceintoseparate tables, three-step operation: maintainingthesortedorderoftherow(ie.,pixel)numbers. First we generate a set of tables with low resolution, choosinga resolutionlow enoughto keepall information simultaneously in the computer’s physical memory. Second, each table is expanded to a higher resolu- circle around any given center pixel. That is, given any tion by replacing the low-resolution pixel numbers by position on the sky, we must be able to easily look up the high-resolution pixel numbers. Since each mother all pixels with a given distance from that position. For pixel has four children, the output array is four times this purpose we choose a simple two-dimensional array, as wide and deep as the input array. All pairs in the in which each column corresponds to one single ring on output array have distances in approximately the cor- the sky, centered on the pixel p0, listed in the zeroth rect range, and therefore the output two-point table is row. Thus, all other pixels, pj, in that column satisfies almost a valid high-resolution table. However, since the the relation, θmin ≤arccos(pˆ0·pˆj)<θmax, where pˆi are boundary between two rings is smoothed when the reso- unit vectors andθi are the angular limits of the bin. We lution is increased, one may find that a number of pixel will later be searching for identical entries in two differ- pairs actually belong to the neighboring bins after ex- entcolumns in orderto generatetriangles,andtherefore pansion. This has to be accounted for, and to do so we we already at this point sort the neighbors of each col- sort the pixel pairs in the the high-resolutionoutput ta- umnaccordingtopixelnumber. ThisfacilitatesO(Nrow) ble according to actual bin number, and output these as searches, as opposed to O(N2 ) for unsorted columns. incomplete two-point tables. row In most experiments we introduce a mask, either in Finally,aftersubjectingalllow-resolutiontablestothis order to exclude noise-contaminatedpixels, or simply to operation,wecomplete theexpansionprocessbycollect- limit our region of interest. Whatever reason, we incor- ing all partial tables of the same output bin number porate such a mask already at the construction level, by into complete two-point tables. These two last expan- only including accepted pixels in the tables. Thus, the sion steps are repeated as many times as necessary to zerothrowisnothingbutalistofallacceptedpixels,and obtain the desired resolution. all pixels in the table are guaranteed to be accepted by In the following three subsections, we describe each of themask. Notonlydoesthisremovemanyredundantif- these steps in more detail. The first stepis illustrated in tests later on, but more importantly, it makes sure that Figure2whilethetwolaststepsareillustratedinFigure the algorithms and disk usage scale as O(N2 ), rather 3. pix thanO(N4 ),whereN isthenumberofaccepted pix- side pix 3.1. In-memory generation els. For small regions in a high-resolution map, the dif- ference is crucial. Assume first that we actually do have enough mem- The remaining details are mostly of cosmetic na- ory available to simultaneously store all N2 pixel pairs pix ture. First, since the two-point tables are simple two- in the computer’s RAM, i.e., that there are no memory dimensional arrays, there will be a small amount of un- restrictions. In that case the procedure is quite simple. usedspaceinmanycolumns. This extraspaceispadded First we allocate an N ×N array, and run through pix pix with -1’s, distinguishing those entries from valid pixel the set of all pixel pairs. For each pair we compute the numbers. Secondly, we also store all external parame- angular distance θ between the two pixels i and j, and ij ters (N , N , ordering, bin number etc.) together mark both the j’th row of the i’th column and the i’th side bin with the two-pointtable inanexternalfile, making each row of the j’th column by the integer part of θ /∆θ. ij fileanindependententity,inthespiritofobject-oriented Once this process is completed, the desired two-point programming. tables can easily be extracted from the main array — Although the two-point tables are quite simple to de- the k’th table is found by scanning each column for the scribe,theyaresurprisinglycomplicatedtogenerate,the number k, while storing the row numbers of these en- fundamental problem being their size. Suppose that we triesintheoutputarray. Notethattheoutputtwo-point Estimating N-Point CorrelationFunctions from Pixelized Sky Maps 5 Low-resolution two-point tables 2 89898989 10101010 11111111 1) Substitute pixel numbers High-resolution, but inaccurate two-point tables 2) Split expanded tables according to bin number Accurate, but incomplete two-point tables 3) Merge partial tables Complete and accurate high-resolution two-point tables Fig. 3.—Expansionofthetwo-pointtablesiscarriedoutintwosteps. Firsteachentryoftheoriginal,low-resolutiontableissubstituted with a4×4 sub-matrix, containing the pixel numbers 4i+k, where iis the original pixel number and k =0,...,3. Then this matrix is splitintoasetofpartialtwo-pointtables,accordingtothecorrectbinnumbersofeachpixelpair. Finally,thepartialtwo-pointtablesare mergedintocomplete, outputtwo-pointtables ofhigherresolution. table is automatically sorted by this method, maintain- pixel. For the same reason the search should not be im- ing the O(N2 ) scaling of the algorithm (rather than plemented by a QuickSearch-algorithm, but rather with pix O[N2 logN ] as would be the case if explicit sorting a straight-forward linear method. While the former is pix pix superiorfor generalsearches,the latter has considerably was necessary). lessoverheadper step, andis thereforemuchfaster ifwe As isindicatedinFigure2,wedo notcountthe center only want to move a few steps, as is the case here. pixel as a valid neighbor to itself in the zeroth bin. Or The above trick is in fact very useful in most two- speaking in terms of correlation functions, we never let point correlation function algorithms, and if the nested a pixel be multiplied with itself. For most experiments HEALPix scheme is employed, one can often reduce the the noise in each pixel is quite high, but, fortunately, it total CPU time by 60-70%. Another advantage is that is also usually independent from pixel to pixel. In order the bin limits may be set arbitrarily, and therefore the to eliminate this noise bias, we exclude powers of single methodseamlesslyaccommodatesforalternativebinning pixel values from the correlation functions. This is a schemes. In particular, Gauss-Legendre binning (ie., general problem for all quadratic estimators. placing the bins at the roots of the Legendre polynomi- Let us also make one note on how to compute the als)isoftenusefulifonewantstointegratethetwo-point binned distance between two pixels efficiently: the stan- correlationfunctiontofindthepowerspectrum(Szapudi dard method is to take the dot product of the two et al. 1991b). pixel center vectors, compute the inverse cosine of that product, divide the angular distance by the bin width, 3.2. Expanding a two-point table and keep the integer part. This method is quite slow The output from the above process is a completely because of the inverse cosine operation which requires validsetoftwo-pointtables,but,unlessmassiveamounts about 40times the CPU time of a multiplication. When of memory is available, they only describe a low resolu- using the HEALPix nested pixelization, a faster alter- tion map. A computer with 1GB of RAM can typically native is available by means of a linear search: one only handle up to 15000 pixels through the above de- simply allocates an array, binlim, of length N + 1 bin scription, and a super-computer with shared memory of which contains the cosine of the bin limits, such that ∼ 100GB up to about 100000 pixels. Usually we there- binlim[i] = cos([i − 1]dθ) and binlim[N bin+1] = forehavetoexpandthestructuresetsegmentbysegment cos(N dθ). Then the correct bin number, bin, may bin in order to obtain the desired resolution and pixel num- be found by searching for the array position defined by ber. binlim[bin]≤cosθ <binlim[bin+1]. ij As mentioned above, increasing the resolution of a The crucial point here is that the bin position must HEALPix map is particularly simple in the nested pix- be stored between consecutive searches: since the pixels elization—eachmotherpixelisreplacedbyfourchildren j and j +1 are very close to each other on the sky in pixels having pixel numbers given by 4i+k, where i is the nested HEALPix pixelization, the distance between the pixel number of the mother pixel and k = 0,...,3. i and j+1 will be almost the same as the distance be- Theexpansioninresolutionisthereforeeasilyperformed tween i and j. Therefore one does not have to move bysubstitutingeachentryoftheoriginaltwo-pointtable more than a few steps to find the correct bin for each by a 4×4 sub-array (the zeroth row has to be treated 6 Eriksen et al. individually, though). However, since pairs that orig- from both pixel i and j, and therefore the three pixels inate from the same mother pixel may belong to dif- together span an equilateral triangle. Once we find such ferent high-resolution bins, we cannot simply substitute a triplet we store it in an auxiliary array for later use. the low-resolution pixel numbers by the above method, We then keep scanning until the bottom of one column and hope that the resulting array will be a valid high- is reached. Note that since the columns are sorted ac- resolution two-point table. cording to pixel number, the searches scale as O(N ), row Usually we also want to increase the number of bins rather than as O(N2 ) in the unsorted case. row during such an expansion, considering that a higher res- For our ranges of pixel numbers in the maps and bin olution supports a smaller bin width. For these two widths, we typically find between two and six valid pix- reasons we have to compute all angular distances once els in each such scan. If we are only interested in scalar again to determine to which bins the various pairs be- valuedsymmetricthree-pointcorrelationfunctions,allof long. Based on this information we extract a set of par- these triplets are acceptable, and in such cases we store tial two-point tables, similar to what was done in the all of them for later use. However, if we want to es- first, low-resolution step, and store the resulting partial timate for instance the four-point correlation function, tables on file, labeled by two numbers, bin from and asymmetric three-point correlation functions or spin-2 bin to. field correlation functions (for polarization maps), we need to know the orientation of the triangle, or in other 3.3. Merging the partial tables into complete words, we need to know whether we traverse the trian- high-resolution tables gleclockwiseorcounter-clockwisewhenlistingthepixels The output from the above process is a set of arrays as {i, j, l}. In these cases we introduce the convention characterized by the set of parameters {Nfrom, Nto , that the two first pixels form the base line and the third side side Nfrom, Nto , bin from, bin to}. The final task is to pixel is positioned above the base line. In other words, bin bin (nˆ ×nˆ )·nˆ >0. collect all partial tables with the same bin to (contain- i j l The aboveprocedure results in a 3×n array,in which ing pixel pairs in the same output bin, but originating eachcolumnisanacceptablepixeltripletforthe current from different low-resolution bins) into complete two- geometric configuration (in practice we organize the the pointtables. Thisoperationobviouslyamountstomerg- array slightly differently, to accommodate for fast sum- ing the partial array columns, while maintaining the al- ming over the third pixel — see §5 for details). At this ready sorted relationships, and inserting the final result point we can estimate the three-point correlation func- into a large output array. Since the partial tables are tionby multiplying the three pixelstogether,andsumit already sorted according to pixel number, no additional all up. However, if we want to estimate the four-point sortingisrequired. Thisisactuallytrueforalltheabove correlationfunction we need to go one step further. steps–noexplicitsortingisnecessarytogeneratesorted We find the simplest example of a more complicated output two-point tables. geometry when we want to estimate the rhombic four- Atthis pointwehaveobtainedasetofhigh-resolution pointcorrelationfunction. Thisconfigurationconsistsof two-pointtables storedondisk files. And alreadyatthis two equilateral triangles “glued” together on one edge. pointwemayestimatethetwo-pointcorrelationfunction We can therefore use our list of equilateral triangles to extremelyefficiently. However,ourmaingoalisto study find the set of all such quadruples. The idea here is the higher-order correlation functions, and we therefore simply that the two triangles must have the same base proceed to the construction of triangles by means of the line,butinreversedorder(sincetheymusthaveopposite two-point tables, before turning to applications. orientation). Thus, we must search for pairs of triangles 4. CONSTRUCTINGN-POINTMULTIPLETSONTHE for whichi =j andj =i , where i is the firstvertex 1 2 1 2 1 HEALPIXSPHERE of the first triangle, j is the second vertex of the first 1 Our algorithm for constructing triangles on the triangle, etc. The four required pixels are then given by HEALPix sphere is fully equivalent to ruler and com- the two common vertices and the two third pixels. pass construction of triangles (i.e., where an equilateral This time we store the sets of four pixels in a 4×n triangle is constructed by first drawing a base line, then array, listed so that the two first entries of each column fromagivenfirstvertexonthislinedrawingacirclewith forms the base line, the third pixel lies above the base thegivenlengthasradius,andthendrawinganewcircle line and the fourth lies below the base line. withthesameradiusfromthesecondvertexattheinter- Note that any N-point polygon may be built up by section between the line and the first circle, finding the triangles, and the same ideas may therefore be used third vertex of the triangle at the intersection between for computing any N-point correlation function. In the the two circles), the only difference is that the physical above examples we have focused on polygons for which compass is replaced with the two-point tables. alledges have the same length. However,the same algo- Wanting tofind allequilateral triangleshavinga given rithmcaneasilybeusedtoalsoproducegeneraltriangles, pixeliasonevertexandedgesgivenbyagivenbinnum- and thereby general quadrilaterals. Suppose, e.g., that ber k, the first step is to look up the i’th column of the we want to find all triangles with edges given by the bin k’th two-point table. All entries in this column are lo- numbers {k,l,m}. In that case we would use the k’th cated in the correctdistance from pixel i. Next, we pick two-point table to find pixel pairs, i and j, which forms an arbitrary pixel, j, from this column. Together these the base line of the triangle. Then we would simulta- two pixels constitute the base line of the triangle. Now neously scan the i’th column of the l’th two-point table we scan the i’th and the j’th column simultaneously, and the j’th column of the m’th two-point table to find searching for identical entries. Any common entries in the desired third pixels. the two columns will be located in the correct distance Estimating N-Point CorrelationFunctions from Pixelized Sky Maps 7 5. CALCULATINGTHECORRELATIONFUNCTIONS If Nthird = 2, we here have to perform two multipli- We start with the case where we want to estimate the cat(cid:10)ions an(cid:11)d two additions for each base pair, a total of fouroperations,whileinthe“unrolled”organization,the two-pointcorrelationfunctionforseparationgivenbybin corresponding numbers are six multiplications and two number k, for which we already have constructed a two- additions. Thus the total CPU time is roughly halved pointtableasdescribedabove. Ifthenumberofdifferent by introducing the linear organization. The same trick pairs with separation in this bin is N , then the two- pair is even more powerful when estimating four-point corre- point correlationfunction is estimated by lationfunctions,sinceweinthatcasecansumoverboth 1 Npix−1 the upper and the lower groups of third pixels. Having, C (k)= ∆T(table[0,i])· ∆T(table[j,i]). e.g.,twopixelsinbothgroups(orinotherwords,having 2 Npair Xi=0 Xj<i four different quadruples), would require twelve multi- plications and four additions for each base pair in the (3) “unrolled” organization, but only three multiplications Explicitly, we sum up all field values in each column, and three additions in the linear one. and multiply that sum with the zeroth row value. It is Notealsothatthislinearorganizationisverysimpleto difficulttoimagineamoreefficientmethodforestimating construct, it falls out naturally from the method we use thetwo-pointcorrelationfunctionthanthis;webasically to construct the triangles in the first place where we fix perform only one addition for each pixel pair. Note also ourinterestontwobaselinepixels,andscanthroughthe that we only sum over pairs for which i < j in order to two corresponding columns for third pixels. Thus, pro- avoid double counting, and thus the total CPU time is ducingthelinearorganizationisonlyamatterofstoring reduced by a factor of 2. the results from the scans appropriately. For Monte Carlo studies the initialization cost is paid once only, andin such casesthe aboveequationis where 6. ADDITIONALFEATURES the massive amounts of CPU time are spent. Although Beforebenchmarkingthealgorithms,webrieflydiscuss our method intrinsically is quite slow, being an O(N2 ) pix a few ideas of interest to anyone who wants to use these method, it is for this reason probably as fast as a two- algorithmsinpractice,andwhichareimplementedinour point correlation function algorithm will ever be, with- software. Firstweshowhowtooptimizethe cacheusage out introducing either Fourier-methods or large-scale in the case of Monte Carlo studies, and then we briefly smoothing. commentonhowtoparallelizethesealgorithms. Finally, Letusnowtakealookatthehigher-orderedcorrelation we show how to compress the two-point tables in order functions, and suppose that for any given configuration to save disk space. wehavealreadycomputedanN×n arraycontainingall pixel multiplets satisfying the binned geometric require- 6.1. Mask-based pixel numbering and cache usage ments. Here N is the order of the correlation function Fromtheprevioussectionsitshouldbeclearthatthese and n is the number of pixel multiplets. With such a N-point algorithms are very close to being optimal for table the correlation function is estimated by Monte Carlo simulations. In applications where the ini- n tialization cost is small compared to the computational 1 C (k)= ∆T(table[1,i])···∆T(table[N,i]). costs, there is not much left to gain from an algorithmic N n Xi=1 point of view — we have removedall geometry from the (4) computations,and allthat is left is to multiply field val- When the bin width is very small, there is typically ues together as fast as possible, and sum it all up. How- onlyonepolygoncorrespondingtoeachbaseline,andin ever, there are two main issues left to consider, cache such cases it is not possible to accelerate equation (4). optimizationandparallelization. Let us startby looking However, when the bin width is comparable to or larger at the cache usage. thanthepixelsize,oneoftenfindsthatfortwogivenbase In a Monte Carlo application, we want to estimate line pixels there may be several acceptable third pixels. the two-point correlation function for, say, 1000 differ- If we want to estimate the three-point correlation func- ent maps. The naive approach is simply to loop over tion in such cases, we may organize our arrays a little the maps, applying equation (3) repeatedly. However, differently: rather than listing each triplet individually, this has two negative effects. First of all, the CPU has we may first list the two base line pixels, then the num- to move the two-point table information from the main ber of third pixels, and finally list all those third pixels. memoryintothecache1000times. Secondly,thenumber Obviously, a one-dimensional array is more suitable for of hits per memory page in the map array may be less this organizationthan the original two-dimensional one, thanoptimal,sinceconsecutiverowsmaycontainwidely since the number of third pixels may vary strongly from separated pixel numbers. On the other hand, both the set to set. fact that the columns of the two-point tables are sorted, If the average number of third pixels is larger than and the nested HEALPix pixelization, improve the situ- about 1.5 (as often is the case in real-life applications), ation. this linear organization will both result in a smaller If we have considerable amounts of physical memory tripletarray,andspeedupthecomputations,sinceequa- available, we can do better by analyzing all maps simul- tion (4) is replaced by taneously. In this case we store all 1000 maps row-wise inatwo-dimensionalarray,sothateachcolumncontains 1 Nbase Ntihird themapvaluesofonesinglepixel. Withthislargesuper- C (k)= ∆Ti∆Ti· ∆Ti . (5) array we can loop over the maps in the innermost loop, 3 n 1 2 3,j Xi=1 Xj=1 rather than in the outermost. Thus we are guaranteed 8 Eriksen et al. almostperfectcacheusage. Also,weonlyneedtolookup 6.3. Compression of the two-point tables the two-point table once, further reducing the memory If disk space is a limiting factor, one may be discour- traffic. aged by the substantial size of the two-point tables. In Thequestionisthen,dowehaveenoughmemoryavail- such cases one may wish to implement file compression. abletostoreall1000maps? Foramapof,say,threemil- One common algorithm for this purpose is the Huffman lion pixels, the answer is clearly “no”, since that would (1952) coding method. The idea behind this algorithm require about 12 GB of memory, assuming single pre- istoassignshortbitstringstofrequentlyoccurringsym- cision floating point numbers. On the other hand, for bols (i.e., integers in our setting), and long bit strings maps with less than 50000 pixels we only need 400 MB, torarelyoccurringsymbols. The averagenumber ofbits and that is not a large amount for current computers. per symbol is thus minimized. Applied to our two-point Even with our algorithms, estimation of N-point cor- tables, it appears at first sight as though little is gained; relation functions is CPU intensive, and maps of sev- since any pixel number has to occur somewhere in each eralmillionpixelsareoutofreachfortoday’scomputers. column, each number is more or less equally frequent in Thus,CPUtime is the limiting factor,ratherthanavail- eachtwo-pointtable. However,one property of our two- ablememory. Forthis reasonweoften chooseto analyze point tables still makes Huffman coding most efficient, high-resolution maps region by region (by the introduc- namely the fact that each column is sorted according to tionofmasks),whereeachregioncontainsnomorethan, increasing pixel number. Therefore, if we subtract the e.g., 50000 pixels6. In these cases we can discard all the values in two consecutive rows, we get a small positive unused map information, and retain the included pixels number, and the total histogram of the table is strongly only. skewed toward low values. This new, transformed array Explicitly, we now introduce a new pixel ordering may be compressed very efficiently. scheme,countingrelativetothe mask ratherthanto the A simple way of taking advantage of this idea is to resolutionparameterN . Themapsareconvertedfrom side do the following: first we transform the original array thestandardHEALPixnumberingtothemask-basedor- by subtracting consecutive rows, and store the resulting deringbyextractingallincludedpixels,whileconserving arrayinanexternalfile. Thenwecompressthatfileusing their relative order. Thus, the first included pixel is as- anyavailablecompressionutility(whichusuallyrelieson signedthenewpixelnumber1,thesecondincludedpixel a combination of Huffman and run-length coding), e.g., is assignedthe new pixelnumber 2 etc. By storingthese gzip. This method routinely reduces the required disk conversion rules in an external table, we may easily go space by 80-90%for standard combinations of N and side backandforthbetweenmask-basedandmap-basedpixel N . bin ordering. In order to optimize the cache usage for high- In the current implementation there are special rou- resolution maps, we convert all maps and two-point ta- tinesforreadingandwritingtwo-pointtables. Compres- bles to mask-based pixel ordering, and then proceed as sion may therefore be enabled simply by adding a few before. appropriate lines in each of those routines (subtracting consecutive rows, making a system call to gunzip/gzip 6.2. Parallelization of computational routines etc.), leaving the rest of the source code unchanged. The final issue to consider from an efficiency point of 7. BENCHMARKINGTHEALGORITHMS view is parallelization. There are two different cases, We haverunseveraltests to determine whatresources the geometric computations and the actual correlation arerequiredforapplicationoftheabovealgorithms. The function calculations. parameters of these tests mimic those used in the small- The latter part is quite trivial, considering that we scaleWMAPanalysisofEriksen et al.(2004),inthatwe already in the algorithmic design have divided the prob- let our regionof interest be a disk on the HEALPix sky, lem into completely separate parts — usually we want ′ pixelizedatN =512,andwithabinwidthof7.2. The to compute many bins of each correlation function, and side only difference is that we let the radius of the disk vary then we let each processor work on its own geometric ◦ ◦ ◦ between 5 and 15 (in steps of 2.5) in order to see how configuration. This obviouslyresults inperfectspeed-up the algorithms scale with respect to number of pixels. (i.e., doubling the number of processors halves the total These disks correspond to pixel numbers between 5940 CPU time). and53464,andweshouldthereforegetagoodpictureof Parallelizationof the constructionof the two-pointta- how the algorithmsbehave in real-worldsituations. The bles is not quite as simple, at least not for the very first computer used in the following tests is a Compaq EV6 step,inwhichwegenerateafullN ×N table. How- pix pix Alpha server. ever,one possibilityis to leteachprocessorworkonsep- We start by generating a set of two-point tables for arate columns, and then merge the partial results into each disk, and measure the CPU time and disk space one complete array in the end. Once this task is per- spent in this process. Next we estimate the two-, three- formed, the problem is once again divided into separate and four-point correlation functions from 100 simulated parts, and the expansion steps may be trivially paral- maps, and find the number of pixel multiplets (pairs, lelized, like the N-point calculations. triplets and quadruples), the CPU time per realization 6 Thismethod was recently usedby Eriksenetal.(2004) inan and initialization cost as a function of Npix. analysis of the WMAP data. The largest scales were probed by 7.1. Wall-clock time and disk space for generating the first degrading the maps from Nside = 512 to Nside = 64, and then computing each correlation function in 1◦ bins on a set of two-point tables complementary hemispheres. The smaller scales were studied by partitioningtheskyinto10◦disks,oneachofwhichthethree-point The results from the two-point table generation tests functionwasestimated. are shown in Figure 4. In panel (a) we see that the Estimating N-Point CorrelationFunctions from Pixelized Sky Maps 9 10000 Total time CPU time 10 IO time c) n time (se 1000 ace (GB) o p ati k s mput Dis 1 o C 100 0.3 0.6 1 3 5 0.6 1 3 5 Number of pixels (104) Number of pixels (104) (a) (b) Fig. 4.—a)Thetimerequiredforgeneratingasetoftwo-pointtablesfordisksofvaryingsize. NotethattheI/Otime(i.e.,diskactivity) isofthesamemagnitudeastheCPUtime,reflectingthemassiveamountsofdatathatareprocessedintheseoperations. Neitherofthese functions follows a power law. b) The disk space requirements for storing the two-point tables. This function follows a nearly perfect O(N2 )relation,withaneffectiveindexof1.99. pix time spent on this process does not follow the expected computations; a data set of 50000 pixels requires less O(N2 ) relationship. In fact, the CPU time scales as than 20 GB of space, a trivial amount of disk space for pix O(N1.78), while the input/output time (i.e., disk activ- current computers. If we want to include as many pix- pix els as 200000,we wouldneed 160GB, andthen perhaps ity) doesn’t even follow a power law. The explanationis compression might be needed, as described in the previ- the same in both cases: each of these quantities follows a relation of the generalform, T =a +b N +c N2 . ous section. Then the disk space requirements would be i i i pix i pix lessthan30GB,andonceagainwellwithinthe limits of Algorithmic overhead and initialization costs are typi- current computers. cally linear in N , and these are likely to dominate for pix We may therefore conclude that the first step in the a small number of pixels. In other words, b ≫ c. What two-point table strategy, the construction of the two- we see in Figure 4(a) is the manifestation of this. For point tables, do not pose a serious problem neither in the relatively modest number of pixels we consider, we terms of wall-clock time nor disk space, when analyzing do not reach the asymptotic N2 region, and therefore pix data sets which contain fewer than a few hundred thou- the effective scaling coefficient in this region is smaller sand pixels. than 2. Thisresultmayseemsomewhatcounter-intuitive;nor- 7.2. Estimating correlation functions mallyalowscalingcoefficientis interpretedina positive Finally, we check that the amount of time spent direction. Thatis notthe casehere. Rather,itmaysug- on actually computing the correlation functions is not gestthatthecurrentimplementationissub-optimal,and prohibitively large. This experiment is carried out thatthereis roomfor improvements. However,itshould also be noted that a truly O(N2 ) implementation can by measuring both the initialization and the addi- pix tion/multiplication time for each of the two-, three- and probably only be devised if one is able to store the full four-point correlation functions, for one particular con- N ×N distance matrix in the physical RAM simul- pix pix figurationoneachofthedisksdescribedabove. Thecon- taneously,andinthat caseourmethods aresuperfluous. ◦ figurationswerearbitrarilychosentobethe3 two-point, Theimportantconclusion,however,isthatthecurrent equilateral three-point, and rhombic four-point configu- algorithmsareinfactsufficientlyefficienttohandle data rations. setswithatleastafew hundredthousandpixels,consid- The results are shown in Figure 5. In panel (a) we see ering that it only takes about an hour to generate a set the numbers of pairs, triplets and quadruples contribut- of two-point tables for 50000 pixels. ingtothebinasafunctionofN . Notethatthesenum- In Figure 4(b) the corresponding disk space usage is pix bers increase almost linearly with the number of pixels shown. Here we do see a virtually perfect O(N2 ) scal- pix in the data set, with best-fit power-law indexes of 1.15, ing, a result which should come as no surprise. The lin- 1.23 and 1.34, respectively. This is an intuitive result; ear overheadfor storing the two-pointtables is basically doubling the area of interest also doubles the number of just a few extra rows (for storing the pixel centers and polygons that fit into it. The deviation from a perfect paddingextra spacewith-1’s),andthis is normallyneg- linear relationship is caused by boundary effects. ligible compared to the total number of “neighboring” Giventhis linearrelationship,we alsoexpectthe CPU pixels. time required for multiplying and adding the pixel mul- Notealsothatdiskspaceisnotacriticalfactorinthese tiplets together to increase linearly with N . In Figure pix 10 Eriksen et al. 12 240 240 Two-point function Three-point function Two-point function 6Number of pixel multiplets (10) 48 TFohurere-p-pooinint tf ufunncctitoionn Initialization time (sec)11628000 Four-point function CPU time per 1000 realizations (sec)11628000 TFohurere-p-pooinint tf ufunncctitoionn 0 1 2 3 4 5 6 0 1 2 3 4 5 6 00 1 2 3 4 5 6 Number of pixels (104) Number of pixels (104) Number of pixels (104) (a) (b) (c) Fig. 5.— a)The number ofpixel multiplets as afunction of Npix forthe three differentfunction types. Note that these numbers scale almost linearlywith Npix, boundary effects make up the difference. b) The CPU time required forgenerating the set of equilaterals and rhombi,andc)theCPUtimeforcomputingoneconfigurationof eachN-pointcorrelationfunctionfor1000realizations. 5(b) we see that this is the case. It is important to note Carlorealizationsona disk containingabout50000pix- that the CPU time for computing any single bin of any els, with 100 configurations (distance bins) in the two- N-point correlation function increases linearly with the point correlation function, 500 different configurations number of pixels in the data set. in the three-point correlation function, and 1000 differ- It is well known that the evaluation of a general N- ent configurations in the four-point correlationfunction. point correlation function scales as O(NN ). However, We assume that the time for computing each of these pix ◦ this relation only applies to the computation of the full configurationsis constant and equal to the 3 configura- correlation function, i.e., when including all NN pixel tions. Whilethisobviouslyisnotstrictlytrue(morepixel pix multiplets with all possible multiples of the bin size as multipletsareassociatedwithlargerconfigurations),this lengths of the edges. In most applications, this is found particular size is a good average of what is used in real- unfeasible, and one uses only a small subset of all avail- world CMB analysis. ablegeometricconfigurationsoftheN vertices. Inthose First we have to generate the two-point tables for the casesthe computationalscalingsare less severe;the cost disk, a process which is performed only once. The total for computing one single bin increases only proportion- wall-clock time for this is about 3 hours, and we need ally with the number of pixels in the data set, and that about 20 GB of disk space to store the tables. Next, time is not radically different for the two-, three-, and about 1 GB of RAM is required for storing the maps, four-pointcorrelationfunctions. Onlythenumberofdif- and an additional few hundred MB for generating the ferent available configurations change from correlation triangles and quadrilaterals. This amount of disk space function to correlation function, not the cost for com- and RAM is not a major obstacle for current worksta- puting one given separation bin. tions or super-computers. InFigure 5(b)and5(c)we see thatfor a dataset with Next,theestimationofthetwo-pointcorrelationfunc- 50000 pixels it takes about 90 seconds to find all equi- tionstakesabout100·5·30sec≈5hours,whilethethree- lateral triangles with 3◦ edges, and 200 seconds to find point configurationsrequire 500·(150sec+5·150sec)≈ all rhombi. In general it takes about twice as long to 125hours. Finally, the four-point configurations need find all quadruples as finding all triangles, an obviously 1000·(200sec+5·220sec) ≈ 400hours. None of these logical result — quadrilaterals are constructed by merg- numbers are prohibitively large, and we can therefore ing two triangles. However, it is worth noting that this conclude that the proposed algorithms are able to han- extracostcanbeavoidedwhencomputingquadrilaterals dle real-world data sets using reasonable computational constructed from isosceles triangles, since one then does resources. not have to worryabout the orientationof the triangles. 8. CONCLUSIONS Asmentionedintheintroduction,ourmethodsarede- We have described a set of algorithms for estimating signed for Monte Carlo studies, and in Figure 5(c) the N-pointcorrelationfunctionsfromapixelizedmap. The CPUtimeforanalyzing1000realizationssimultaneously fundamental idea is first to locate all pixel multiplets is shown, excluding initialization. We see that for a (pairs, triplets, quadruples etc.) corresponding to one 50000pixeldatasetittakesabout30secondstoestimate givengeometricconfiguration,storetheseinatable,and one two-point function bin, 150 seconds for one three- then finally use that information to efficiently add and point configuration and 210 seconds for one four-point multiply the pixel values together. The pixel multiplets configuration. For the higher-ordercorrelationfunctions arefoundbysearchingthroughasetoftwo-pointtables, these numbers are about the same as the initialization motivatedbyrulerandcompassconstructionoftriangles. costs, and we therefore conclude that it is crucial to or- The construction of the two-point tables only take a ganize our programs so that the initialization costs are small amount of CPU time (typically a few hours), and paid once only. since this operation is performed only once, it is com- A typical example of an N-point correlation function pletely negligible compared to the following operations. analysis could be the following: We analyze 5000 Monte The storage requirements of these tables are well within