Reverse Nearest Neighbor Heat Maps: A Tool for Influence Exploration Yu Sun∗, Rui Zhang§∗, Andy Yuan Xue∗, Jianzhong Qi∗, Xiaoyong Du† ∗Department of Computing and Information Systems, University of Melbourne †Renmin University of China and Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China Email: ∗{sun.y, rui.zhang, andy.xue, jianzhong.qi}@unimelb.edu.au †[email protected] 6 Abstract—We study the problem of constructing a reverse 1 nearest neighbor (RNN) heat map by finding the RNN set of 0 everypointinatwo-dimensionalspace.BasedontheRNNsetof 2 apoint,weobtainaquantitativeinfluence(i.e.,heat)forthepoint. b Theheatmapprovidesaglobalviewontheinfluencedistribution e in the space, and hence supports exploratory analyses in many F applications such as marketing and resource management. To construct such a heat map, we first reduce it to a problem 2 calledRegionColoring(RC),whichdividesthespaceintodisjoint regions within which all the points have the same RNN set. We ] (a) RNNheatmap (b) Satellite map B then propose a novel algorithm named CREST that efficiently solves the RC problem by labeling each region with the heat Fig.1. RNNHeatMapofNewYorkCity D value of its containing points. In CREST, we propose innovative of everypointin the space. Comparingto existing studies [8], . techniquestoavoidprocessingexpensiveRNNqueriesandgreatly [10],[22],[26],[31]whichgiveonlythepointsorregionswith s c reduce the number of region labeling operations. We perform thehighestinfluence,theRNNheatmapenablesexploringthe [ detailedanalysesonthecomplexityofCRESTandlowerbounds influenceofthewholespacewhileconsideringqualitativefac- of the RC problem, and prove that CREST is asymptotically tors at any instant during the exploration.Consider a scenario 2 optimal in the worst case. Extensive experiments with both real whereRNN heatmapsareusedtoassist selectinglocationsof v and synthetic data sets demonstrate that CREST outperforms 9 alternative algorithms by several orders of magnitude. self-pickup and drop-offservice points for couriercompanies. 8 Let O be the potential clients and F be the existing service 3 I. INTRODUCTION points. For simplicity, let the size of the RNN set measure the 0 influence, i.e., heat (although any other functions related to 0 In market analysis, urban design, and facility placement, the RNN set can be used). Fig. 1(a) shows such an RNN heat . we often need to select a suitable location for new facilities mapforthe New YorkCity,whose satellite imageis shownin 2 such as a warehouse or a hospital. Emerging event-based 0 Fig.1(b).Thedarkerregionsindicatehigherheatvalues.Such social networks such as Meetup and Whova also need to 6 a heat map will allow the exploration of influential regions selectanappropriatelocationsuitablefortheevent-participant 1 while considering qualitative factors as discussed above. Note : arrangement.These problemsare called the location selection that regions with high influence values do not necessarily v problem, which is usually a multi-criteria decision making correspondtoregionsofhighclientdensitybecauseweneedto i X process involving various quantitative and qualitative factors. consider the competition from existing facilities. For example A quantitativefactor usually consideredis the influenceof the r in Fig. 2, the upper left corner has the highest client density, a location, which is commonlymeasured by the reverse nearest but the most influential and the 4th influential regions are in neighbor(RNN)setofthelocation[12],[22],[26].Giventwo themiddle,denotedbythetwograyrectangles(the2ndandthe setsofpointsO andF,theRNNsetofalocationpisasubset 3rd most influential regions are also in the middle near these of O that are closest to p among all the points in F. There two but too small to be visible). Without the RNN heat map, are manyways to measure the influence of p by the RNN set. it is very difficult or impossible to explore all these different Straightforwardmeasuresconsideronlythesizeortotalweight choices and make well-informed decisions. of the set [6], [26], [31]. Other measures consider various attributes of the data points in O and F, such as the capacity To construct such a heat map, we need to obtain the constraint [16], [22], social relationship [19], [29], etc. While influence value of every point in the space. We call such a we can model the quantitative factors precisely by numbers, problem the RNN heat map (RNNHM) problem: we can not do the same to many qualitative factors such as Definition 1 (RNN Heat Map Problem): Given two sets of the area safety, demographic composition and convenience of points O and F and a distance metric in a two-dimensional publictransportation.Somefactorsindecision-makingarealso space, the RNN set of a point q (q ∈/ F) is a subset of O that vague and imprecise, which are subject to decision maker’s have q as their nearest neighbor comparing with other points judgments. To assist decision making based on quantitative in F. Given any influence measure, which is a real-valued measures while still allowing subjective judgments based on function on the RNN set, associate each point in the space qualitative measures and other factors, we introduce the RNN with its influence value, i.e., the heat value. heat map, which shows the influence (quantitative measures) Sincethenumberofpointsinthespaceisinfinite,tosolvethe §Corresponding author. Copyright (cid:13)c 2016IEEE {o3}{o3,o4}{o1,o4}{o1,o2,o4} existing facilities f1 o{4o4} {o1} 3.0 o3 f2 1.0 {o2} {o1,o3} o1 o2 1.0 clients {o1} {o1,o3,o4} {o1,o2} (a) (b) Asuperimposition (c) Aheatmap Fig.2. Clientdensity Fig.3. AnexampleoftheRCproblem RNNHMproblem,wefirstreduceittoaproblemcalledRegion Besides not being able to compute the RNN heat map for Coloring (RC), which divides the space into disjoint regions, genericinfluencemeasures,asuperimpositionalsocannotsup- within which all the points have the same RNN set (detailed port interactive post-processing operations such as selectively in Section III). We use Fig. 3(a) to illustrate the RC problem showingregionswith heatvaluesabovea thresholdor regions withL .Forsimplicityandease ofpresentation,wewillfirst having the top-k heat values, whereas these operationscan be ∞ discusshowtoobtainsuchregionsandcomputetheirinfluence easily applied as post-processing of our proposed techniques, values with the L metric, and then extend the techniques whichaim toobtaintheRNN setofeveryregionin thespace. ∞ to L and L metrics. In Fig. 3(a), let O = {o ,o ,o ,o }, 1 2 1 2 3 4 In this paper,we investigatealgorithmsto efficiently solve represented by the black dots, and F = {f ,f }, represented 1 2 the RNN heat map problem. In some applications such as by the small red squares. For each point o in O, we draw a taxi-sharing,theheatmapmaychangeasclientsmovearound “circle” called the NN-circle with o being the center and the and need to be recomputed frequently. Therefore, an efficient distance to o’s nearest neighbor (NN) being the radius, which algorithmtotheRNNHMproblemiscrucial.Astraightforward is a square with the L metric. The NN-circles partition the ∞ approachsuchasemployingagridtodividethespaceandthen spaceintoseparateregions.Itcanbeprovedthatallthepoints using the cells to fit the regions has difficulties in finding the in such a region have the same RNN set. The RC problem is right granularity and suffers from low efficiency. When the to obtain the influence of each such region in the space. influence measure involves a large amount of attributes such Notethatifwemeasuretheinfluencesimplybythesizeof as the capacities of taxis and connections of clients, it can the RNN set, we can build the heat map by a superimposition also be very expensive to compute [22]. To overcome these of the NN-circles, i.e., overlap/overlay of translucent NN- challenges,weproposeaninnovativealgorithmnamedCREST circles as shown in Fig. 3(b). A darker region suggests more (Constructing RNN hEat map with the Sweep line sTrategy) NN-circles overlapping there and hence a higher influence. which efficiently solves the RNNHM problem. Through a However, for a more generic influence measure than the size detailed analysis, we prove that CREST is asymptotically (or a weighted sum of the RNNs), the heat map can not optimal in the worst case. CREST is also generic in the be achieved by such a simple superimposition. For example, sensethatitappliestoanyinfluencemeasurecomputablefrom considerataxi-sharingscenario[14]wheretheheatmapassists RNN sets and can easily support interactive post-processing taxi drivers to decide the next pick-up locations. In Fig. 3(a), operations as described above. The main contributions of this letO be potentialpassengers,e.g.,usersoftaxibookingapps, paper are summarized as follows. and F be taxis. Assume that taxi drivers make more profits • We propose the RNN heat map problem, which com- when taking together multiple passengers whose destinations putes a heat map showing the distribution of RNN- are close, say within onekilometer.Letthe data pointso , o , 1 2 basedscorestosupporteffectiveexploratoryanalyses. and o connected by an edge denote such passengers. Under 4 such setting, the influence of a location becomes the number • We propose an innovative algorithm named CREST of connected passengers in the RNN set. We build the heat which efficiently solves the RNN heat map problem. map as shown in Fig. 3(c). We can see that there is only one Thealgorithmutilizestwonoveltechniquestorespec- darkest region, which has an influence value of 3.0 since its tively avoid processing any RNN queries and greatly RNN set is {o ,o ,o } and there are three edges connecting reduce the times of influence computation. 1 2 4 o1, o2 ando4.Incomparison,thesuperimpositionasshownin • We carefully analyze the complexity of CREST and Fig. 3(b) creates two darkest regions, both have an influence lower bounds of the RC problem, and prove that value of 3.0, one with the RNN set {o1,o3,o4} and the other CREST is asymptotically optimal in the worst case. {o ,o ,o }. Under the measure that favors connected data 1 2 4 • We also conductextensiveexperimentswith both real points, the RNN set {o ,o ,o } only has an influence value 1 3 4 andsyntheticdatasets.Theresultsconfirmthesuperi- of 1.0, which is not a good choice for picking up passengers. ority of CREST by showing that CREST outperforms Another example is that in the previous courier company alternativealgorithmsbyseveralordersof magnitude. scenario, all the service points have a capacity limit (e.g., the storage space). Taking these attributes into account, the Theremainderofthispaperisorganizedasfollows.SectionII influence of a location will dependnotonly on the size of the reviews related work. Section III formalizes the problem. RNNsetbutalsoonitsservingcapacity1.Thesuperimposition SectionIVdiscussesabaselinealgorithm.SectionVdescribes will not be able to handle such influence measures. the CREST algorithm. Section VI analyzes the complexity. 1TheinfluenceofalocationpiscomputedbyP min{c(f),|R(f)|}, Section VII extends CREST to other settings. Section VIII wherec(f)isthecapacityandR(f)theRNNsetofff∈F[2∪2{].p} shows the experiments and Section IX concludes the paper. II. RELATEDWORK only pairwise intersections of line segments or rectangles, whileCRESTcomputestheoverlapsandrelativecomplements RNN Query. The RNN query is introduced by Korn et of multiple circles, squares, and axis-aligned line segments, al. [12]. Yang et al. [28] proposed the Rdnn-tree (a variant which are much more challenging. ii) In order to efficiently of R-tree) to process the RNN query. Maheshwari et al. [15] compute the RNN sets, besides the line status, CREST need presentadatastructureforansweringthemonochromaticRNN to memorize the RNN sets of previousevents. This requiresa query by utilizing a persistent search tree [18]. The structure delicate design to minimize the overheadand achieve optimal first obtains the NN-circles enclosing a query point in the x performance. BO does not have such optimization. dimension and then among these retrieved NN-circles locates theface(region)enclosingthepointinthey dimension.These algorithmsfocuson computingthe RNN set of a single query III. PROBLEM FORMULATION point.None of them directly appliesto the RNNHM problem. We firstintroducebasic conceptsinSectionIII-Aandthen In the RNNHM problem, the aim is to compute the RNN set reduce RNNHM to the Region Coloring (RC) problem in for every point in the space all at the same time, and the Section III-B. Frequently used symbols are listed in Table I. challenge is to avoid the expensive RNN computation. The TABLEI All Nearest Neighbor (ANN) [7] operation takes as input two FREQUENTLYUSEDSYMBOLS discrete and finite sets of points and computes for each point in the first set the NN in the second set. For the RNNHM Symbol Meaning problem,however,we needto obtainRNN sets foressentially O the set of clients infinitepointsinacontinuousspace.Therefore,thetechniques for ANN do not apply. F the set of facilities n the number of data points in O Influence Measures based on RNN Sets. Various influ- ence measuresbased on the RNN set have been studied.Korn C(oi) the NN-circle of oi ∈O etal. [12]proposetousethesize (orsumofweights)ofRNN xi (resp. yi) the left (resp. lower) side of C(oi) sets as the influence value. To find the optimal points whose xi (resp. yi) the right (resp. upper) side of C(oi) RNN sets are of the maximum size (influence), Cabello et el the l-th event al. [6] propose the maximization problem MaxCov and they x the x-coordinate of e l l solve the problem by finding the depth of an arrangement of I(l) the line status between el−1 and el disks.Wongetal.[26]solveMaxCovbythedevisedMaxOver- y the t-th element in a line status t lap algorithm.Huanget al. [11] and Xia et al. [27] investigate hyt−1,yti two consecutive elements in a line status finding such points in a given set. Sun et al. [21], [22], [23] additionally consider the capacity constraints of such points rlt the rectangle [xl−1,xl,yt−1,yt] R(·) the RNN set of an object and study how to achieve a global influence maximization instead of a local maximization. Qi et al. [17] define the influencebasedontheaveragedistancebetweenapointandits A. Preliminaries RNNs.AsRNNHMappliestoageneralmeasure,theRNNHM We consider two types of RNN queries: the bichromatic problem can be viewed as a generalized version of the above andmonochromaticRNN queries.Intheformertype,thedata problemsandthereforethesolutionofRNNHMcanbeadapted tosolvethese problems.However,theirsolutionsdonotapply points and their NNs belong to two different sets O and F. to RNNHM, since the special properties exploited in these In the latter type, they are from the same set, i.e., O = F. problems do not present in RNNHM. Let d(p,q) be the distance between two points p and q. We consider three different distance metrics: L , L , and L . ∞ 1 2 RNN Variations. RNNHM is a variantof the RNN query. We start with solving the bichromatic RNNs with L metric ∞ There are also many other studies on variations of the RNN because the bichromatic type is generic and L is simpler. ∞ query. For instance, Lu et al. [13] investigate reverse spatial and textual nearest neighbor queries, in which both location RNN Query. InbichromaticRNNs, we are giventwo sets and textual descriptions are considered in the distance metric. O and F. The set O can be considered as (the locations of) Similarly, Sun et al. [24] consider temporal aggregates of clients while F as (the locationsof)facilities. The clientsfind location-basedsocialnetworkcheck-insinthedistancemetric. their NNs from the facility set. The RNN set of a point f in Zhang et al. [30] design indexes utilizing modern memory F, denoted by R(f), consists of the points in O that have f hierarchies to speed up such query processing. Ali et al. [1] as their NN, i.e., study approaches to continuous retrieval of the query objects. R(f)={o∈O|∀f′ ∈F :d(o,f)≤d(o,f′)}. She et al. [19], [20] devise algorithmsto arrangesocialevents to proper users using RNN sets. These problems are quite Fora pointq notin F,we obtainitsRNN setR(q)byadding different from RNNHM and the proposed algorithms cannot q into the facility set F and computing R(q) as above. be adapted to solve the RNNHM problem. Nearest Neighbor Circle (NN-circle). An NN-circle of a Sweep Line Strategy. The sweep line strategy is a quite point o, denoted by C(o), is a circle with o being the center genericapproachto handlinggeometricobjects. The Bentley– and the distance from o to its NN being the radius. With Ottmann(BO)algorithmusesthisstrategytocomputeintersec- the L distance metric, the distance between two points is ∞ tionsoflinesegments[4] orrectangles[5].The BO algorithm the maximum difference between their coordinates among all and the proposed CREST algorithm compute very different dimensions,i.e.,d(p,q)=max{|p −q |,|p −p |}in a two- x x y y problems and are different in many aspects. i) BO computes dimensionalspace,wherethesubscriptsdenotethecoordinates vertex f q 1 3 o1 q2 edge o 2 q 1 C(o ) C(o ) 1 1 C(o2) face C(o2) Fig.6. Boundingedge Fig.7. Sideextension Fig.4. NN-circles Fig.5. Arrangement Definition 2 (Region Coloring): Given a set of NN-circles, in the x and y dimensions, respectively. Hence, NN-circles Region Coloring is to label each region in the arrangementof are of a square shape. (With L and L , the NN-circles are 1 2 the NN-circles based on the RNN set of any point contained of diamondand circularshapes, respectively.)For example,in in the region. Fig. 4, set O consists of two points o and o . Set F consists 1 2 of one point f . The NNs of o and o are both f . The NN- 1 1 2 1 circles C(o1) and C(o2) are the two squares. IV. ABASELINE ALGORITHM B. Problem Reduction A simple approachto the RC problem is to pick a point p inside each region, use a point enclosure query to obtain the Ourgoalistodrawtheheatmapofagivenspacebasedon NN-circles that enclose p, obtain the RNN set and label the theRNN setsofthe points.In acontinuousspace,thenumber region.However,pickingapointineachregionisanexpensive ofpointsisinfinite,whichmakesfindingtheRNNsetofevery operation. This is because it requires computing an exact point infeasible. To overcome this, we reduce RNNHM to an representationofeachregioninthearrangement,whichmeans equivalent problem Region Coloring (RC), which divides the everyedgeboundingaregionneedstobecomputed(cf.Fig.6) space into regions and “colors” (i.e., associates) every region and hence has a very high complexity (O(n2logn) [9]). To witha“heat”(i.e.,influencevalue).We dividethespaceusing avoid such complicated computations, we extend the sides of the NN-circles as follows. The arrangement (i.e., layout) of eachNN-circletoletthemspanacrossthewholearrangement, the NN-circles, as illustrated in Fig. 5, forms a planar graph, as shown in Fig. 7. By doing so we form a grid over the which also induces a subdivision of the space. We use the arrangement, where each grid cell can be easily located. We notions in planar graphs such as vertices, edges and faces scan the grid cells and compute the RNN set for the centroid (as illustrated in Fig. 5) in the arrangement directly. In the of each cell, which solves the RC problem. An alternative arrangement,each face represents a unique region, which is a way is to use a regular grid where each cell has the same maximal connected subset of the space that does not contain size. However, it is difficult to determine a proper cell size to a vertex or an edge (e.g., the gray region in Fig. 5). In each guaranteethateachcellfallsinexactlyoneregionunlesseach region,all the points have the same RNN set. If two points of pointistreatedasacell,whichagainisimpossibletocompute. a regionhavedifferentRNN sets, theremustexistatleast one To efficiently compute the RNN set of a point, instead of NN-circlethatonepointliesinsidebuttheotherdoesnot;this checking each NN-circle to test whether it encloses a certain means one side of the NN-circle must cut the region, making point, we build an index that supports point enclosure queries itnolongera regionbydefinition.TheRNN set ofeachpoint fortheNN-circles.We usetheS-tree[25]forease ofanalysis, in the region consists of the centers of the NN-circles that althoughotherspatialindexessuchastheR-treemaybeused. enclose the region. For example, in Fig. 5, points q and q 1 2 lie in the region enclosedby NN-circles C(o ) and C(o ), and Algorithm Complexity. Let n = |O| denote the number 1 2 they have the same RNN set {o ,o }. For point q , its RNN of NN-circles, and m denote the number of grid cells. There 1 2 3 setis{o },whichisdifferentfromthatofq orq .Therefore, are at most 2n extended sides vertically or horizontally, thus 1 1 2 q must lie in a different region. Note that the opposite does m = O((2n)2) = O(n2). To obtain the grid cells, it takes 3 not hold, i.e., different regions may have the same RNN set. O(2×2nlog2n) =O(nlogn) time to sort the sides. It then We formalize the above facts with the following proposition. takesO(nlog2n)timeto buildanS-treeindexandO(logn+ α)timetoprocessapointenclosurequery[25],whereαisthe Proposition 1: The points in the same region of the subdi- number of NN-circles returned. Let λ be the maximum size vision formed by the arrangement of the NN-circles have the of the RNN sets in the arrangement. The time complexity of same RNN set. the baseline algorithm is O(nlog2n+mlogn+mλ). Since For RNNHM, each region can be used to represent all the we consider a general influence measure, which can be any points it contains. To associate each point with a heat, it function with any computationalcost, in the analysis we only suffices to color each region with the heat of the points it count the number of times of influence computation, i.e., m contains. Since the influence is computed straightforwardly in the above complexity. We further derive a bound for m as based on the RNN set, in the following discussion, we do not follows. Let r be the number of regions formed by n NN- distinguish the process of outputting the RNN set of a region circles. It can be proved by Euler characteristic that r is and the process of computing and outputting the influence betweenΘ(n)andΘ(n2).Inparticular,whenthenNN-circles value. We will simply use the term “labeling a region” to do not intersect with each other, r =n+1=Θ(n); when the denote the two processes. Assuming that the NN-circles are n NN-circles are placed as shown in Fig. 8, where they all alreadyprecomputed(thereareefficientalgorithmstocompute have the same side length n and the ith NN-circle is centered andmaintain the NN-circles[12]), we define the aboveregion at point (i,i), r = n2 −n+2 = Θ(n2). Since r ≤ m and coloring problem as follows. m=O(n2), we obtain Θ(r)≤m≤Θ(n2). y y {o ,o } 2 2 1 3 (cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1) C(o2) o (cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1) y1 2y1 (cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1) y2,y3 y3 (cid:0)(cid:1){o(cid:0)(cid:1)(cid:0)(cid:1),o(cid:0)(cid:1)}(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1) (cid:0)(cid:1)(cid:0)(cid:1)1(cid:0)(cid:1)(cid:0)(cid:1)2 (cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1)(cid:0)(cid:1) y2 y2 Fig.8. Worstcase (cid:0)(cid:1)Fi(cid:0)(cid:1)g.(cid:0)(cid:1)9.(cid:0)(cid:1)S(cid:0)(cid:1)et(cid:0)(cid:1)ch(cid:0)(cid:1)an(cid:0)(cid:1)ging C(o1) o1 o y y 3 The S-tree index for point enclosure queries occupies 1 1 C(o ) 3 the most space, which is O(nlog2n) [25]. Thus, the space y 3 complexity is O(nlog2n). We summarize the above analysis s x with the following theorem. x x x x x x x x x 1 2 3 4 5 6 7 8 9 Theorem 1: ThebaselinealgorithmfortheRCproblemstops Fig.10. Exampleofevents andlinestatus in O(nlog2n + mlogn + mλ) time and uses O(nlog2n) RNN sets of adjacent regions the base sets and cache them space, where m is the number of grid cells and λ is the for obtaining the RNN sets of newly swept regions. We also maximum size of the RNN sets. devisetechniquestoconstrainthenumberofcachedbasesets. Limitations of the algorithm. One drawback of the Powered by these techniques, we achieve a highly efficient baseline algorithm is that it needs to process point enclosure algorithm to the RNNHM/RC problem,which is provedto be queries,whichisreflectedinthenlog2nandmlogntermsin asymptotically optimal in many cases (cf. Section VI). Next thetimecomplexity.Anotherdrawbackisthatitfurtherdivides we detailed these techniques. the regions into multiple grid cells, which means a region in the original arrangement will be labeled multiple times. The A. Concepts and Notation number of these grid cells m may increase quadratically with theincreaseofn(closertoΘ(n2)).Alargemmeansweneed Events.Letxi(resp.yi)andxi (resp.yi)bethex-(resp.y- toprocessalargenumberofpointenclosurequeriesandlabela )coordinatesoftheleftandright(resp.lowerandupper)sides largenumberof gridcells, whichsignificantly deterioratesthe of NN-circleC(oi), respectively.The distinctx-coordinatesof efficiency.Weaimtoreducem(thenumberoftimesofregion theverticalsides(ofalltheNN-circles)arestoredinascending labeling)tothenumberofregionsinthearrangement,whichis order in a queue,which is called the eventqueue and denoted optimalin theRC problem.Therefore,wehavetwodirections by Qe. The elements in Qe are called the events or event for improvements: (i) to avoid point enclosure queries, and points. For convenience, we refer to the coordinates of the (ii) to reduce the numberof times regionlabeling.We present sides simply as sides when the contextis clear. We denote the our CREST algorithm which achieves these two goals in the lth event(i.e., the lth ejected element from Qe) by el and the following section. x-coordinateofel byxl.Notethedifferencebetweenanevent coordinatex andtheside coordinatesofC(o )(i.e.,x orx ). i i i i V. THE CREST ALGORITHM They may have an equal value, but different semantics. We employ the classic sweep line strategy [4], [9] (cf. Line Statuses. Let sx be the x-coordinate of the sweep Section II) to avoid forming a large number of cells to be line. We say that the line cuts NN-circle C(oi) if and only labeled as done by the grid dividing strategy. We let a line if sx ∈ (xi,xi] (i.e., it is in the horizontal range of C(oi)). sweep from the left to the right of the space, and store By definition, the NN-circles that are cut by the line remain informationabout the NN-circles that are currently cut by the the same between two consecutive events (including their sweep line. We call such information the line status, and say positions). Let C(ol1),C(ol2),...,C(oln(l)) be the n(l) NN- that an event is triggered when the line status changes. As circlescutbythelinewhenitsweepsfromel−1 toel.We sort illustrated in Fig. 10, we use the distinct vertical sides of thehorizontalsides(notonlycoordinates)yl1,yl1,yl2,yl2,..., the NN-circles as event points (i.e., x1,x2,...,x9), and the yln(l),yln(l) of these NN-circles in ascending order (ties are sorted horizontal sides of the NN-circles as the line status. broken arbitrarily), and use the sorted list as the line status Every pair of adjacent vertical sides and horizontal sides between events e and e , which is denoted by I(l) = l−1 l forms a subregion to be labeled. We notice that some of ky ,...,y ,...y ,...,y k, a,b,c,d ∈ {1,2,...,n(l)}. la lb lc ld these subregions come from the same original region formed For example, in Fig. 10, the current line status is I(3) = by the NN-circles, and hence do not require the RNN set ky ,y ,y ,y k. For convenience, we denote by y the ith 1 2 1 2 i and influence computations repetitively. We use the change element in the line status, and hence the line status between intervals to avoid labeling such regions multiple times. We e and e is l−1 l avoid the RNN computation with point enclosure queries by utilizing the factthatthe RNN set of a regioncan be obtained I(l)=ky1,y2,...,y2n(l)k, y1 ≤y2 ≤...≤y2n(l). efficientlybymodifyingtheRNN sets oftheadjacentregions. For example, in Fig. 9, if the RNN set of the lower region is Pair andSubregion.Anytwoconsecutiveelementsinthe {o ,o } and the boundarybetween the two regions is formed line status is termedas a pair, which is denotedby hy ,y i. 1 2 t−1 t bytheupperandlowersidesofC(o )andC(o ),denotedbyy We denote by hy ,y i∈I(l) that the pair hy ,y i comes 2 3 2 t−1 t t−1 t andy ,respectively,thenwecanimmediatelyobtaintheRNN fromthelinestatusI(l).Wedenoteby[x,x′,y,y′]arectangle 3 set{o ,o }ofthe upperregionbyremovingo from{o ,o } whose diagonallyopposite cornersare (x,y) and (x′,y′) with 1 3 2 1 2 and then adding o to {o }. We call the already-computed x < x′ and y ≤ y′. When y = y′, [x,x′,y,y′] is in fact a 3 1 horizontallinesegment.Foreaseofpresentation,wetreatitas It is easy to observe that if we have obtained the RNN set a special rectangle. We denote by (x ,y ) ∈ [x,y,x′,y′] that R(hy ,y i) of a valid pair, we can start from y (which 0 0 t−1 t t point (x ,y ) is in rectangle [x,y,x′,y′]. Here the rectangle is the first element among elements of the same value) and 0 0 is open (i.e., (x ,y ) ∈ [x,y,x′,y′] iff x ∈ (x,x′) and use R(hy ,y i) as the base set for the validpair hy ,y i 0 0 0 t−1 t t′−1 t′ y ∈ (y,y′)) and no point is in the special rectangle. When immediatelynexttoit.Inthisway,wecanobtaintheRNNset 0 the line sweeps from e to e , the x-coordinate x of ofeveryvalidpair(inonelinestatus)withasingletraversalof l−1 l l−1 event e is strictly less than that of e . This forms a the line status. Continuingwith the aboveexample in Fig. 10, l−1 l rectangle [x ,x ,y ,y ] between the two events. In this for pair hy ,y i, we use R(hy ,y i) = {o } as base set, l−1 l 1 2n(l) 2 3 1 2 1 rectangle, each pair hy ,y i∈I(l) forms a small rectangle encounter y , add o , and stop with R(hy ,y i) = {o ,o }. t−1 t 2 2 2 3 1 2 [x ,x ,y ,y ]. The small rectangle has no vertex or edge For pair hy ,y i, we remove o from {o ,o } and stop with l−1 l t−1 t 3 4 1 1 2 in it, which makes it a connected subset of a region. We R(hy ,y i)={o }. 3 4 2 call each small rectangle a subregion, and denote by rt the l one formed by pair hyt−1,yti ∈ I(l). We denote by R(rlt) C. Reducing the Number of Times of Region Labeling the RNN set of the points in subregion rt, or simply by l 1)LocatingtheChangeInterval: Withtheaboveapproach, R(hy ,y i) when the line status is clear. t−1 t we obtain the RNN sets and label the corresponding regions between two events e and e . We then move the sweep B. Avoiding Point Enclosure Queries l−1 l line forward across e and label regions between e and e . l l l+1 We obtain the RNN set of each subregion by finding the Crossing e , we obtain a new line status I(l+1). We notice l NN-circles enclosing it. When the line sweeps from el−1 to that some of the pairs in I(l) and I(l + 1) represent the el, the subregions between el−1 and el are enclosed by the same regions (not subregions) even though they are formed NN-circles in the x dimension if and only if these NN-circles by different NN-circles. For example, in Fig. 10, between e 2 are cut by the line. Therefore,we only need to check whether and e , I(3) = ky ,y ,y ,y k, while between e and e , 3 1 2 1 2 3 4 these NN-circles enclose the subregions in the y dimension, I(4) = ky ,y ,y ,y ,y ,y k. The pair hy ,y i ∈ I(3) and 3 1 2 3 1 2 2 1 which can be easily achieved by checking the line status. We new pair hy ,y i ∈ I(4) represent the same region. Besides 3 1 use the followinglemma to show the RNN set of a pair in the new pairs, a pair also represents the same region if it exists line status. Due to space limitation, we omit the proofs of the in both I(l) and I(l +1) and the RNN sets of the pair in lemmas in this section. the two line statuses are the same (e.g., hy ,y i in the above 1 2 Lemma 1: ∀hy ,y i∈I(l), the RNN set R(rt) of subre- example). The reason is that, by Lemma 1, the RNN set of a gion rt = [x ,t−x1,yt ,y ] is an empty set if yl = y or pair is changed if and only if the pair is entirely enclosed by l l−1 l t−1 t t−1 t an NN-circle that is inserted into (i.e., newly cut) or removed a set consists of the centers of the NN-circles that are cut by the line and enclose rt in the y dimension, i.e., R(rt) is from (i.e., no longer cut by) the line. When the RNN set of l l a pair does not change, the two subregions formed by the ∅ if yt−1 =yt, pairmustbeconnected(andhencerepresentthesameregion), (cid:26) {oi |xi <xl ≤xi and yi ≤yt−1 <yt ≤yi} if yt−1 6=yt. since no side of NN-circles separates them. To reduce the numberoftimesofregionlabeling,weshouldavoidprocessing By Lemma1,we canobtaintheRNN set R(hy ,y i) of pairs representing the same regions, i.e., only some of the t−1 t a pairas follows.When y =y , the RNN set is empty.For newly formed pairs and the pairs that exist in both line status t−1 t convenience, we call such pairs invalid pairs and the others whose RNN sets are changed should be processed. We use (with y < y ) valid pairs. For a valid pair, we check the the following lemma to precisely locate the pairs that need t−1 t elements in the line status in the range of (−∞,y ]. Since to be processed when only one NN-circle is changed in (i.e., t−1 the elements are sorted in ascending order, y (resp. y ) inserted into or removed from) the line. t−1 t of a valid pair must be the last (resp. first) element among Lemma 2: When a line status I(l) is changed into a new elements of the same value. Thus, we only need to check line status I(l′) because an NN-circle C(o )=[x ,x ,y ,y ] elements from the beginning of the line status to the first c c c c c is newlyornolongercutbythesweep line,i.e.,y andy are element (inclusive) of the pair. Starting with an empty set, c c inserted into or removed from I(l), we only need to process which is called the base set and denoted by R, if an element the pairs in the following set is a lower side, we add the center of the corresponding NN- circle to R, otherwise we remove the center from R. When {hy ,y i∈I(l′)|y ≤y <y ≤y }. t−1 t c t−1 t c reaching the second element (exclusive) of the pair, we stop and R is the RNN set of the pair. For example, in Fig. 10, By Lemma 2, the pairs that need to be processed are located the line status is I(3)=ky1,y2,y1,y2k. For pair hy1,y2i, y1 within a range. We call such a range a changed interval and is the only elementwe encounteredin the checkingrange and denote it by [y ,y ]. Note that y and y are coordinate ci cj ci cj hence R(hy1,y2i) = {o1}. We formally describe the above values,notline elements.When the line triggers(i.e.,crosses) approach with the following corollary. an event, multiple NN-circles are inserted into or removed from the line, and hence several (initial) changed intervals Corollary 1: ∀ ∈ hy ,y i ∈ I(l) with y 6= y , the RNN set R(rt) of a stu−b1regtion rt = [x ,xt−,y1 ,yt] can are created. We cannot process such changed intervals one l l l−1 l t−1 t by one, since they may intersect and affect each other. We be obtained by checking elements y for i = 1 to t−1 and i maintaining the set R(rt) as follows needtomergetheintersectedchangedintervals.Whenmerging l two changed intervals, we need to be careful about the line o is removed from R(rt) if y is y , elements that are of the same value so that no regions are k l i k (cid:26) ok is added into R(rlt) if yi is yk. labeled repeatedly. Specifically, any two changed intervals 7 respectively.InI(3),C(o )isinsertedandthechangedinterval 2 is [y ,y ]. The element immediately preceding the changed C(o2) o2 6 o4 C(o4) interv2al2is y1. We obtain {o1} as the base set with key 2× 1−1 = 1, and keep records (2×2−1 = 3,{o ,o }), (2× 1 2 5 8 1=2,{o }) and (2×2=4,∅) for hy ,y i, hy ,y i and y , 2 2 1 1 2 2 3 4 respectively. We now have (1,{o }), (2,{o }), (3,{o ,o }) 1 2 1 2 C(o1) o1 and (4,∅) cached for future use. 2 o 3 C(o ) D. The Algorithm 3 1 We now present the detailed steps of CREST, as summa- rized in Algorithm 1. We first obtain the event queue Q by x x1 x2 x3 x4 x5x6x7x8 x9 storingtheverticalsidesoftheNN-circlesinQx inascending Fig.11. Exampleofchanged intervals order (line 4). The sides are stored in a way such that for each side, we can directly obtain the NN-circle to which it [yci,ycj] and [yci′,ycj′] with yci ≤ yci′ are merged into a belongs and whether it is the left or right side. We then new one [y ,max{y ,y }] if y ≥ y . After merging, ci cj cj′ cj ci′ we only need to handle separated changed intervals, which Algorithm 1: The CREST algorithm can be processed individually. For example, in Fig. 11, when Input: An arrangement of n NN-circles the line crosses x3, the grey area is processed, in which Output: A subdivision with each region labeled C(o3) is inserted into the line. The pairs (i.e., subregions), 1 T ←∅ ⋄ the index structure for the horizontal sides for conveniencedenotedby 1, 2, and 3, need to be processed, 2 U ←∅ ⋄ the changed NN-circles between events which are in changed interval [y ,y ]. When the line crosses 3 P ←∅ ⋄ the cached RNN sets 3 3 x , C(o ) is removed and C(o ) is inserted. Only pairs 4, 5, 4 Qx ← the vertical sides of the NN-circles in ascending order 64and 71 in [y ,y ] (merged f4rom [y ,y ] and [y ,y ]) are 5 for each element v in Qx do processed. Wh1en 4the line crosses x ,1C(o1 ) is rem4ove4d and 6 C(oi)← the NN-circle to which v belongs 5 2 7 Add C(oi) into U onlypair 8, the onlyone in [y ,y ], is processed.The validity 2 2 8 if v is a left side then of the sweep line strategystill holdswith the aboveapproach, 9 Insert yi, yi into structure T since a simple induction will show that all pairs in all line 10 else statuses are properly processed. 11 Delete yi, yi from structure T 12 Remove the corresponding records from P 2)Caching and Retrieving Base Sets: To obtain the RNN 13 if the next element v′ of Qx equals v then set within a changed interval, an efficient way is to use the 14 Continue the outer for-loop RNN set of the pair that is immediately preceding of the 15 Merge the changed intervals of the NN-circles in U changed interval as the base set. Such a pair must be a valid 16 Delete all elements in U pair, since the changed interval includes the line elements 17 for each separated changed interval do whose values are equal to the interval boundary. Therefore, 18 Find the starting element st and ending element ed we cache (only) the RNN set of each valid pair in the line 19 R← Retrieve the base set from P status. We index the RNN set of a pair with its first element. 20 for each element y between st and ed do Specifically,ifthepair’sfirstelementisthelower(resp.upper) 21 C(oj)← the NN-circle to which y belongs 22 if y is the lower side then side of C(o ), we assign the RNN set a key 2i−1 (resp. 2i). i 23 R← add oj into the base set When the RNN set of a pair is changed, the record in the 24 p←2j−1 index is also updated accordingly (for elements of the same 25 else value, the recordis always maintained only at the last one for 26 R← remove oj from the base set efficient access and space saving). In this way, the base set 27 p←2j for a changed interval is the record of the element that is one 28 if y is greater than the next element y′ then positionaheadofthechangedinterval.(Incasethatthechange 29 Label the region represented by pair hy,y′i intervalis atthe end of the line status, we also keep an empty with set R set for the last element of a line status.) When we process 30 P[p]←R several separated changed intervals in ascending order, it is guaranteed that such a record is always available and up-to- process the elements in Qx one by one (line 5). If an element is a left (resp. right) side, we insert (resp. remove) the two date.Specifically,lety betheelementwhoserecordweneed. t horizontalsizesoftheNN-circlecorrespondingtotheelement Ifnosuchelementy exists,thebasesetisanemptyset,since t thechangedintervalmustbeatthebeginningofthelinestatus. into (resp. from) a balanced search tree T in which the data are stored in the doubly linked leaf nodes (e.g., a B+-tree) Ify istheboundaryofaprecedingchangedinterval,thensuch t arecordisalreadyupdatedandreadytouse,otherwisey must (lines 6-14). When the next element in Qx is greater than the t exist in the last line status and the record is also available. currentone, we processthe event. The structureT nowstores the informationof the currentline status. We obtain separated We use an example to illustrate the above approach. In changedintervalsbymergingthey-coordinatesoftheinserted Fig. 10, I(1) is empty, I(2) = ky ,y k, and [y ,y ] is the and removed sides in this event (line 15). For each changed 1 1 1 1 changed interval which is at the beginning of I(2). We thus interval(line17),welocatethestartingandendingelementsin useanemptysetasthebaseset,andkeeptherecords(2×1− T (line18),andretrievethebasesetfromtheRNNsetrecords 1,{o }) for hy ,y i, and (2×1,∅) for y (the last element), (line 19), which are stored in a random access data structure 1 1 1 1 a5 a3 a1 a4a2a6 a6 such as an array. We then sequentially check the elements in a2 eaadcdhocrhraenmgoevdeitnhteercvoarlre(lsipnoen2d0in).gFdoartaepacohintel(eim.e.e,noti,)wfreomeiththeer a4 o4o2o6 base set to obtain RNN sets of the valid pairs and label the o1 a1 regions (lines 21-29). To facilitate efficient insert, delete and o3 copy operations on the base set, we keep the data points in C(o1) a3 o5 a linked list and store pointers to the nodes in the linked list a5 withanadditionalrandomaccessdatastructureindexedbythe Fig.12. Multilabelling Fig.13. Reduction data points. The RNN sets we obtained are also dynamically equal to the number of regions in the arrangement which is recordedtosupportthefuturebasesetretrieval(line30).After in turn greater than or equal to the number of NN-circles n. all changed intervals are processed, we eject the next element Therefore,the time complexityof CREST is O(nlogn+kλ). in Q and repeat the above steps until Q is empty. x x Shortly(in Section VI-B) we will provethat k=Θ(r), where r is the number of regions in the arrangement. VI. COMPLEXITY ANALYSIS The space required by the queue Q and structure T is x A. Complexity of CREST O(n). The space requiredby cachingthe RNN sets is O(nλ), andthestorageofthebasesetrequiresO(n+λ)space.Overall, We analyze the time complexity of CREST following the space complexity of CREST is O(nλ). the steps in Algorithm 1. We sort the 2n vertical sides of the n NN-circles in O(nlogn) time. When we process the events, each horizontal side is inserted into and then deleted B. Bounding the Times of Region Labeling in CREST from the structure T once. Therefore, there are at most 2n From the above analysis, we can see that the number of elements in T, and the 2×2n insertions and deletions can times of region labeling k largely decides the performance of be done in O(nlogn) time. To merge the changed intervals CREST. In CREST, we successfully avoid labeling the same at an event, we can first sort them in lexicographical order regionmultipletimesindifferentlinestatuses.Althoughrarely and then obtain the merged result with a linear scan. This happens,multi-labelingstill existswithin the same line status. requires O(βlogβ +β) = O(βlogβ) time, where β is the For example,in Fig. 12, at the event of the left side of C(o ), number of changed intervals at the event. Since each NN- 1 thegrayregionislabeledsixtimes.Despitethemulti-labeling, circle can only be a changed interval twice, the total number we show with the following lemma that the number of times of changed intervals in all events is O(n). Thus, the overall of regionlabelingin CREST andthenumberof regionsin the time required for merging the changed intervals is bounded arrangement, up to a constant factor, are asymptotically the by O( βlogβ) = O(logn β) = O(nlogn). For each same (as a function of n). mergedPchangedinterval,weobPtainitsstartingelementinT in O(nlogn+λ)time,whereλisthemaximumsizeoftheRNN Lemma 3: k = Θ(r), where r is the number of regions in setsinthearrangement.ThisisbecausewefirstsearchinT in the arrangement. O(logn) time to obtainan elementy whose valueis equalto i Proof: Let v, e, and c be the number of vertices, edges the lower endpointof the interval. Starting fromy , we obtain i and connected components in the arrangement, respectively. the starting element by checking backward (to the beginning We call the number of edges boundinga region the degree of of T) untilthe elementsare less than y . This proceduretakes i the region, and the number of edges incident to a vertex the O(2λ)=O(λ)time,sinceλisthemaximumsizeoftheRNN degree of the vertex. In an arrangement of squares, there are setsandthereareatmostλuppersidesandλlowersidesthat only2-,3-,and4-degreevertices,whicharedenotedbyv ,v , are of the same y-coordinate. Symmetric analysis applies to 2 3 and v , respectively.In CREST, the number of times a region obtaining the ending element. We have only O(n) changed 4 islabeledcannotbegreaterthanthedegreeoftheregion,since intervals, and thus obtaining starting and ending elements can each time the region is labeled we need a distinct valid pair bedoneinO(nlogn+nλ)time.Wethenprocesstheelements which requires at least one of the edges bounding the region. betweenthem.Wefirstretrieveabaseset,whichtakesatmost Therefore, k is less than or equal to the sum of degrees of O(λ) copying time. Thus, it takes O(nλ) time to obtain base all regions which equals 2e, i.e., k ≤2e. In the arrangement, sets for O(n) changed intervals. For each element between we also have v = v +v +v , 2e = 4v +3v +2v and thestartingandendingelements,weeitheraddintoorremove 2 3 4 4 3 2 v−e+r−c=1(Eulercharacteristic).Combiningthesethree fromthebasesetitscorrespondingdatapoint(i.e.,o ).Ittakes i equations,we obtainr =v +v /2+c+1. We then havethat at most O(λ) time for the adding or removing operations to 4 3 obtain an RNN set for a valid pair. This is because to get an k ≤2e=4v +3v +2v ≤6(v +v /2+c+1)+2v =6r+2v . 4 3 2 4 3 2 2 RNN set of size α by changing an RNN set of size α , at t s mostα datapointsareremovedandα datapointsareadded, The number of 2-degree vertices is less than or equal to 4n s t which takes O(α +α ) = O(λ) time. We denote by k the andhencelessthanorequalto4r, sinceeachsquaremakesat s t number of valid pairs, and hence the time for obtaining the most four 2-degree vertices and n ≤ r. Therefore, it follows RNN sets for k valid pairs is bounded by O(kλ). For each that valid pair, we record its RNN set and label its corresponding k ≤6r+2v ≤6r+8n≤14r. 2 region, and this takes O(kλ) time. Obviously, r ≤ k, and hence r ≤ k ≤ 14r, which completes Putting all things together, we have that CREST stops in the proof. O(nlogn+nλ+kλ) time. Since k denotes the number of times of region labeling in CREST, k must be greater than or We conclude the above analysis with the following theorem. y 2 Theorem 2: The CREST algorithm solves the (bichromatic) RCprobleminO(nlogn+rλ)timewithO(nλ)space,where y1 r and λ are the number of regions and the maximum size of o the RNN sets in the arrangement, respectively. 2 o 1 y C. A Lower Bound of the RC Problem 2 We show that Ω(nlogn+rλ∗) is a lower bound of the y 1 RC problem (in the algebraic computation tree model) [3], where λ∗ is the average size of RNN sets in the arrangement. x x x x x x x x x x x 1 2 4 5 7 16 17 20 28 29 31 When rλ∗ is the dominating term, at least the RNN sets of Fig.14. CRESTwithL2 distancemetric all regions need to be output, the above bound is a trivial Therefore, it follows that lower bound. Thus, we only need to show that it requires Ω(nlogn) operations even without considering the output n3+2n n n3+2n n λ λ∗ = = · ≥ = . cost. This bound is proved by the reduction from the point 3(n2−n+2) 3 n3−n2+2n 3 3 distinctness problem to a special case of the RC problem. Since λ∗ ≤ λ, it follows that λ = Θ(λ∗), which indicates Definition 3 (Element Distinctness): Given real numbers CREST is overall asymptotically optimal. a ,...,a ∈ R, determine whether or not there is a pair i, j 1 n with i6=j and ai =aj. VII. RNNHM IN OTHERSETTINGS Weshowthattheelementdistinctnessproblemcanbereduced We show how CREST solves the RNNHM problem with to the RC problem in linear time. For each real number ai, the monochromatic RNNs, L1, and L2 distance metrics. we create a point (a ,a ), i = 1,2,...,n in the plane. We i i then build a square C(o ) with point (a ,a ), i = 2,...,n i i i A. Monochromatic RNNs and point (a ,a ) being the diagonally opposite corners and 1 1 o beingthecenter. Anexampleof suchreductionisshownin CRESTdirectlyappliestothemonochromaticRNNs,since i Fig. 13. These squares form an arrangement of NN-circles in O and F being the same set does not affect the computation a two-dimensionalspace. We use this arrangementas inputto of the NN-circle and these NN-circles still form a planar anyalgorithmthatsolvestheRCproblem.Acorrectalgorithm subdivision with axis-aligned edges as in the bichromatic outputs exactly n RNN sets (including the empty set) if and RNNs. By Korn et al. [12], an RNN set contains at most only if the elementsare distinct. The reason is thateach RNN six points for monochromatic RNN queries, which means set corresponds to only one region, and there are n regions λ = O(1). Therefore, by Theorem 2, the time complexity of (including the exterior face) in the arrangementif and only if CREST for the monochromatic RNNs is O(nlogn+r) and the elements are distinct. It has been proved that the element the space complexity is O(n). distinctness problem has a lower bound Ω(nlogn) [3] (in the algebraiccomputationtree model),which impliesthatRC has B. RNNHM with L1 Distance a lower bound Ω(nlogn) without the output cost. Therefore, Ω(nlogn+rλ∗) is a lower bound of the RC problem. In two-dimensionalspaces, the L1 distance can be viewed as equivalent to the L distance by rotation and scaling. ∞ Specifically, with the L distance, NN-circles are of diamond 1 D. Optimality of CREST shape. If we rotate (around the origin) the coordinate system counter-clockwise by π/4, diamonds become squares. Each From the time complexity of CREST and lower bound of point (x,y) in the original system has a corresponding point the RC problem, CREST is asymptotically optimal in terms (x′,y′) in the rotated system with x′ = xcosθ − ysinθ, of the number of times of region labeling (i.e., influence y′ = xsinθ +ycosθ and θ = π/4. In the rotated system, computation) in all cases, since k =Θ(r). CREST directly applies. The transformation takes O(n) time In the following cases, we show that the upper bound and the overall time and space complexities stay unchanged. O(nlogn+rλ) of CREST is also tight, which indicates that CREST is overallasymptoticallyoptimal.Forthe boundto be C. RNNHM with L2 Distance tight, it is sufficient to show that λ=Θ(λ∗). With the L distance, the NN-circlesare of circularshape. 2 Case (i). When the clients and facilities are relatively Theyformaplanarsubdivisionwithcurvededges,asshownin uniformly distributed such that λ is bounded by a sufficiently Fig.14.CRESTstillappliesinsuchasubdivisionbutrequires large constant C (which dependson |O|), since λ∗ ≤λ, λ∗ is modifications as follows. We use the x-extreme points of |F| circles(instead ofverticalsides ofsquares)as eventpoints.In also boundedby C. Thus, λ=Θ(λ∗)=O(1). An exampleis thelinestatus,weusethearcsegmentsofcirclesbetweentwo that none of the n squares intersects any other ones (or only consecutiveeventsaslineelements(insteadofhorizontalsides a few of them overlap). ofsquares).Foreachlineelementy (i.e.,anarcsegment),we i Case (ii). When λ is unbounded, we show with the worst assigntwovaluesys andyl,whicharethesmallestandlargest i i case illustrated in Fig. 8 that λ = Θ(λ∗) also holds when y-coordinates of y between two consecutive events e and i l−1 everysquare intersects all the other ones. In this arrangement, e , respectively.Line elementy is less than y iff (i) ys <ys l i j i j λ=n, and we have that r =n2−n+2 and r·λ∗ = n3+2n. or (ii) ys = ys and yl < yl or (iii) ys = ys,yl = yl and 3 i j i j i j i j TABLEII ym < ym, where ym and ym are the y-coordinates of y and REALDATASETS i j i j i y at xl−1+xl, respectively. We include intersection points as Name Size Description j 2 eventpoints(e.g,x4 andx7 inFig.14).Thisisbecausethearc NYC 128,547 points-of-interest in New York City segmentsofNN-circlesswitchpositionsatintersectionpoints. LA 116,596 points-of-interest in Los Angeles We also use thecenterofeachNN-circleaseventpoints(e.g., x5 and x29 in Fig. 14) to guarantee that each line element a special case of the bichromatic RNNs. We use L1 and L2 is y-monotone (i.e., strictly increasing or decreasing in the distance metrics since they are used more often than L∞ in y-dimension). Before processing an event, we update values real-world scenarios, and L1 and L∞ are equivalent in two- ys and yl for each line element y regardless of whether it is dimensional spaces. We uniformly sample from the data sets i i i relatedtotheevent.Thisupdateisrequiredinordertomaintain toobtaintheclientsetO andthefacilitysetF.Allalgorithms aproperorderinthelinestatusbecausethearcsgoupordown areimplementedusingC++andtheexperimentsareconducted betweeneventsandtheirupperandlowervalueschangeevenif on a desktop computer with a 3.4GHz Intel i7-2600CPU and theyareirrelevanttotheevents.Notethatsuchanupdatedoes 8GB main memory. not change the relative order of the line elements irrelevantto the events, and thus can be completed in linear time. A. Showcasing Real-World Heat Maps Apart from the above modifications, CREST remains the In the first set of experiments, we show the RNN heat same.Specifically,ifaneventpointate istheleftboundary mapsfortwocities:NewYorkCityandLosAngeles.Foreach l−1 point of C(o ), we insert y and y into the line status with data set NYC and LA, we uniformly sample 20,000 points c c c y l = y s = y and y s and y l being the lower and upper as the clients and 6,000 points as the facilities, since in real c c oc c c worldscenariosthenumberofclientsisusuallylargerthanthe y-coordinates where e intersects C(o ), respectively. We also l c number of facilities. For simplicity, we measure the influence createa changeintervalwith y andy . If aneventpointisan c c by the size of RNN sets, although any other function on the intersection point, we obtain the relevant line elements (arcs RNN sets may be used. Fig. 1(a) and Fig. 1(b) (in Section I) incidenttotheintersectionpoint)inthelinestatus,switchtheir show the RNN heat map and the satellite map of New York positions and create a change interval with these elements. City (within latitude and longitude ranges [40.50, 40.95] × If an event point is the right boundary point or center of [−74.15, −73.70]), while Fig. 15(a) and Fig. 15(b) show the an NN-circle, we remove or update the two line elements heatandsatellitemapsofLosAngeles(within[33.82,34.17]× corresponding to the NN-circle. We do not create change [−118.47, −118.12]), respectively. Comparing the heat and intervalsforeitherofthesetwotypesofeventpoints,sinceno satellite maps, we can see thatthey are closely geographically pair is between the removed elements and updating the line correlatedasexpected.Forinstance,themountainandseaarea elements by the centersis only to keep them y-monotone.We have few clients or facilities, and hence have very low heat. then merge and handle change intervals as before. We can easily explore regions of various influences to help Complexities. In the worst case (as shown in Fig. 8), various decision making applications such as those described CREST runs in O(n3) time with the L2 distance, since there in the motivatingexamples.If the decisionmakerisinterested can be as many as O(n2) events and for each event we need in any specific area, she can zoom in to see more details. to update O(n) line elements. However, the worst case com- plexity is much lower than that of an existing algorithm [22], which suffers from an exponential running time in the worst case. The algorithm was proposed to obtain regions with the maximum influence value, but it could be adapted to solve the RC problem if we remove its pruning techniques. The algorithm [22] follows the filter and refine paradigm by enumeratingall possible regionsand then checkingtheir exis- tence. For example,when C(o ) intersects C(o ) and C(o ), it 1 2 3 enumeratestheregionsoˆoˆoˆ,oˆoˆo¯,oˆo¯oˆ,oˆo¯o¯,where 1 2 3 1 2 3 1 2 3 1 2 3 oˆ means inside C(o ) and o¯ means outside C(o ), and then i i i i (a) HeatmapforLA (b) Satellite mapforLA checks whether such regions really exist. In our experiments Fig.15. Real-world heatmap (in Section VIII), CREST constantly outrunsthe algorithm on data sets of various settings. B. Performance of CREST with L1 Distance In this set of experiments, we compare the running time VIII. EXPERIMENTS of three algorithms: the baseline algorithm (BA), the CREST Inthissection,weexperimentallyevaluatetheperformance algorithm with only the RNN computation optimization, de- of CREST. We use both real and synthetic data sets. Two noted by CREST-A, and the CREST algorithm with both real data sets, NYC and LA, contain points-of-interest in RNN computation optimization and repetitive region labeling New York City and Los Angeles, respectively (we obtain the optimization. We cannot evaluate the effect of the latter op- data sets from the authors of [2]). Table II lists the details timization alone, since it is built upon the former optimiza- of the real data sets. We also generate two synthetic data tion. We compute the influence by (i) the size of RNN sets sets, Uniform and Zipfian, which contain points of uniform and (ii) the function considering the capacity constraints of and Zipfian distributions, respectively.The skew coefficientin facilities[22](describedintheIntroduction),respectively.The Zipfian distribution is set to 0.2. In the experiments, we use resultsofthelatterfunctionareconsistentwiththoseusingthe the bichromatic RNNs since the monochromatic type is just size and hence are omitted due to space limitation.