TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1

Probabilistic Reverse Nearest Neighbor Queries on Uncertain Data

Muhammad Aamir Cheema, Xuemin Lin, Wei Wang, Wenjie Zhang and Jian Pei, Senior Member, IEEE

Abstract—Uncertain data is inherent in various important applications, and the reverse nearest neighbor (RNN) query is an important query type for many of them. While many different types of queries have been studied on uncertain data, there is no previous work on answering RNN queries on uncertain data. In this paper, we formalize the probabilistic reverse nearest neighbor query, which retrieves the objects from the uncertain data that have a higher probability than a given threshold to be the RNN of an uncertain query object. We develop an efficient algorithm, based on several novel pruning approaches, that solves probabilistic RNN queries on multidimensional uncertain data. The experimental results demonstrate that our algorithm is even more efficient than a sampling-based approximate algorithm in most cases and is highly scalable.

Index Terms—Query Processing, Reverse Nearest Neighbor Queries, Uncertain Data, Spatial Data.

1 INTRODUCTION

Given a set of data points P and a query point q, a reverse nearest neighbor query is to find every point p ∈ P such that dist(p,q) ≤ dist(p,p′) for every p′ ∈ (P − p). In this paper, we formalize and study the probabilistic RNN query, which finds the probable reverse nearest neighbors on uncertain data with probability higher than a given threshold.

Uncertain data is inherent in many important applications such as sensor databases, moving object databases, market analysis, and quantitative economic research. In these applications, the exact values of data might be unknown due to limitations of measuring equipment, delayed data updates, incompleteness, or data anonymization to preserve privacy. Usually an uncertain object is represented in one of two ways: 1) using a probability density function [4], [6] (continuous case), or 2) using all possible instances [22], [17], each with an assigned probability (discrete case). In this paper, we investigate the discrete case.

Probabilistic RNN queries have many applications. Consider the example in Fig. 1, where three residential blocks A, B and Q are shown. The houses within each block are shown as small circles, and the centroid of each residential block is shown as a hollow triangle. For privacy reasons, we may only know the residential blocks in which the people live (or their zipcodes), but we do not have any information about the exact addresses of their houses. Conventional queries on these residential blocks may use distance functions such as the distance between the centroids of two blocks. However, the results provided by the conventional queries may not be meaningful. There are two major limitations of conventional queries on such data¹.

1) The conventional queries do not consider the locations of houses within each residential block, which affects the quality of the reported results. For instance, if the distance between the centroids of two residential blocks is used as the distance function, the closest block of A is B (in other words, the person living in A is not the RNN of someone living in Q). However, if the locations of houses within each block are considered, we find that for most of the houses in A, the houses in Q are closer than the houses in B. For example, the distance of a1 to every house in Q is less than its distance to any house in B. Similarly, the distance of a2 to every house in Q is less than its distance to b1. This means that a person living in A has a high chance of being the RNN of some person living in Q.

2) Conventional queries do not report the probability of an object being the answer (an object is either an RNN or not an RNN). On the other hand, probabilistic reverse nearest neighbor queries provide more information by including the probability of an object being the answer. For example, a probabilistic reverse nearest neighbor query reports that the probability of a person living in block A being the RNN of a person living in Q is 0.75 according to the possible world semantics (see Example 1). This type of result is more meaningful and interesting.
We can assign a probability to each possible location of a person within his residential block; e.g., the exact location of a person living in A is a1 with probability 0.5.

Probabilistic RNN queries have applications in privacy-preserving location-based services, where the exact location of every user is obfuscated into a cloaked spatial region [16]. However, the users might still be interested in finding their reverse nearest neighbors. We can model this problem as finding probabilistic reverse nearest neighbors by assigning a confidence level to some possible locations of every user within his/her respective cloaked spatial region. Probabilistic RNN queries may also be useful to identify similar trading trends in stock markets. Each stock has many deals. A deal (transaction) is recorded by the price (per share) and the volume (number of shares). For a given stock s, clients may be interested in finding all other stocks that have trading trends more similar to s than to the others. In such an application, we can treat each stock as an uncertain object and its deals as its uncertain instances. There are a number of other applications for queries that consider the proximity of uncertain objects [4], [6], [14], and the applications of RNNs on uncertain objects are very similar.

• Muhammad Aamir Cheema is with the School of Computer Science and Engineering, University of New South Wales, Australia. E-mail: [email protected]
• Xuemin Lin, Wei Wang and Wenjie Zhang are with the University of New South Wales and NICTA. E-mails: {lxue,weiw,zhangw}@cse.unsw.edu.au
• Jian Pei is with Simon Fraser University, Canada. E-mail: [email protected]

1. Other distance functions like maximum distance, minimum distance and aggregated distance also have these limitations.

Fig. 1: An example of a probabilistic RNN query
Fig. 2: Any point in the shaded area cannot be RNN of q1

Probabilistic RNN query processing poses new challenges in designing new efficient algorithms. Although RNN query processing has been extensively studied based on various pruning methods, these pruning techniques either cannot be directly applied to probabilistic RNN queries or become inefficient. For example, the perpendicular bisectors adopted in the state-of-the-art RNN query processing algorithm [20] assume that objects are spatial points. In contrast, uncertain objects have arbitrary shapes of their instance regions. In addition, applying the pruning rules on the instance level of uncertain objects is extremely expensive, as each uncertain object usually has a large number of instances.

Another unique challenge in probabilistic RNN queries is that the verification of candidate objects usually incurs substantial cost due to the large number of instances in each uncertain object. By verification, we mean computing the exact probability of an object being the RNN of the query and testing whether it qualifies the probabilistic threshold or not. Note that instances from objects that are close to the candidate objects also need to be considered in the verification phase.

In this paper, we formalize the problem of probabilistic RNN queries on uncertain data using the semantics of possible worlds. We present a new probabilistic RNN query processing framework that employs (i) several novel pruning approaches exploiting the probability threshold and geometric, topological and metric properties, and (ii) a highly optimized verification method that is based on careful upper and lower bounding of the RNN probability of candidate objects.

Our contributions in this paper are as follows:
• To the best of our knowledge, we are the first to formalize the problem of probabilistic reverse nearest neighbors based on the possible worlds semantics.
• We develop an efficient query processing algorithm for probabilistic RNN queries. The new method is based on non-trivial pruning rules especially designed for uncertain data and the probability threshold. Although in this paper we focus on the discrete case, where each object is represented by a set of probable instances, our pruning rules can be applied to the continuous case, where each uncertain object is represented by a probability density function.
• To better understand the performance of our proposed approach, we devise a baseline exact algorithm and a sampling-based approximate algorithm. Experimental results on synthetic and real datasets show that our algorithm is much more efficient than the baseline algorithm, performs better than the approximate algorithm in most of the cases, and is scalable.

The rest of the paper is organized as follows: In Section 2, we formalize the problem and present the preliminaries and notations used in this paper. Our proposed pruning rules are presented in Section 3. Section 4 presents our proposed algorithm for answering probabilistic reverse nearest neighbor queries. Section 5 evaluates the proposed methods with extensive experiments, and the related work is presented in Section 6. Section 7 concludes the paper.

2 PROBLEM DEFINITION AND PRELIMINARIES

2.1 Problem Definition

Given a set of data points P and a query point q, a conventional reverse nearest neighbor query is to find every point p ∈ P such that dist(p,q) ≤ dist(p,p′) for every p′ ∈ (P − p).

Now we define probabilistic reverse nearest neighbor queries. Consider a set of uncertain objects U = {U1, ..., Un}.
Each uncertain object Ui consists of a set of instances {u1, ..., um}. Each instance uj is associated with a probability puj, called its appearance probability, with the constraint that Σ_{j=1}^{m} puj = 1. We assume that the probability of each instance is independent of the other instances. A possible world W = {u1, ..., un} is a set of instances with one instance from each uncertain object. The probability of W to appear is P(W) = Π_{i=1}^{n} pui. Let Ω be the set of all possible worlds; then Σ_{W∈Ω} P(W) = 1.

The probability RNN_Q(Ui) of any uncertain object Ui to be the RNN of an uncertain object Q in all possible worlds can be computed as

RNN_Q(Ui) = Σ_{(u,q), u∈Ui, q∈Q} p_q · p_u · RNN_q(u)    (1)

RNN_q(u) is the probability that an instance u ∈ Ui is the RNN of an instance q ∈ Q in any possible world W, given that both u and q appear in W:

RNN_q(u) = Π_{V∈(U−Ui−Q)} (1 − Σ_{v∈V, dist(u,v)<dist(u,q)} p_v)    (2)

Given a set of uncertain objects U and a probability threshold ρ, the problem of finding the probabilistic reverse nearest neighbors of an uncertain object Q is to find every uncertain object Ui ∈ U such that RNN_Q(Ui) ≥ ρ.

Example 1: Consider the example of Fig. 1 where the uncertain objects A, B and Q are shown. Assume that the appearance probability of each instance is 0.5. According to Equation (2), RNN_q1(a1) = 1 because a1 is closer to q1 than it is to b1 or b2. Also, RNN_q1(a2) = 1 − 0.5 because dist(a2,b2) < dist(a2,q1). Note that b1 does not affect the probability of a2 being the RNN of q1 because dist(a2,b1) > dist(a2,q1). Similarly, RNN_q2(a1) = 1 and RNN_q2(a2) = 0.5. According to Equation (1), RNN_Q(A) = (0.5×0.5×1) + (0.5×0.5×1) + (0.5×0.5×0.5) + (0.5×0.5×0.5) = 0.75. The RNN probability of B can be computed similarly, and RNN_Q(B) = 0.25. If the probability threshold ρ is 0.7, then the object A is reported as the result. ■
2.2 Preliminaries

The filter-and-refine paradigm is widely adopted in processing RNN queries in spatial databases. The idea is to quickly prune away points which are closer to another point (usually called the filtering point) than to the query point. The state-of-the-art pruning rule is based on the perpendicular bisector [20]. The paradigm consists of two phases: the pruning phase and the verification phase. Hence, some objects are used to filter other objects and are called filtering objects. Objects that cannot be filtered are called candidate objects. The pruning in RNN query processing involves three objects: the query, the filtering object and a candidate object. We use R_Q, R_fil and R_cnd to denote the smallest hyper-rectangles enclosing the uncertain query object, the filtering object and the candidate object, respectively. Table 1 defines the symbols and notations used throughout this paper.

TABLE 1: Notations
U : an uncertain object
u_i : the i-th instance of uncertain object U
B_{x:q} : the perpendicular bisector between point x and q
H_{x:q} : the half-space defined by B_{x:q} containing point x
H_{q:x} : the half-space defined by B_{x:q} containing point q
H_{a:b} ∩ H_{c:d} : the intersection of the two half-spaces
P[i] : the value of point P in the i-th dimension
R_U : the minimum bounding rectangle (MBR) enclosing all instances of an uncertain object U

3 PRUNING RULES

Although the pruning for RNN query processing in spatial databases has been well studied, it is non-trivial to devise pruning strategies for RNN query processing on uncertain data. For example, if we naïvely use every instance of a filtering object to perform bisector pruning [20], it will incur a huge computation cost due to the large number of instances in each uncertain object. Instead, we devise a non-trivial generalization of bisector pruning for minimum bounding rectangles (MBRs) of uncertain objects, based on a novel notion of normalized half-space.

Verification is extremely expensive in probabilistic RNN query processing because, in order to verify an object as a probabilistic RNN, we need to take into consideration not only the instances of this object but also the instances of the query object and other nearby objects. Hence it is important to devise efficient pruning rules to reduce the number of objects that need verification. In this section, we present several pruning rules from the following orthogonal perspectives:
• Half-space based pruning that exploits geometric properties (Section 3.1)
• Dominance based pruning that exploits topological properties (Section 3.2)
• Metric based pruning (Section 3.3)
• Probabilistic pruning that exploits the probability threshold (Section 3.4)

3.1 Half-space Pruning

Consider a query point q and a filtering object U that has n instances {u1, u2, ..., un}. Let H_{ui:q} be the half-space between q and ui. Any instance u ∉ U that lies in ∩_{i=1}^{n} H_{ui:q} has zero probability to be the RNN of q because, by the property of H_{ui:q}, u is closer to every ui than to q.

Example 2: Consider the example of Fig. 2, where the bisectors between q1 and the instances of A are drawn and the half-spaces H_{a1:q1} and H_{a2:q1} are shown. The intersection of the two half-spaces is shown shaded, and any point that lies in the shaded area is closer to both a1 and a2 than to q1. For this reason, b2 cannot be the RNN of q1 in any possible world. ■

This pruning is very expensive because we need to compute the intersection of all half-spaces H_{ui:q} for every ui ∈ U. Below we present our pruning rules, which utilize the MBR of the entire filtering object, R_fil, to prune the candidate object with respect to a query instance q or the MBR of the uncertain query object Q.

3.1.1 Pruning using R_fil and an instance q

First we present the intuition. Consider the example of Fig. 3, where we know that the point p lies on a line MN but we do not know the exact location of p on this line. The bisectors between q and the end points of the line (M and N) can be used to prune the area safely. In other words, any point that lies in the intersection of the half-spaces H_{M:q} and H_{N:q} (grey area) can never be the RNN of q. It can be proved that, whatever the location of point p on the line MN, the half-space H_{p:q} always contains H_{M:q} ∩ H_{N:q}. Hence any point p′ that lies in H_{M:q} ∩ H_{N:q} would always be closer to p than to q and for this reason cannot be the RNN of q.

Fig. 3: The exact location of the point p on line MN is not known
Fig. 4: Any point in the shaded area cannot be RNN of q in any possible world

Based on the above observation, below we present a pruning rule for the case when the exact location of a point p is unknown within some hyper-rectangle R_fil.

PRUNING RULE 1: Let R_fil be a hyper-rectangle and q be a query point. For any point p that lies in ∩_{i=1}^{2^d} H_{C_i:q} (C_i is the i-th corner of R_fil), dist(p,q) > maxdist(p, R_fil) and thus p cannot be the RNN of q.

The pruning rule is based on Lemma 4, which is proved in our technical report [5].

Consider the example of Fig. 4. Any point that lies in the shaded area is closer to every point in rectangle R_fil than to q. Note that if R_fil is a hyper-rectangle that encloses all instances of the filtering object U_i, then any instance u ∈ U_{j, j≠i} that lies in ∩_{i=1}^{2^d} H_{C_i:q} can never be the RNN of q in any possible world.

3.1.2 Pruning using R_fil and R_Q

Pruning rule 1 prunes the area such that any point lying in it can never be the RNN of some instance q. However, the points in the pruned area may still be the RNNs of other instances of the query. Now, we present a pruning rule that prunes the area using R_fil and R_Q such that any point that lies in the pruned area cannot be the RNN of any instance of Q.

Consider the example of Fig. 5, where the exact location of the query point q on line MN is not known. Unfortunately, in contrast to the previous case of Fig. 3, the bisectors between p and the end points of the line MN do not define the area that can be pruned. If we prune the area H_{p:M} ∩ H_{p:N} (the grey area), we may miss some point p′ that is the RNN of q. Fig. 5 shows a point p′ that is the RNN of q but lies in the shaded area. This is because the half-space H_{p:q} does not contain H_{p:M} ∩ H_{p:N}. This makes the pruning using R_fil and R_Q challenging.

Fig. 5: Any point in the dotted area can never be RNN of q
Fig. 6: Antipodal corners and normalized half-spaces

Note that if H_{p:N} is moved such that it passes through the point where H_{p:q} intersects H_{p:M}, then H_{p:M} ∩ H_{p:N} would be contained by H_{p:q}. We note that in the worst case, when p lies infinitesimally close to point M, H_{p:q} and H_{p:M} intersect each other at the point c, which is the centre of the line joining p and M. Hence, in order to safely prune the area, the half-space H_{p:N} should be moved such that it passes through the point c. The point c is shown in Fig. 5. A half-space that is moved to the point c is called a normalized half-space, and a half-space H_{p:N} that is normalized is denoted as H′_{p:N}. Fig. 5 shows H′_{p:N} in broken line, and H′_{p:N} ∩ H_{p:M} (the dotted shaded area) can be safely pruned.

The correctness proof of the above observation is lengthy though it is quite intuitive. Thus we omit it from the paper. The interested readers may read the proof in our technical report [5] for a more general case when both the query and data objects are represented by hyper-rectangles in d-dimensional space (Lemma 5). Before we present our pruning rule for the general case, which uses 2^d half-spaces to prune the area using the hyper-rectangles R_Q and R_fil, we define the following concepts:

Antipodal Corners: Let C be a corner of rectangle R1 and C′ be a corner in R2. The two corners are called antipodal corners² if for every dimension i where C[i] = R1_L[i] we have C′[i] = R2_H[i], and for every dimension j where C[j] = R1_H[j] we have C′[j] = R2_L[j]. Fig. 6 shows two rectangles R1 and R2. The corners D and O are antipodal corners. Similarly, the other pairs of antipodal corners are (B, M), (C, N) and (A, P).

Antipodal Half-Space: A half-space that is defined by the bisector between two antipodal corners is called an antipodal half-space. Fig. 6 shows two antipodal half-spaces H_{M:B} and H_{P:A}.

Normalized Half-Space: Let B and M be two points in the hyper-rectangles R1 and R2, respectively. The normalized half-space H′_{M:B} is a half-space defined by the bisector between M and B that passes through a point c such that c[i] = (R1_L[i] + R2_L[i])/2 for all dimensions i for which B[i] > M[i], and c[j] = (R1_H[j] + R2_H[j])/2 for all dimensions j for which B[j] ≤ M[j]. Fig. 6 shows two normalized (antipodal) half-spaces H′_{M:B} and H′_{P:A}; the point c for each half-space is also shown. The inequalities (3) and (4) define the half-space H_{M:B} and its normalized half-space H′_{M:B}, respectively:

Σ_{i=1}^{d} (B[i] − M[i]) · x[i] < Σ_{i=1}^{d} (B[i] − M[i]) · (B[i] + M[i]) / 2    (3)

Σ_{i=1}^{d} (B[i] − M[i]) · x[i] < Σ_{i=1}^{d} (B[i] − M[i]) · c[i], where c[i] = (R1_L[i] + R2_L[i])/2 if B[i] > M[i], and c[i] = (R1_H[i] + R2_H[i])/2 otherwise    (4)

Note that the right hand side of inequality (3) cannot be smaller than the right hand side of inequality (4). For this reason, H′_{M:B} ⊆ H_{M:B}.

Now, we present our pruning rule.

PRUNING RULE 2: Let R_Q and R_fil be two hyper-rectangles. For any point p that lies in ∩_{i=1}^{2^d} H′_{C_i:C′_i}, mindist(p, R_Q) > maxdist(p, R_fil), where H′_{C_i:C′_i} is the normalized half-space between C_i (the i-th corner of the rectangle R_fil) and its antipodal corner C′_i in R_Q.

The proof of correctness is non-trivial and can be found in Lemma 5 in our technical report [5].

2. R_L[i] (resp. R_H[i]) is the lowest (resp. highest) coordinate of a hyper-rectangle R in the i-th dimension.
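The membership tests behind inequalities (3) and (4), and the corner intersection of Pruning Rule 1, are cheap dot-product checks. The following is our own illustrative sketch (function names are ours, not the paper's):

```python
import itertools

def in_halfspace(x, M, B):
    """Inequality (3): x lies in H_{M:B}, i.e., strictly closer to M than to B."""
    d = len(x)
    lhs = sum((B[i] - M[i]) * x[i] for i in range(d))
    rhs = sum((B[i] - M[i]) * (B[i] + M[i]) / 2 for i in range(d))
    return lhs < rhs

def in_normalized_halfspace(x, M, B, R1, R2):
    """Inequality (4): x lies in H'_{M:B}. R1 and R2 are (low, high) corner
    pairs of the rectangles containing B and M, respectively."""
    (lo1, hi1), (lo2, hi2) = R1, R2
    d = len(x)
    lhs = sum((B[i] - M[i]) * x[i] for i in range(d))
    rhs = sum((B[i] - M[i]) *
              ((lo1[i] + lo2[i]) / 2 if B[i] > M[i] else (hi1[i] + hi2[i]) / 2)
              for i in range(d))
    return lhs < rhs

def corners(lo, hi):
    """All 2^d corners of the hyper-rectangle [lo, hi]."""
    return list(itertools.product(*zip(lo, hi)))

def pruned_by_rule1(p, q, lo, hi):
    """Pruning Rule 1: p cannot be the RNN of q if p lies in the intersection
    of H_{C_i:q} over all 2^d corners C_i of R_fil = [lo, hi]."""
    return all(in_halfspace(p, c, q) for c in corners(lo, hi))
```

Because the right-hand side of (4) is never larger than that of (3), every point accepted by `in_normalized_halfspace` is also accepted by `in_halfspace`, which is the containment H′_{M:B} ⊆ H_{M:B} used by Pruning Rule 2.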
Consider the example of Fig. 7, where the normalized antipodal half-spaces are drawn and their intersection is shown shaded. Any point that lies in the shaded area is closer to every point in rectangle R_fil than to every point in rectangle R_Q. Note that if R_fil and R_Q are the MBRs enclosing all instances of an uncertain object U_i and the query object Q, respectively, any instance u ∈ U_{j,j≠i} that lies in the pruned region, ∩_{i=1}^{2^d} H′_{C_i:C′_i}, cannot be the RNN of any instance q ∈ Q in any possible world. Even if the pruning region only partially overlaps with R_fil, we can still trim the part of any other hyper-rectangle R_{U_j, j≠i} that falls in the pruned region. It is known that exact trimming becomes inefficient in high-dimensional space; therefore, we adopt the loose trimming of R_cnd proposed in [20].

Fig. 7: Any point in the shaded area can never be RNN of any q ∈ Q
Fig. 8: Clipping the part of the candidate object R_cnd that cannot be pruned

The overall half-space pruning algorithm that integrates pruning rules 1 and 2 is illustrated in Algorithm 1. For each half-space, we use the clipping algorithm in [10] to find a remnant rectangle Rem_i ⊆ R_cnd that cannot be pruned (lines 4 and 7). After all the half-spaces have been used for pruning, we calculate the MBR Rem ⊆ R_cnd as the minimum bounding hyper-rectangle covering every Rem_i. As such, we trim the original R_cnd to Rem.

Algorithm 1: hspace_pruning(Q, R_fil, R_cnd)
Input: Q: an MBR containing the instances of Q (or a query instance); R_fil: the MBR to be used for trimming; R_cnd: the candidate MBR to be trimmed
Description:
1: Rem = ∅ // remnant rectangle
2: if Q is a point then
3:   for each corner C_i of R_fil do
4:     Rem_i = clip(R_cnd, H_{C_i:Q}) // clipping algorithm [10]
5: else if Q is a hyper-rectangle then
6:   for each corner C_i of R_fil, with C′_i the antipodal corner of C_i in Q, do
7:     Rem_i = clip(R_cnd, H′_{C_i:C′_i}) // clipping algorithm [10]
8: enlarge Rem to enclose Rem_i
9: if Rem = R_cnd then
10:   return R_cnd
11: return Rem

For better illustration, we zoom Fig. 7 and show the clipping of a hyper-rectangle R_cnd in Fig. 8. The algorithm returns Rem_1 and Rem_2 (the rectangles shown with broken lines) when H′_{M:B} and H′_{P:A} are the parameters to the clipping algorithm, respectively. For the half-spaces H′_{N:C} and H′_{O:D}, the whole hyper-rectangle R_cnd can be pruned, so the algorithm returns φ. The remnant hyper-rectangle Rem is an MBR that encloses the remnant rectangles Rem_1 and Rem_2. Note that at any stage, if the remnant rectangle Rem becomes equal to R_cnd, the clipping by the other bisectors is not needed, so R_cnd is returned without further clipping (line 10).

3.2 Dominance Pruning

We first give the intuition behind this pruning rule. Fig. 9 shows another example of pruning by using pruning rule 2 in two-dimensional space. The normalized half-spaces are defined such that if R_fil is fully dominated³ by R_Q in all dimensions, then all the normalized antipodal half-spaces meet at a point F_p, as shown in Fig. 9. We also observe that, for the case when R_fil is fully dominated by R_Q, the angle between the half-spaces that define the pruned area (shown in grey) is always greater than 90°. Based on these observations, it can be verified that the space dominated by F_p (the dotted-shaded area) can be pruned⁴.

Fig. 9: Pruning area of half-space pruning and dominance pruning
Fig. 10: Dominance pruning: the shaded areas can be pruned

Let R_Q be the MBR containing the instances of Q. We can obtain the 2^d regions as shown in Fig. 10. Let R_{U_i} be the MBR of a filtering object that lies completely in one of the 2^d regions. Let f be the furthest corner of R_{U_i} from R_Q and n be the nearest corner of R_Q from f. The frontier point F_p lies at the centre of the line joining f and n.

PRUNING RULE 3: Any instance u ∈ U_{j,j≠i} that is dominated by the frontier point F_p of a filtering object cannot be the RNN of any q ∈ Q in any possible world.

Fig. 10 shows four examples of dominance pruning (one in each region). In each partition, the shaded area is dominated by F_p and can be pruned. Note that if R_fil is not fully dominated by R_Q, we cannot use this pruning rule because the normalized antipodal half-spaces in this case do not meet at the same point. For example, the four normalized antipodal half-spaces intersect at two points in Fig. 7. In general, the pruning power of this rule is less than that of the half-space pruning. Fig. 9 shows the area pruned by the half-space pruning (shaded area) and the dominance pruning (dotted area).

The main advantage of this pruning rule is that the pruning procedure is computationally more efficient than the half-space pruning, as checking the dominance relationship and trimming the hyper-rectangles is easier.

3. If every point in R1 is dominated (dominance relationship as defined in skylines) by every point in R2, we say that R1 is fully dominated by R2.
4. The formal proof is given in Lemma 6 of our technical report [5].

3.3 Metric Based Pruning

PRUNING RULE 4: An uncertain object R_cnd can be pruned if maxdist(R_cnd, R_fil) < mindist(R_cnd, R_Q).

This pruning approach is the least expensive. Note that it cannot prune a part of R_cnd; i.e., it either invalidates all the instances of R_cnd or does nothing.

3.4 Probabilistic Pruning

Note that we did not discuss the probability threshold while presenting the previous pruning rules. In this section, we present a pruning rule that exploits the probability threshold and embeds it in all previous pruning rules to increase their pruning power.

Fig. 11: Regions pruned by R_Q and its subset R′_Q
Fig. 12: Probabilistic pruning

A simple exploitation of the probability threshold is to trim the candidate object using the previous pruning rules and then prune the object if the accumulative appearance probability of the instances within its remnant rectangle is less than the threshold.
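The distance bounds behind Pruning Rule 4, and the frontier-point test of Pruning Rule 3, reduce to a few per-dimension comparisons. Below is our own sketch; for Rule 3 it covers only the one region where R_fil lies entirely above and to the right of R_Q (the other 2^d − 1 regions are symmetric), and all names are ours:

```python
def mindist(r1, r2):
    """Minimum distance between two axis-aligned hyper-rectangles,
    each given as a (low, high) corner pair."""
    (lo1, hi1), (lo2, hi2) = r1, r2
    s = 0.0
    for i in range(len(lo1)):
        gap = max(lo2[i] - hi1[i], lo1[i] - hi2[i], 0.0)
        s += gap * gap
    return s ** 0.5

def maxdist(r1, r2):
    """Maximum distance between any point of r1 and any point of r2."""
    (lo1, hi1), (lo2, hi2) = r1, r2
    s = 0.0
    for i in range(len(lo1)):
        d = max(abs(hi2[i] - lo1[i]), abs(hi1[i] - lo2[i]))
        s += d * d
    return s ** 0.5

def metric_pruned(r_cnd, r_fil, r_q):
    """Pruning Rule 4: every candidate instance is certainly closer to the
    filtering object than to any query instance."""
    return maxdist(r_cnd, r_fil) < mindist(r_cnd, r_q)

def frontier_point(r_fil, r_q):
    """Frontier point for the region where r_fil is fully dominated by r_q:
    the midpoint of r_fil's furthest corner from r_q and r_q's nearest
    corner to it (sketch for this one region only)."""
    (_, hi_f), (_, hi_q) = r_fil, r_q
    return tuple((a + b) / 2 for a, b in zip(hi_f, hi_q))

def dominated(fp, u):
    """Pruning Rule 3 (same region): u is dominated by the frontier point."""
    return all(u[i] >= fp[i] for i in range(len(u)))
```

As the paper notes, these checks are far cheaper than clipping with 2^d normalized half-spaces, which is why they are applied first.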
Next, we present a more powerful pruning rule that is based on estimating an upper bound of the RNN probability of candidate objects.

First, we present an observation deduced from Lemma 5 in our technical report [5]. In the previous pruning rules, we prune some area using the MBR of a query object, R_Q, and a filtering object, R_fil. We observe that the area pruned by using R′_Q and R′_fil always contains the area pruned by R_Q and R_fil, where R′_Q ⊆ R_Q and R′_fil ⊆ R_fil. Fig. 11 shows an example. The shaded area is pruned when R′_Q and R_fil are used for pruning, and the dotted shaded area is pruned when R_Q and R_fil are used. Note that this observation also holds for the dominance pruning.

We can use the observation presented above to prune the objects that cannot have RNN probability greater than the threshold. First, we give a formal description of this pruning rule and then we give an example.

PRUNING RULE 5: Let the instances of Q be divided into n disjoint⁵ sets {Q_1, Q_2, ..., Q_n} and let R_{Q_i} be the minimum bounding rectangle enclosing all instances in Q_i. Let {R_{cnd_1}, R_{cnd_2}, ..., R_{cnd_n}} be the set of bounding rectangles such that each R_{cnd_i} contains the instances of the candidate object that cannot be pruned for Q_i using any of the pruning rules. Let P_{R_{Q_i}} and P_{R_{cnd_i}} be the total appearance probabilities of the instances in Q_i and R_{cnd_i}, respectively. If Σ_{i=1}^{n} (P_{R_{cnd_i}} · P_{R_{Q_i}}) < ρ, the candidate object can be pruned.

Pruning rule 5 computes an upper bound of the RNN probability of the candidate object by assuming that all instances in R_{cnd_i} are RNNs of all instances in Q_i. The candidate object R_cnd can be safely pruned if this upper bound is still less than the threshold.

Example 3: Fig. 12 shows the MBRs of the query object R_Q and a candidate object R_cnd along with their instances (q_1 to q_5 and u_1 to u_4). Assume that all instances within an object have equal appearance probabilities (e.g., p_{q_i} = 0.2 for every q_i and p_{u_i} = 0.25 for every u_i). Suppose that no part of R_cnd can be pruned using R_Q and any filtering object R_fil (for better illustration, the filtering object is not shown). We prune R_cnd using the rectangle R_{Q_1} that is contained by R_Q. This trims R_cnd, and the remnant rectangle R_1 is obtained. Similarly, R_2 is the remnant rectangle when the pruning rules are applied for R_{Q_2}. Note that only the instances in R_1 (u_1 and u_2) can be the RNN of instances in R_{Q_1} (q_3, q_4 and q_5). Similarly, no instance can be the RNN of any instance in R_{Q_2} because R_2 is empty. So the maximum RNN probability of R_cnd is (0.6 × 0.5) + (0.4 × 0) = 0.3. If the probability threshold ρ is greater than 0.3, we can prune R_cnd. Otherwise, we can continue to trim R_cnd by using the smaller rectangles contained in R_{Q_1}. ■

In our implementation, we build an R-tree on the query object, and the pruning rule is applied iteratively using the MBRs of the children. For more details, please see Algorithm 5.

Although the smaller rectangles R′_fil contained in R_fil can also be used, we do not use them because, unlike the query object, there may be many filtering objects. Hence, using the smaller rectangles for each of the filtering objects would make this pruning rule very expensive in practice (more expensive than the efficient verification presented in Section 4.3).

5. We only require the instances of Q to be disjoint. The pruning rule can be applied even when the minimum bounding rectangles R_{Q_i} overlap each other, as shown in Fig. 12.

3.5 Integrating the pruning rules

Algorithm 2 is the implementation of Pruning rules 1–4. Specifically, we apply the pruning rules in increasing order of their computational costs (i.e., from Pruning rule 4 to 1). While the simple pruning rules are not as restricting as the more expensive ones, they can quickly discard many non-promising candidate objects and save the overall computational time.
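The upper bound of Pruning Rule 5 is a one-line computation once the per-partition probability masses are known. A minimal sketch (ours, with the numbers of Example 3 plugged in):

```python
def rule5_upper_bound(partitions):
    """Pruning Rule 5: sum of P(R_Q_i) * P(R_cnd_i) over the disjoint query
    partitions; the candidate can be pruned when this bound is below rho.
    'partitions' is a list of (P_RQi, P_Rcndi) pairs."""
    return sum(p_q * p_cnd for p_q, p_cnd in partitions)

# Example 3: Q1 holds q3..q5 (0.2 each, mass 0.6) and its remnant R1 holds
# u1, u2 (0.25 each, mass 0.5); Q2 holds q1, q2 (mass 0.4) but R2 is empty.
bound = rule5_upper_bound([(0.6, 0.5), (0.4, 0.0)])
print(bound)  # 0.3, matching Example 3
```

For any threshold ρ > 0.3, the candidate of Example 3 is pruned without ever computing its exact RNN probability.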
Fig. 13: Rcnd can be pruned by using R1 and R2 together.

Algorithm 2 : Prune(Q, Sfil, Rcnd)
Input: RQ: an MBR containing instances of Q; Sfil: a set of MBRs to be used for trimming; Rcnd: the candidate MBR to be trimmed
Description:
1: for each Rfil in Sfil do
2:    if maxdist(Rcnd, Rfil) < mindist(RQ, Rcnd) then // Pruning rule 4
3:       return φ
4:    if mindist(Rcnd, Rfil) > maxdist(RQ, Rcnd) then
5:       Sfil = Sfil − Rfil // Rfil cannot prune Rcnd
Rem = Rcnd
6: for each Rfil in Sfil do
7:    if Rfil is fully dominated by RQ in a partition p then // Pruning rule 3
8:       if some part of Rem lies in the partition p then
9:          Rem = the part of Rem not dominated by Fp
10:         if (Rem = φ) then return φ
11: for each Rfil in Sfil do
12:    Rem = hspace_pruning(RQ, Rfil, Rem) // Pruning rules 1 and 2
13:    if (Rem = φ) then return φ
14: return Rem

It is important to use all the filtering objects to filter a candidate object. Consider the example in Fig. 13: Rcnd cannot be pruned by either R1 or R2 alone, but it is pruned when both of them are considered.

4 PROPOSED SOLUTION

In this section, we present our algorithm to find the probabilistic RNNs of an uncertain query object Q.

Algorithm 3 : Answering Probabilistic RNN
Input: Q: uncertain query object; ρ: probability threshold
Output: all objects that have higher than ρ probability to be the RNN of Q
Description:
1: Shortlisting: shortlist candidate and filtering objects (Algorithm 4)
2: Refinement: trim candidate objects using disjoint subsets of Q and apply pruning rule 5 (Algorithm 5)
3: Verification: compute the exact probability of each candidate and report the results

4.1 Shortlisting

In this phase (Algorithm 4), the global R-tree is traversed to shortlist the objects that may possibly be the RNNs of Q. The MBR Rcnd of each shortlisted candidate object is stored in a set of candidate objects called Scnd. Initially, the root entry of the R-tree is inserted in a min-heap H. Each entry e is inserted in the heap with key maxdist(e, RQ), because a hyper-rectangle that has a smaller maximum distance to RQ is likely to prune a larger area and has higher chances of being in the result.

Algorithm 4 : Shortlisting
1: Sfil = ∅, Scnd = ∅
2: initialize a min-heap H with the root entry of the global R-tree
3: while H is not empty do
4:    de-heap an entry e
5:    if (Rem = prune(RQ, Sfil, e)) ≠ φ then
6:       if e is a data object then
7:          Scnd = Scnd ∪ {e}
8:       else if e is a leaf or intermediate node then
9:          Sfil = Sfil − {e}
10:         for each data entry or child c in e do
11:            insert c into H with key maxdist(c, RQ)
12:            Sfil = Sfil ∪ {c}

We try to prune every de-heaped entry e (line 5) by using the pruning rules presented in the previous section. If e is a data object and cannot be pruned, we insert it into Scnd. Otherwise, if e is an intermediate or leaf node, we insert its children c into the heap H with key maxdist(c, RQ). Note that an entry e can be removed from Sfil (line 9) if at least one of its children is inserted in Sfil, because the area pruned by an entry e is always contained in the area pruned by its child (Lemma 5 in our technical report [5]).

Two subtle optimizations in the algorithm are:
• If mindist(Rcnd, Rfil) > maxdist(RQ, Rcnd) for a given MBR Rfil, then Rfil cannot prune any part of Rcnd. Hence, such an Rfil is not considered for dominance and half-space pruning (lines 4-5). However, Rfil may still prune some other candidate objects, so we remove it only from a local set of filtering objects, Sfil. This optimization reduces the cost of dominance and half-space pruning.
• If the frontier point Fp1 of a filtering object Rfil1 is dominated by the frontier point Fp2 of another filtering object Rfil2, then Fp1 can be removed from Sfil, because the area pruned by Fp1 can also be pruned by Fp2. However, note that a frontier point cannot be used to prune its own rectangle. Therefore, before deleting Fp1, we use it to prune the rectangle belonging to Fp2. This optimization reduces the cost of dominance pruning.

4.2 Refinement

In this phase (Algorithm 5), we refine the set of candidate objects by using pruning rule 5. More specifically, we descend into the R-tree of Q, trim each candidate object Rcnd against the children of Q, and apply pruning rule 5.
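The best-first traversal at the heart of Algorithm 4 can be sketched as below. This is a simplified illustration: nodes are plain dictionaries, prune() stands in for Algorithm 2, and the Sfil bookkeeping of lines 9 and 12 is omitted.

```python
import heapq

def maxdist(rect, RQ):
    # Maximum distance between two axis-aligned rectangles, each given per
    # dimension as a (low, high) interval.
    return sum(max(abs(lo - q_hi), abs(hi - q_lo)) ** 2
               for (lo, hi), (q_lo, q_hi) in zip(rect, RQ)) ** 0.5

def shortlist(root, RQ, prune):
    """Traverse the global R-tree in ascending order of maxdist(e, RQ).
    A node is a dict with 'rect' and optional 'children'; an entry without
    children is a data object. prune(rect) returns True when the entry can
    be discarded by the pruning rules."""
    S_cnd = []
    heap, counter = [(maxdist(root['rect'], RQ), 0, root)], 1
    while heap:
        _, _, e = heapq.heappop(heap)
        if prune(e['rect']):
            continue
        if not e.get('children'):
            S_cnd.append(e)                    # surviving data object
        else:
            for c in e['children']:            # push children by maxdist
                heapq.heappush(heap, (maxdist(c['rect'], RQ), counter, c))
                counter += 1
    return S_cnd
```

The counter in each heap tuple is only a tie-breaker so that entries with equal keys never compare the node dictionaries themselves.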
The data is stored in the system as follows: for each uncertain object, an R-tree containing the instances of the object is created and stored on disk. Each node of the R-tree contains the aggregate appearance probability of the instances in its subtree. We refer to these R-trees as the local R-trees of the objects. Another R-tree is created that stores the MBRs of all the uncertain objects; this R-tree is called the global R-tree.

Algorithm 3 outlines our approach, which consists of three phases, namely Shortlisting, Refinement and Verification. In the following sub-sections, we present the details of each of these three phases.

Let PR be the aggregate probability of the instances in any hyper-rectangle R. At this stage, PRcnd of a candidate object may be less than one, because Rcnd might have been trimmed during the shortlisting phase. We can prune Rcnd if the upper bound RNN probability of the candidate object, MaxProb = PRcnd, is less than ρ (line 3).

We use a max-heap that stores entries in the form (e, R, key), where e and R are hyper-rectangles containing instances of Q and Rcnd, respectively, and key is the maximum probability of the instances in R to be the RNNs of the instances in e (i.e., key = Pe · PRcnd). We initialize the heap by inserting (Q, Rcnd, MaxProb) (line 5).

Algorithm 5 : Refinement
Description:
1: for each Rcnd in Scnd do
2:    if (MaxProb = PRcnd) < ρ then
3:       Scnd = Scnd − Rcnd; continue;
4:    initialize a max-heap H containing entries in the form (e, R, key)
5:    insert (Q, Rcnd, MaxProb) into H
6:    while H is not empty do
7:       de-heap an entry (e, R, p)
8:       Rem = Prune(e, Sfil, R)
9:       MaxProb = MaxProb − p + (Pe · PRem)
10:      if MaxProb < ρ then
11:         Scnd = Scnd − Rcnd; break;
12:      if (PRem > 0) AND (e is an intermediate node or leaf) then
13:         for each child c of e do
14:            insert (c, Rem, (Pc · PRem)) into H

Fig. 14: Finding the range of the global query (the figure shows RQ, the candidate Rcnd with instances u1 and u2, the filtering objects R1 to R4, the per-instance ranges maxdist(u1, R1) and maxdist(u2, RQ), and the global range maxdist(u2, RQ) + dist(u2, c)).
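The MaxProb bookkeeping of Algorithm 5 (lines 6-14) can be sketched for a single candidate as follows. This is a rough sketch under simplifying assumptions: trim(e) stands in for Prune(e, Sfil, R) and returns only the surviving probability mass PRem, and the query R-tree is a dictionary of hypothetical nodes.

```python
import heapq

def refine(Q_root, PR_cnd, rho, trim):
    """Q_root: query R-tree node as {'name', 'prob': Pe, 'children': [...]}.
    The heap entry for a node carries p = Pe * PR; line 9 replaces that
    contribution by Pe * PRem, and the children later re-add their own
    (smaller) shares when they are de-heaped."""
    max_prob = PR_cnd
    heap = [(-(Q_root['prob'] * PR_cnd), 0, Q_root)]
    counter = 1
    while heap:
        neg_p, _, e = heapq.heappop(heap)
        PRem = trim(e)                            # line 8
        max_prob += neg_p + e['prob'] * PRem      # line 9 (neg_p == -p)
        if max_prob < rho:
            return None                           # candidate pruned (line 11)
        if PRem > 0:
            for c in e.get('children', []):
                heapq.heappush(heap, (-(c['prob'] * PRem), counter, c))
                counter += 1
    return max_prob                               # refined upper bound
```

Because the children's probabilities Pc sum to Pe, replacing a node's contribution by its children's contributions never double-counts any probability mass.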
For each de-heaped entry (e, R, p), we trim the hyper-rectangle R against e by Sfil and store the trimmed rectangle in Rem (line 8). The upper bound RNN probability MaxProb is updated to MaxProb − p + (Pe · PRem) (line 9). Recall that p = Pe · PR was inserted with this entry under the assumption that all instances in R are RNNs of all instances in e. After we trim R using e (line 8), we know that only the instances in Rem can be RNNs of e; that is the reason we subtract p from MaxProb and add (Pe · PRem). At any stage, if MaxProb < ρ, the candidate object can be pruned. Otherwise, an entry (c, Rem, (Pc · PRem)) is inserted into the heap for each child c of e. Note that if the trimmed hyper-rectangle does not contain any instance, then PRem is zero and we do not need to insert the children of e into the heap for such a Rem.

Consider the example of Fig. 14, where the ranges of the range queries centred at u1 and u2 are maxdist(u1, R1) and maxdist(u2, RQ), respectively (the circles with broken lines). We want to reduce the multiple range queries to a single range query, centred at the centre of Rcnd, with a global range r such that all instances required to compute the RNN probability of every candidate instance ui ∈ Rcnd are returned. Let ri be the range of the range query of ui, computed as described above. The global range is r = max(ri + dist(ui, c)) over every ui ∈ Rcnd, where c is the centre of Rcnd. In the example of Fig. 14, the global range is r = maxdist(u2, RQ) + dist(u2, c), as shown in the figure (the solid circle). Note that this range ensures that all the instances required to compute the RNN probabilities of both u1 and u2 lie within it.

4.3.2 Computing the exact RNN probability of Rcnd

We issue a range query on the global R-tree with the range r as computed above.
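The reduction to one global query described above can be sketched as follows (a hypothetical helper; dist is the plain Euclidean distance):

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def global_range(per_instance_ranges, instances, centre):
    """r = max_i (r_i + dist(u_i, c)): any point within r_i of u_i is also
    within r_i + dist(u_i, c) of the centre c of Rcnd, so one query of
    radius r around c covers every per-instance range query."""
    return max(r_i + dist(u, centre)
               for r_i, u in zip(per_instance_ranges, instances))
```

For two instances at (0, 0) and (3, 0) with per-instance ranges 2.0 and 1.0 and centre (1, 0), the global range is max(2 + 1, 1 + 2) = 3.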
For each returned object Ui, we issue a range query on the local R-tree of Ui to get the instances that lie within the range, and we create a list Li containing all these instances. We sort the entries in each list Li in ascending order of their distances from ucnd. Recall that every node in a local R-tree stores the aggregate appearance probability of all the instances in its sub-tree, which makes the computation of the aggregate probabilities cheaper.

The list LQ for the instances of the query object Q is shown in Fig. 15. Each entry e contains two values (d, p), where d is the distance of e from ucnd and p is the appearance probability of the instance e. The lists for the other objects are slightly different in that each entry e contains two values (d, P), where P is the accumulative appearance probability of all the instances up to and including e in the list. In other words, the entry found by a binary search for the largest d smaller than a given distance yields, as its P, the total appearance probability of all the instances (in this list) closer than that distance.

Given these lists, we can quickly find the accumulative appearance probability of all the instances of any uncertain object that lie closer to ucnd than a query instance qi. The example below illustrates the computation of the exact probability of a candidate instance ucnd.

4.3 Verification

The actual probability of a candidate object Rcnd to be the RNN of Q is the sum of the probabilities of every instance ui ∈ Rcnd to be the RNN of every instance q of Q. To compute the probability of an instance ui to be the RNN of q, we have to find, for each uncertain object U, the accumulative appearance probability of its instances that have a smaller distance to ui than dist(q, ui) (Equation (2)). A straightforward approach is to issue a range query for every ui ∈ Rcnd, centred at ui with the range set to dist(q, ui), and then compute the accumulative appearance probability of the returned instances of each object. However, this approach requires |Q| × |Rcnd| range queries, where |Q| and |Rcnd| are the numbers of instances in Q and Rcnd, respectively.
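The (d, P) lists and the binary-search lookup described above can be sketched as follows, using the data of Fig. 15 for the objects A, B and C (the list contents are assumptions taken from that figure); the product over the lists mirrors the Equation (2) computation of Example 4:

```python
from bisect import bisect_left

def build_list(dist_prob_pairs):
    # Entries (d, P): sorted by distance d from u_cnd, with P the running
    # total of the appearance probabilities up to and including the entry.
    acc, out = 0.0, []
    for d, p in sorted(dist_prob_pairs):
        acc += p
        out.append((d, acc))
    return out

def closer_mass(lst, d_q):
    # Probability mass of the instances strictly closer to u_cnd than d_q:
    # binary search for the last entry whose distance is smaller than d_q.
    i = bisect_left([d for d, _ in lst], d_q)
    return 0.0 if i == 0 else lst[i - 1][1]

def rnn_given_both_appear(d_q, other_lists):
    # No instance of any other object may be closer to u_cnd than d_q.
    prob = 1.0
    for lst in other_lists:
        prob *= 1.0 - closer_mass(lst, d_q)
    return prob

# Lists of Fig. 15 as (distance, appearance-probability) pairs:
A = build_list([(0.1, 0.3), (0.4, 0.2), (0.6, 0.2)])
B = build_list([(0.1, 0.2), (0.2, 0.2), (0.4, 0.3), (0.6, 0.2), (0.6, 0.1)])
C = build_list([(0.4, 0.5), (0.4, 0.5)])
```

Here closer_mass(B, 0.3) returns 0.4 and rnn_given_both_appear(0.3, [A, B, C]) returns (1 − 0.3)(1 − 0.4)(1 − 0) = 0.42, matching Example 4.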
Below, we present an efficient approach that issues only one global range query to compute the exact RNN probability of a candidate object.

4.3.1 Finding the range of the global range query

Let Rfil be an MBR containing the instances of a filtering object. An instance ui has zero probability to be the RNN of an instance q if dist(ui, q) > maxdist(ui, Rfil). So the range of the range query for ui, centred at ui, is the minimum of maxdist(ui, RQ) and maxdist(ui, Rfil) over every Rfil in Sfil.

Fig. 15: Lists sorted on the distance from a candidate instance ucnd.
  Q: q2(0.3, 0.2), q1(0.3, 0.3), q3(0.5, 0.3), q4(0.6, 0.1), q5(0.7, 0.1)
  A: a3(0.1, 0.3), a2(0.4, 0.5), a6(0.6, 0.7)
  B: b1(0.1, 0.2), b4(0.2, 0.4), b2(0.4, 0.7), b3(0.6, 0.9), b5(0.6, 1.0)
  C: c1(0.4, 0.5), c2(0.4, 1.0)

Fig. 16: Bounding the lower and upper bound RNN probabilities (the figure shows RQ and Rcnd, the instances a1 to a4 and q1, and the distances mindist(Rcnd, RQ), maxdist(Rcnd, RQ), mindist(Rcnd, q1) and maxdist(Rcnd, q1)).

Example 4: Fig. 15 shows the lists of the query object Q and of three uncertain objects A, B and C. The lists are sorted on their distances from the candidate instance ucnd. We start the computation from the first entry q2 in Q and compute RNNq2(ucnd). The distance dq2 is 0.3. We do a binary search on A, B and C to find the entry in each list with the largest d smaller than dq2. Such entries are a3(0.1, 0.3) and b4(0.2, 0.4) in the lists A and B, respectively; no instance is found in C. Hence, the sum of the appearance probabilities of the instances of B that have a distance from ucnd smaller than dq2 is 0.4; similarly, for A it is 0.3. Given that both q2 and ucnd appear in a world, the probability of ucnd to be the RNN of q2 is obtained from Equation (2) as (1 − 0.4)(1 − 0.3) = 0.42. The probability of ucnd to be the RNN of q2 in any possible world is therefore 0.42 · (pq2 × pucnd).
Similarly, the next entry in Q is processed and RNNq1(ucnd) is computed, which is again 0.42 because its distance from ucnd is the same. RNNq3(ucnd) is zero because the binary search on C gives an entry (d, P) where P = 1 (all instances of C have a smaller distance to ucnd than dq3). Note that we do not need to compute the RNN probabilities of ucnd against the remaining instances q4 and q5, because their distances from ucnd are larger than dq3 and RNNq3(ucnd) = 0. Also, the area to be searched in any list Li (by binary search) becomes smaller with the processing of each next query instance. ∎

The above example illustrates the probability computation of an instance ucnd to be the RNN of all instances in Q. We repeat this for every instance ucnd ∈ Rcnd to compute the RNN probability of the candidate object. Next, we present some optimizations that improve the efficiency of the verification phase.

Optimizations

Our proposed optimizations bound the minimum and maximum RNN probabilities, verify the objects whose minimum probability is greater than or equal to the threshold, and delete the objects whose maximum probability is less than the threshold. Below, we present the details of the proposed optimizations.

a) Bounding RNN probabilities using RQ:
Recall that, for each candidate object Rcnd, a global range query is issued, and for each object Ui within the range a list Li is created containing the instances of Ui lying within the range. Just before we sort these lists, we can approximate the maximum and minimum RNN probabilities of the candidate object based on the following observations. Let c be the centre and d be the diagonal length of Rcnd, and let ai be some instance in list A. Every ucnd ∈ Rcnd is always closer to ai than every qi ∈ Q if mindist(Rcnd, RQ) > dist(ai, c) + d/2. Similarly, every ucnd would always be further from ai than every qi ∈ Q if maxdist(Rcnd, RQ) < dist(ai, c) − d/2. Consider the example of Fig. 16: every point in Rcnd is always closer to a1 than any point in RQ; similarly, every point in Rcnd is always further from a2 than it is from any point in RQ.

Based on the above observations, for every object we can accumulate the appearance probabilities of all the instances that are always closer to (or further from) ucnd than every qi. More specifically, we traverse each list Li and accumulate the appearance probabilities of every instance ui for which mindist(Rcnd, RQ) > dist(ui, c) + d/2, and store the accumulated probability in Pnear_i. Similarly, the accumulated appearance probability of every instance uj for which maxdist(Rcnd, RQ) < dist(uj, c) − d/2 is stored in Pfar_i. Then the maximum RNN probability of any instance ucnd is pmax_cnd = Π_{∀Li} (1 − Pnear_i). The minimum probability of any instance ucnd to be the RNN of Q is pmin_cnd = Π_{∀Li} (Pfar_i), because Pfar_i is the total probability of the instances that are definitely farther; assuming that all other instances are closer to ucnd than qi gives the minimum RNN probability.

Let PRcnd be the aggregate appearance probability of all the instances in Rcnd; then Rcnd can be pruned if PRcnd · pmax_cnd < ρ. Similarly, the object can be reported as an answer if PRcnd · pmin_cnd ≥ ρ.

b) Bounding RNN probabilities using instances of Q:
If an object Rcnd cannot be pruned or verified as a result at this stage, we try to make a better estimate of pmin_cnd and pmax_cnd by using the instances within Q. Note that every ucnd ∈ Rcnd is always closer to ai than a query instance qi if mindist(Rcnd, qi) > dist(ai, c) + d/2. Similarly, every ucnd would always be further from ai than qi if maxdist(Rcnd, qi) < dist(ai, c) − d/2. Consider the example of Fig. 16, where every point in Rcnd is closer to both a1 and a4 than q1. Similarly, every point in Rcnd is further from both a2 and a3 than it is from q1.

To update pmax_cnd, we first sort every list in ascending order of dist(c, u), where dist(c, u) is already known (returned by the global range query). Then, the list LQ is sorted in ascending order of mindist(Rcnd, qi). For each qi in ascending order, we conduct a binary search on every list Li and find the entry e(d, P) with the greatest d in the list that is less than mindist(Rcnd, qi) − d/2. The probability P of this entry is the accumulated appearance probability Pnear_i of all the instances ai such that every ucnd is always closer to ai than qi. Then the maximum probability of any instance ucnd ∈ Rcnd to be the RNN of qi is pmax_i,cnd = Π_{∀Li} (1 − Pnear_i), and pmax_cnd = Σ_{∀qi∈Q} pqi · pmax_i,cnd. We do such binary searches for every qi ∈ Q.

The update of pmin_cnd is similar, except that the list LQ is sorted in ascending order of maxdist(Rcnd, qi), and the binary search is conducted to find the entry e(d, P) with the greatest d that is smaller than maxdist(Rcnd, qi) + d/2. The total appearance probability of all the instances in Li that are always farther from every ucnd than qi is Pfar_i = (1 − P). Finally, pmin_i,cnd = Π_{∀Li} (Pfar_i) and pmin_cnd = Σ_{∀qi∈Q} pqi · pmin_i,cnd. After updating pmax_cnd and pmin_cnd, we delete the candidate objects for which PRcnd · pmax_cnd < ρ. Similarly, a candidate object is reported as an answer if PRcnd · pmin_cnd ≥ ρ.

The appearance probabilities of the instances were generated following either a uniform or a normal distribution. Our default synthetic dataset contains approximately 1.8 million instances (6000 × 600). Similar to [19], the query object follows the same distribution as the underlying dataset.

The real dataset6 consists of 28483 zip codes obtained from 40 states of the United States. Each zip code represents an object, and the address blocks within each zip code are the instances. The data source provides address ranges instead of individual addresses, and we use the term address block for a range of addresses along a road segment.
The address block is an instance in our dataset; it lies at the middle of the road segment, with its appearance probability calculated as follows: let n be the total number of addresses in a zip code and m be the number of addresses in the current address block; then the appearance probability of the current address block is m/n. The real dataset consists of 11.24 million instances, and the maximum number of instances (address blocks) in an object (Sanford, North Carolina) is 5918.

c) Early stopping:
If an object Rcnd is not pruned by the above-mentioned estimation of the maximum and minimum RNN probabilities, we have to compute the exact RNN probabilities (as described in Section 4.3.2) of the instances in it. By using the maximum and minimum RNN probabilities, it is possible to verify or invalidate an object without computing the exact RNN probabilities of all the instances. We achieve this as follows: we sort all the instances in Rcnd in descending order of their appearance probabilities. Assume that we have computed the exact RNN probability RNNQ(u) of the first i instances. Let P be the aggregate appearance probability of these first i instances and PRNN be the sum of their RNNQ(u). At any stage, an object can be verified as an answer if PRNN + (1 − P) · pmin_cnd ≥ ρ. Similarly, an object can be pruned if PRNN + (1 − P) · pmax_cnd < ρ. Note that (1 − P) · pmin_cnd is the minimum probability for the rest of the instances to be the RNNs of Q; similarly, (1 − P) · pmax_cnd is the maximum probability for the remaining instances to be the RNNs.

5.1 Comparison with other possible solutions

We devise a naïve algorithm and a sampling-based approximate algorithm to better understand the performance of our algorithm. More specifically, in the naïve algorithm, we first shortlist the objects using our pruning rule 4 (e.g., an object Rcnd can be pruned if mindist(Rcnd, RQ) > maxdist(Rcnd, Rfil)). Then, we verify the remaining objects as follows.
For each pair (ui, qi), we issue a range query centred at ui with range dist(ui, qi) and compute the RNN probability of the instance ui against the query instance qi using Equation (2). Finally, Equation (1) is used to compute the RNN probability of the object.

In the sampling-based approach, we create a few sample possible worlds before starting the computation. More specifically, a possible world is created by randomly selecting one instance from each uncertain object. For each possible world, we create an R-tree (node size 2K) that stores the instances of the possible world. This reduces the problem of finding probabilistic RNNs to that of finding conventional RNNs. For each possible world, we compute the RNNs using TPL [20], which is the best-known RNN algorithm for multidimensional data. Let n be the number of possible worlds evaluated and m be the number of possible worlds in which an object Rcnd is returned as an RNN; then Rcnd is reported as an answer if m/n ≥ ρ. The costs shown do not include the time taken in creating the possible worlds. Note that this algorithm provides only approximate results; for the real dataset, its accuracy varies from 60% to 75%.

5 EXPERIMENT RESULTS

In this section, we evaluate the performance of our proposed approach. All the experiments were conducted on an Intel Xeon 2.4 GHz dual-CPU machine with 4 GBytes of memory. The node size of each local R-tree is 1K and that of the global R-tree is 2K. We measured both the I/O and CPU time; the I/O cost is around 1-5% of the total cost in all experiments. Hence, for clarity of the experiment figures, we display the average total cost per query. We used both synthetic and real datasets.

TABLE 2: System Parameters
  Parameter — Range
  Probability threshold (ρ) — 0.1, 0.3, 0.5, 0.7, 0.9
  Number of objects (×1000) — 2, 4, 6, 8, 10
  Maximum number of instances in an object — 200, 400, 600, 800, 1000
  Maximum width of hyper-rectangle — 1%, 2%, 3%, 4%
  Distribution of object centres — Uniform, Normal
  Distribution of instances — Uniform, Normal
  Appearance probability of instances — Uniform, Normal

Table 2 shows the specifications of the synthetic datasets used in our experiments; the default values are shown in bold. First, the centres of the uncertain objects were created (following a uniform or normal distribution), and then the instances of each object (again uniform or normal) were created within their respective hyper-rectangles. The width of the hyper-rectangle in each dimension was set from 0 to w% (following a uniform distribution) of the whole space, and we conducted experiments with w varied from 1 to 4.

The naïve algorithm appeared to be too slow (average query time from 7 minutes to 2 hours), so we show its computation time only when comparing our verification phase in Fig. 18. Fig. 17 compares our approach with the sampling-based approximate approach (for 100 and 200 possible worlds) on the synthetic dataset. In two-dimensional space, our algorithm is comparable with the sampling algorithm, which returns only approximate answers. The figure also shows that our algorithm is more efficient for higher dimensions and scales

6. http://www.census.gov/geo/www/tiger/
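The sampling-based competitor of Section 5.1 can be sketched as follows. This is a toy illustration: rnn_of_world stands in for an exact RNN algorithm such as TPL, and the per-world R-tree construction is omitted.

```python
import random

def sample_possible_worlds(objects, rho, rnn_of_world, n_worlds=100):
    """objects maps a name to (instances, appearance_probabilities).
    Each possible world picks one instance per uncertain object according to
    its appearance probabilities; an object is reported as an answer if it
    is an RNN in at least a fraction rho of the sampled worlds (m/n >= rho)."""
    hits = {}
    for _ in range(n_worlds):
        world = {name: random.choices(insts, weights=probs)[0]
                 for name, (insts, probs) in objects.items()}
        for name in rnn_of_world(world):
            hits[name] = hits.get(name, 0) + 1
    return {name for name, m in hits.items() if m / n_worlds >= rho}
```

As the paper notes, this yields only an approximate answer; on the real dataset its accuracy was reported between 60% and 75%.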
Description: