Diversified Spatial Keyword Search On Road Networks

Chengyuan Zhang†, Ying Zhang‡,†, Wenjie Zhang†, Xuemin Lin♯,†, Muhammad Aamir Cheema§,†, Xiaoyang Wang†
† The University of New South Wales, ‡ QCIS, University of Technology, Sydney
♯ Shanghai Key Laboratory of Trustworthy Computing, East China Normal University
§ Clayton School of Information Technology, Monash University
{zhangc, yingz, zhangw, lxue, xiaoyangw}@cse.unsw.edu.au, [email protected]

(c) 2014, Copyright is with the authors. Published in Proc. 17th International Conference on Extending Database Technology (EDBT), March 24-28, 2014, Athens, Greece: ISBN 978-3-89318065-3, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0

ABSTRACT
With the increasing pervasiveness of geo-positioning technologies, there is an enormous amount of spatio-textual objects available in many applications such as location based services and social networks. Consequently, various types of spatial keyword searches which explore both locations and textual descriptions of the objects have been intensively studied by the research communities and commercial organizations. In many important applications (e.g., location based services), the closeness of two spatial objects is measured by the road network distance. Moreover, result diversification is becoming a common practice to enhance the quality of the search results. Motivated by the above facts, in this paper we study the problem of diversified spatial keyword search on road networks, which considers both the relevance and the spatial diversity of the results. An efficient signature-based inverted indexing technique is proposed to facilitate spatial keyword query processing on road networks. Then we develop an efficient diversified spatial keyword search algorithm by taking advantage of spatial keyword pruning and diversity pruning techniques. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.

1. INTRODUCTION
With advances in geo-positioning technologies, there is a rapidly growing amount of spatio-textual objects collected in many applications such as location based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). For instance, in local search services, an online business directory (e.g., yellow pages) provides the location information as well as short descriptions of the businesses (e.g., hotels, restaurants). Consequently, the study of spatial keyword search, which explores both the location and the textual description of the objects, has attracted great attention from commercial organizations and research communities.

Motivation. Due to the massive amount of spatio-textual objects in many important applications, various spatial keyword query models, query processing techniques, and indexing mechanisms have emerged in recent years (surveyed in [6, 2]) such that users can effectively exploit both spatial and textual information of these spatio-textual objects. Most of the existing work focuses on Euclidean space. However, in practice, the road network distance (cost) is employed in many key applications (e.g., location based services) of spatial keyword search. Therefore, it is critical to develop efficient indexing techniques and query algorithms to support spatial keyword search on road networks. In this paper, we focus on the boolean spatial keyword search on road networks, which aims to find a set of objects each of which contains all query keywords and is close to the query in terms of road network distance (cost).

Moreover, it has been widely recognized [1] that the usefulness of a retrieved object depends not only on its relevance to the query (i.e., distance and keyword constraint) but also on other objects in the results. Intuitively, the retrieved objects should be dissimilar to each other (i.e., diversified) since in some scenarios it is less interesting for users to retrieve two highly similar objects at the same time. In the context of spatio-textual objects, it is reported in [20, 19] that users have a strong preference for spatially diversified results; that is, the pairwise distance (i.e., dissimilarity) between two objects in the result should be reasonably large. Nevertheless, to the best of our knowledge, there is no existing work on diversified spatial keyword search on road networks.

Below is a motivating example in which both the relevance and the spatial diversity of the results are considered.

[Figure 1: Online Yellow Pages Map. Restaurants p1 to p9 and the query point q on a road map; each restaurant is labelled with its keywords (pancake, lobster, king crab, pizza, coffee, steak, sushi).]

Example 1 (Motivation). In Fig. 1, there is a set of restaurants in Sydney CBD whose locations (represented by squares) and service lists (a set of keywords) are registered in the online yellow pages map of a local search service provider. Suppose a tourist with a GPS-enabled smartphone wants to: 1) have a nice dinner and then 2) visit nearby attractions or shops after the dinner. Assume she decides to enjoy the lobster and the pancake, which are famous foods in Sydney, but has no idea what kinds of attractions or shops to explore until a set of candidates near the restaurants is readily available. This implies that she expects to see a limited number (say k = 2) of restaurants, denoted by S, which are close to her current location (e.g., the point q in Fig. 1), and each of which serves both lobster and pancake. Meanwhile, these k restaurants should be well spatially distributed over the area in the sense that a reasonable number of attractions or shops within walking distance of the restaurants can be considered for her post-dinner activity. Intuitively, although p1 and p2 in Fig. 1 are the two closest restaurants each of which serves both pancake and lobster, S1 = {p1, p2} is not a good result because the two restaurants are very close to each other, and the attractions or shops close to p1 are highly likely to be reported by p2 as well. In this scenario, S2 = {p1, p4} might be a better choice since it provides more attractions or shops for consideration with a slight sacrifice in relevance (i.e., closeness) compared with S1.

Motivated by the above example, in this paper we study the problem of diversified spatial keyword search on road networks. Given a set of spatio-textual objects in a road network, a spatial keyword query consists of a location and a set of query keywords, and aims to retrieve nearby objects each of which contains all query keywords. Moreover, we also consider the spatial diversity of the results. In order to meet a user's preference on the relevance and the spatial diversity of the results, we employ a bi-criteria objective function f which effectively combines both aspects. Specifically, we aim to find a set S of objects with |S| = k such that each object in S contains all query keywords and f(S) is maximized. In this paper, we adopt the popular max-sum diversification function [12], which is formally defined in Section 2.1, based on the network distances of the objects to the query location (relevance) and the pairwise network distances among objects (spatial diversity) in S.

Challenges. The challenges for the problem of diversified spatial keyword search on road networks are two-fold.

Firstly, an effective indexing technique is required for diversified spatial keyword queries on road networks due to the large size of road networks and spatio-textual objects in many applications. As shown in our empirical study, the performance of the algorithm is poor if we simply keep textual information with objects in the existing road network indexing structure, because a large number of irrelevant objects may be loaded before we apply the keyword based pruning. Similarly, since most of the existing spatial keyword indexing techniques are proposed in Euclidean space and are independent of the underlying road network structure, it is costly to conduct the spatial keyword based pruning during the network expansion. We adopt the inverted indexing technique on the edges of the road network to significantly improve the performance of spatial keyword search, since objects which do not contain any query keyword can be immediately excluded from computation due to the nature of the inverted indexing technique. Observing that the performance of the inverted indexing technique is affected by the number of false hits (i.e., the I/Os invoked by objects containing only part of the query keywords), we propose the signature-based inverted indexing technique such that a small summary is built for each keyword and the number of false hits can be significantly reduced by exploiting the AND semantics. We further enhance the pruning capability of the signature technique by partitioning objects on the same edge.

Secondly, it is a challenge to develop novel techniques to prune objects based on the spatial keyword constraints (i.e., distance based and keyword based constraints) and the objective function at the same time. As shown in our empirical study, it is costly if we first retrieve all spatio-textual objects which satisfy the spatial keyword constraints and then apply an existing diversification algorithm. This is because many non-promising objects may be loaded for the diversification computation, and the pairwise network distance computation on road networks is expensive. This motivates us to develop an incremental diversified spatial keyword search algorithm so that the spatial keyword pruning and diversity pruning techniques can be seamlessly integrated and hence significantly reduce the overall cost.

Contributions. Our main contributions can be summarized as follows.
• We formally define the problem of diversified spatial keyword search on road networks.
• We develop an efficient signature-based inverted indexing technique as well as an efficient incremental network expansion algorithm for spatial keyword search on road networks. We further propose a partition based method to enhance the effectiveness of the signature technique.
• An effective incremental diversified spatial keyword search algorithm is proposed based on the spatial keyword pruning and diversity based pruning techniques.
• Comprehensive experiments on real and synthetic datasets demonstrate the effectiveness and efficiency of our methods.

Roadmap. The rest of the paper is organized as follows. Section 2 formally defines the problem of diversified spatial keyword search on road networks, followed by some preliminary work. Section 3 introduces the signature-based inverted indexing technique as well as an efficient spatial keyword search algorithm on road networks. An incremental diversified spatial keyword search algorithm is developed in Section 4. Experimental results are reported in Section 5. Section 6 introduces related work, and Section 7 concludes the paper.

2. PRELIMINARY
In this section, we first formally define the problem of diversified spatial keyword search on road networks in Section 2.1. Section 2.2 introduces the disk-based road network data structure, and Section 2.3 presents a general greedy based diversification algorithm. Table 1 summarizes the mathematical notations used throughout this paper.

Table 1: The summary of notations.
Notation        Definition
o (q)           a spatio-textual object (query)
G               a road network
q.T             a set of query keywords for query q
V, |V|          vocabulary, size of the vocabulary
n               a node in the road network
e, (n1, n2)     an edge, an edge with two end-nodes n1 and n2
m               the number of objects on an edge
δ(o1, o2)       the network distance (cost) between o1 and o2
δ_T             network distance threshold in the network expansion
δ_max           maximal network distance in the search
θ(o1, o2)       diversification distance between o1 and o2
θ_T             diversification distance threshold
CP (CO)         core pairs (objects)

2.1 Problem Definition
Road Networks.
In this paper, a road network is modeled as a weighted graph G = (N, E, W) where a road node n ∈ N represents a road intersection, an edge e = (n1, n2) ∈ E corresponds to a road segment which connects two road nodes n1 and n2, and a non-negative weight w(n1, n2) ∈ W stands for the cost (e.g., distance or travel time) associated with the road segment. We assume each edge is bi-directional. Whenever there is no ambiguity, we assume the end-node n1 has a smaller id than n2, where n1 is called the reference node of the edge. Let p be a spatial point lying on the edge (n1, n2); we assume the cost from the node n1 to p, denoted by w(n1, p), is proportional to the distance between them¹. Clearly, we have w(n1, p) + w(p, n2) = w(n1, n2) for an edge (n1, n2) and a point p on the edge. Then for two given points u and v in the road network, we use δ(u, v) to represent the network distance (cost) between u and v, which is the sum of the edge weights along the least costly path from u to v. Note that the least costly path corresponds to the shortest path if the weight represents the distance of an edge. It is immediate that we have δ(u, v) = δ(v, u). Given a point p lying on the edge (n1, n2), the following equation shows how to derive the network distance between p and a query point q when q does not lie on (n1, n2).

$$\delta(q,p) = \min\big(\delta(q,n_1) + w(n_1,p),\ \delta(q,n_2) + w(n_2,p)\big) \qquad (1)$$

Note that we have δ(q, p) = w(q, p) if both q and p lie on the same edge.

¹ We have w(n1, p) = w(n1, n2) × d(n1, p) / d(n1, n2), where d(n1, n2) represents the distance (length) between the two end-nodes n1 and n2, and d(n1, p) corresponds to the distance (length) between node n1 and the point p along the edge (n1, n2).

Spatio-textual object. A spatio-textual object o is described by a spatial point in a 2-dimensional space and a set of keywords (terms) from a vocabulary V, denoted by o.loc and o.T respectively. For presentation simplicity, we assume objects lie along the edges (i.e., road segments) of a road network G. Hereafter, whenever there is no ambiguity, "spatio-textual object" is abbreviated to "object" and o (q) is used to represent its location o.loc (q.loc).

[Figure 2: Example of Diversified SK Search. A road network with nodes n1 to n7, labelled edge lengths, the query q(t1, t2), and the objects o1(t1, t2), o2(t1, t2), o3(t2), o4(t1, t3), o5(t1), o6(t1), o7(t3), o8(t1, t2), o9(t1, t2). Legend: spatio-textual object, road intersection (node), road segment (edge).]

Example 2. Fig. 2 illustrates an example of the road network G and a set of spatio-textual objects O. There are 7 nodes, 8 edges and 9 spatio-textual objects, where the keywords of each object are listed (e.g., object o1 contains keywords t1 and t2) and the vocabulary V is {t1, t2, t3}. We also label the distance (length) of each edge (e.g., d(n1, n4) = 10). For presentation simplicity, we use the distance as the weight of each edge, i.e., w(n1, n4) = d(n1, n4) = 10. In the example, we have δ(q, o1) = 10, δ(q, o2) = 12, δ(q, o8) = 15, δ(o1, o2) = 2, δ(o1, o8) = 25, and δ(o2, o8) = 27.

SK Query on Road Networks. Below is a formal definition of the spatial keyword (SK) query on road networks.

Definition 1 (SK Query). Given a road network G, a set O of spatio-textual objects, a query point q which is also a spatio-textual object, and a network distance δ_max, a spatial keyword query retrieves objects each of which contains all query keywords of q and is within network distance δ_max from q; that is, we retrieve objects o ∈ O with δ(o, q) ≤ δ_max and q.T ⊆ o.T. We use SK(O, q, δ_max) to record the objects retrieved by the above SK query.

For presentation simplicity, we use spatial keyword constraint to denote the distance constraint and the keyword constraint. We say that an object satisfies the spatial keyword constraint if it is within network distance δ_max from q and contains all query keywords.

Example 3. In Fig. 2, given a query q with q.T = {t1, t2} and δ_max = 20, we have SK(O, q, δ_max) = {o1, o2, o8}. Note that, although o9 contains both t1 and t2, it is excluded because of the network distance constraint. Meanwhile, objects o3, o4 and o5 are eliminated due to the query keyword constraint although they are within network distance 20 from q.

Bi-criteria objective function (f). Given a subset S of objects with |S| = k, we use the popular max-sum diversification function [12] as the bi-criteria objective function, denoted by f, where the relevance of S (Rel(S)) is measured by the network distances of the objects to the query and the diversity of S (Div(S)) is captured by their pairwise network distances.

$$f(S) = \lambda \times Rel(S) + (1-\lambda) \times Div(S) = \frac{\lambda}{k} \sum_{u \in S} \left(1 - \frac{\delta(u,q)}{\delta_{\max}}\right) + \frac{1-\lambda}{k(k-1)\,\delta_{\max}} \sum_{u,v \in S} \delta(u,v) \qquad (2)$$

Here, λ (0 ≤ λ ≤ 1) is a parameter specifying the trade-off between the relevance and the diversity, and the second sum ranges over the k(k−1)/2 unordered pairs of objects in S. We assume larger values are preferred in this paper. So the relevance score of an object u, denoted by Rel(u), is measured by 1 − δ(u, q)/δ_max, and Rel(S) = (1/k) Σ_{u∈S} Rel(u), matching the first term of Equation 2. Observe that there are k objects and k(k−1)/2 pairs of objects in S; together with the fact that δ(u, q) ≤ δ_max and δ(u, v) ≤ 2δ_max for objects in the results of the SK query, we have 0 ≤ Rel(S) ≤ 1 and 0 ≤ Div(S) ≤ 1, where a larger Rel(S) (Div(S)) value represents higher relevance (diversity) of S. As shown in [13], the results are usually expected to be diverse enough without sacrificing relevance for 0.5 ≤ λ ≤ 0.9.

Problem Statement. In this paper, we investigate the problem of diversified spatial keyword search on road networks. Given a road network G, a set O of spatio-textual objects, a query object q, a distance δ_max, a bi-criteria objective function f, and a natural number k, we aim to find a set of objects S ⊆ SK(O, q, δ_max) such that |S| = k and f(S) is maximized. In this paper, ties are broken arbitrarily.

Example 4. In Fig. 2, given a diversified SK query q with q.T = {t1, t2}, δ_max = 20, k = 2 and λ = 0.6, objects {o1, o8} will be retrieved. More specifically, since there are three objects satisfying the spatial keyword constraint in Example 3, we aim to choose a set S with |S| = 2 such that f(S) is maximized. There are three possible solutions, where S1 = {o1, o2}, S2 = {o1, o8} and S3 = {o2, o8}. When λ = 0.6, we have f(S1) = 0.29, f(S2) = 0.475, and f(S3) = 0.465; for instance, f(S1) = (0.6/2) × (0.5 + 0.4) + (0.4/(2 × 1 × 20)) × 2 = 0.27 + 0.02 = 0.29. Therefore, we have S = {o1, o8}. S turns to {o1, o2} when λ = 0.9, which puts a high priority on the network distances of the objects from q.

2.2 Disk-based Road Network Presentation
We adopt the popular connectivity-clustered access method (CCAM) [18] to represent the road network G, which effectively organizes the adjacency lists of the road nodes so that we can take advantage of the access locality and reduce the I/O costs during query processing on road networks. Nodes of the road network are sorted by their Z-ordering according to their spatial locations. Moreover, in [18] the network is partitioned into groups by recursively applying a two-way-partition method until the nodes' adjacency lists of each group can fit into one page. For the adjacency list of a node n_i, the information of the edges is stored, including end-nodes, distance, and weight. We also build a network R-tree [16] to organize the minimal bounding rectangles (MBRs) of the edges of the network. For a given object, we may identify its corresponding edges by utilizing the MBRs of the edges in a branch-and-bound fashion.

2.3 Greedy Algorithm for Diversification
We define a diversification distance between two objects u and v, denoted by θ(u, v), against the road network G as follows.

$$\theta(u,v) = \lambda \left(2 - \frac{\delta(u,q)}{\delta_{\max}} - \frac{\delta(v,q)}{\delta_{\max}}\right) + \frac{1-\lambda}{\delta_{\max}}\,\delta(u,v) \qquad (3)$$

where θ(u, v) records the relevance and the diversity for a pair of objects u and v in S. Since we have

$$\frac{\lambda}{k} \sum_{u \in S} \left(1 - \frac{\delta(u,q)}{\delta_{\max}}\right) = \frac{\lambda}{k(k-1)}\,(k-1) \sum_{u \in S} \left(1 - \frac{\delta(u,q)}{\delta_{\max}}\right) = \frac{\lambda}{k(k-1)} \sum_{u,v \in S} \left(1 - \frac{\delta(u,q)}{\delta_{\max}} + 1 - \frac{\delta(v,q)}{\delta_{\max}}\right)$$

the bi-criteria objective function f in Equation 2 can be rewritten as follows.

$$f(S) = \frac{1}{k(k-1)} \sum_{u,v \in S} \theta(u,v) \qquad (4)$$

where f(S) is, up to a constant factor, the average pairwise diversification distance of the objects in S.

Algorithm 1: Diversified SK query (P, k)
Input  : P: a set of objects satisfying the spatial keyword constraint
Output : S: a set of diversified objects with size k
1 S := ∅;
2 for i := 1 to ⌊k/2⌋ do
3     Find the pair u, v ∈ P with argmax_{u,v∈P} θ(u, v);
4     S := S ∪ {u, v}; P := P \ {u, v};
5 add an arbitrary object u ∈ P to S if k is odd;
6 return S

As shown in [12], the problem of finding the maximal f(S) in Equation 4 is NP-hard and there is a greedy algorithm which provides a 2-approximation of the optimal solution. Algorithm 1 illustrates the details of the greedy algorithm, where we assume the objects retrieved by the SK query and their pairwise diversification distances are readily available. In each iteration (Lines 2-4), a pair of objects u and v with the longest diversification distance is chosen, and they are not considered in the following computation. Note that we randomly pick one more object from the remaining objects if k is odd.

3. SK SEARCH ON ROAD NETWORKS
To support the diversified SK search, we need to develop an efficient algorithm to retrieve objects which satisfy the spatial keyword constraint (i.e., keyword constraint and network distance constraint). Section 3.1 proposes a signature-based inverted index structure to organize objects such that many non-promising objects can be pruned at a cheap cost. Section 3.2 presents an efficient SK search algorithm based on the network expansion. Section 3.3 improves the effectiveness of our indexing technique by partitioning objects on the same edge.

3.1 Signature-Based Inverted Index
A large number of irrelevant objects may be loaded if we simply store objects together with their corresponding edges in the CCAM structure introduced in Section 2.2. Therefore, in addition to the CCAM structure, which effectively captures the topology of the road network for the network expansion, it is desirable to utilize another indexing structure to organize the spatio-textual objects. In this paper, we adopt the popular inverted indexing technique to organize objects. Specifically, for each keyword t, objects containing t are kept with their corresponding edges, which are maintained by a B+ tree where the key of an edge is the Z-ordering code of its center point. We also keep the offset distance of the object to the reference end-node of the edge. Recall that we can derive w(n1, o) and w(n2, o) for an object o based on d(n1, n2), d(n1, o) and w(n1, n2) if o lies on the edge (n1, n2).

A nice property of the inverted indexing structure is that, for a given query q, only the objects containing at least one query keyword will be involved in the search. Nevertheless, many objects which do not contain all query keywords may also be loaded. This may seriously deteriorate the performance, especially when the number of query keywords is not small. Therefore, we further improve the I/O efficiency by building signatures of the edges and then exploiting the AND semantics of the keyword constraint. Intuitively, we can use I(e, t) to represent the signature of an edge e, where I(e, t) = 1 indicates that there is at least one object with keyword t lying on edge e, and I(e, t) = 0 otherwise. Clearly, we do not need to explore an edge e if there is a query keyword t with I(e, t) = 0, and hence we reduce the I/O cost because the signatures of the edges are usually much smaller than the inverted index file and can easily fit into the main memory. For presentation simplicity, we use a bitmap to represent the signature of a keyword, where a bit is used to keep the signature of an edge in the road network G. In our implementation, we do not build a signature for a keyword t if its inverted file can fit into one data page. Moreover, we recursively divide the edges by the KD-tree partition method based on the center points of the edges, and each leaf node corresponds to the signature of an edge. Then the signature size of a keyword can be significantly reduced by compacting a tree node if all of its descendant nodes share the same signature value.

Algorithm 2: LoadObjects(e, T)
Input  : e: an edge, T: query keywords
Output : R: objects satisfying the query keyword constraint
1 for each query keyword t ∈ T do
2     if I(e, t) = 0 then
3         return ∅
4 R_i ← objects lying on e with t_i ∈ T;
5 return ∩_{i=1}^{|T|} R_i

Given a set T of query keywords and an edge e, Algorithm 2 illustrates how to utilize the signature-based inverted index to load objects which lie on the edge e and each of which contains all keywords in T. Clearly, none of the objects will be retrieved if we have I(e, t) = 0 for any keyword t ∈ T (Lines 1-3). Let R denote the objects satisfying the query keyword constraint; we have R = ∩_{i=1}^{|T|} R_i, where R_i represents the objects containing the i-th keyword, denoted by t_i, of T.
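The signature test and the posting-list intersection of Algorithm 2 can be illustrated with a small in-memory sketch. It is not the paper's disk-based implementation: the class name `SignatureInvertedIndex`, the `posting_reads` counter (a rough stand-in for I/O), and the toy edges and objects are all assumptions made for illustration, and the B+ tree, bitmap and KD-tree compaction are replaced by plain Python dictionaries and sets.

```python
from collections import defaultdict


class SignatureInvertedIndex:
    """Toy in-memory stand-in for a signature-based inverted index (hypothetical)."""

    def __init__(self):
        # postings[t][e] -> ids of objects on edge e that contain keyword t
        self.postings = defaultdict(lambda: defaultdict(set))
        # (e, t) is in this set exactly when I(e, t) = 1
        self.signature = set()
        # number of posting lists read, used here as a proxy for I/O
        self.posting_reads = 0

    def add(self, edge, obj_id, keywords):
        for t in keywords:
            self.postings[t][edge].add(obj_id)
            self.signature.add((edge, t))

    def load_objects(self, edge, query_keywords):
        """Return ids of objects on `edge` that contain every query keyword."""
        # Lines 1-3 of Algorithm 2: one zero bit is enough to skip the edge.
        if any((edge, t) not in self.signature for t in query_keywords):
            return set()
        # Lines 4-5: read one posting list per keyword and intersect them.
        result = None
        for t in query_keywords:
            self.posting_reads += 1
            hits = self.postings[t][edge]
            result = set(hits) if result is None else result & hits
        return result if result is not None else set()


index = SignatureInvertedIndex()
index.add("e1", "o1", {"t1", "t2"})
index.add("e1", "o2", {"t1"})
index.add("e2", "o3", {"t2"})
index.add("e3", "o4", {"t1"})
index.add("e3", "o5", {"t2"})

query = {"t1", "t2"}
on_e1 = index.load_objects("e1", query)   # o1 contains both keywords
on_e2 = index.load_objects("e2", query)   # pruned: I(e2, t1) = 0, no posting read
on_e3 = index.load_objects("e3", query)   # passes the signature test, yet no match
```

In this toy data the edge `e3` passes the signature test although no single object on it contains both keywords, so its posting lists are read for nothing; `e2` is skipped without touching the postings at all.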
ThedominantcostofAlgorithm2istheloading 10 δ(ni):=δ(n)+w(n,ni); oftheobjectsfromtheinvertedfiles,ourperformanceanaly- 11 if ni isunvisitedthen sisandempiricalevaluationdemonstratethatthesignature 12 F :=LoadObjects((n,ni), q.T ); techniquecan significantly reduce theI/O costs. 13 if F 6=∅then 14 for eachobjecto∈F do 15 d(o):=d(n)+w(n,o); 3.2 SK Search Algorithm 16 R:= R ∪ F; Sinceweaimtosupportthegeneralroadnetworkinwhich various cost models (e.g., distance and travel time) may be 17 Pushni intoQ; used,weadapttheincrementalnetworkexpansion(INE)al- 18 for eachmarkednodeni inadjacentlistofndo gorithm in[16] toincrementallyaccessobjectsbecauseINE 19 for eachobjectolyingonedge(ni,n)do algorithm does not rely on specific restrictions (e.g., Eu- 20 if δ(o)>δ(n)+w(n,o)then 21 δ(o):=δ(n)+w(n,o); clideandistancerestriction)orpre-computation(e.g.,voronoi diagram and shortcut) of the road networks. Observe that 22 Marktheobjecto; thenetworkdistance iscalculatedfromscratchforeachindi- 23 for eachobjecto∈Rdo vidualobjectencounteredin[16],weintegratetheDijkstra’s algorithm [8] with theINEso thatnetwork distances ofthe 24 R:=R\oif δ(o)>δmax; 25 Marko∈Rif oisunmarked; objects are calculated in an accumulative way during the 26 returnR network expansion. Thespatialkeywordbasedpruningisappliedinthesense to the monotonic property of δ in Algorithm 3, we have thatonlyobjects withinthesearch region areaccessed dur- T δ(q,o) > δ if n and n are not marked since δ(q,o) ≥ ingthenetworkexpansion(i.e.,networkdistance constraint) max 1 2 min(δ(q,n ),δ(q,n )). Inthecaseonlyonenode(sayn )is and many non-promising objects are pruned by taking ad- 1 2 1 marked,i.e., δ(o)=δ(n )+w(n ,o), we haveδ(q,o)=δ(o) vantage of the signature-based inverted indexing technique 1 1 if δ(o) ≤ δ since δ(q,n ) ≥ δ . Consequently, all objects (i.e., keyword constraint). 
T 2 T satisfyingthekeywordconstraintandnetwork distance con- Algorithm 3 illustrates the implementation details of the straint are correctly retrieved in Algorithm 3. spatial keyword search on road networks. For presentation simplicity,weassumethequerypointqstartsfromanoden Performance Analysis. The main cost of Algorithm 3 inLine2. Amin-priority-queueQisemployedtokeepnodes consists of two parts: road network traverse (CG) and the accessed during theexpansion where the key of a node n is loading of objects (CO). Let ln and le denote the number denoted by δ(n) where δ(n) = ∞ if the node n has not of nodes and edges accessed during the network expansion, been visited. In this paper, we have δ(n) = δ(q,n) if a then we have CG =lnlog(ln)+le+In where In is the I/O node n is marked (Line 2 and 7). Similarly, we use δ(o) to latency to load the adjacent lists which is ln× tio in the computeδ(q,o), andwe haveδ(o) =δ(q,o) if an object ois worst case assuming the delay of each I/O is tio. Similarly, marked(Line22and25). InAlgorithm3,nodesareaccessed wehaveCO =ρ×le×|q.T|×tiointheworstcasewhereρis in non-decreasing order of their network distances from q. thepercentageofedgeswhichloadobjectsfromtheinverted Line 6 updates δ which is the lower bound of the network files. Note that we have ρ=1 if the signature technique is T distance foranyunmarkednode. Theexpansionterminates not employed. As demonstrated in theempirical study, CO whenδ >δ ,whichimpliesthatδ(q,n )>δ forany isthedominantcostforAlgorithm 3andhenceit iscritical T max x max unmarked node n . For each node n in the adjacent list to minimize ρ by utilizing signature technique. Below, we x i of the node n, Line 10 updates δ(n ) if n is not marked, analyze the expected number of objects loaded regarding i i and the objects on edge (n,n ), which satisfy the keyword differentobjectindexingtechniquesdiscussedinSection3.1. 
i constraint, will be loaded if n is visited at the first time Assume there are on average m objects lying on each i (Line 12), followed by the computation of network distance edge and s keywords for each object which are randomly based on δ(n) (Line 15). If the node ni is already marked, chosen from the vocabulary V. Let C1, C2 and C3 denote Line 18-22 may update the network distance of the objects the number of objects loaded in Algorithm 3 when the ob- andsetthemmarkedsincebothend-nodes(nandn )ofthe jects areorganized byCCAM structure,invertedindexand i edge are marked. At the end of Algorithm 3, objects with signature-basedinvertedindexrespectively. Itisimmediate network distance longer than δmax are prunedfrom R. that C1 = le ×m since we need to load all objects lying on each edgeforthekeyword constrainttest. Recallthatl Correctness. Thecorrectnessofthenetworkdistance com- e represents the number of edges accessed at Line 12 of Al- putation is immediate for the nodes as the computation gorithm 3. As the expected number of objects lying on an follows the Dijkstra’s algorithm [8]. Suppose o is an ob- edge with the keyword t is m×s, we have C = l × m×s ject lyingonedge(n1,n2),accordingtoEquation1,wecan |V| 2 e |V| come up with correct δ(q,o) if n1 and n2 are marked. Due ×|q.T|. Withprobabilityps=1−(1−|Vs|)m,thesignature of theedge isset to1 for each keyword. Therefore, an edge satisfying the keyword constraint). Regarding the example will pass the signature test with probability p|sq.T|. Then inFig.3,assumethatQ={q1,q2,q3}whereq1.T ={t1,t3}, we have C = l ×p|q.T| ×m×s ×|q.T|. According to the q2.T = {t2,t4} and q3.T = {t1,t2}, we have ξ(q1,e) = 0, 3 e s |V| ξ(q ,e) = 5, and ξ(q ,e) = 5. Similarly we have ξ(q ,P) above analysis, the signature-based inverted indexing tech- 2 3 1 = ξ(q ,e ) +ξ(q ,e ) = 0, ξ(q ,P) = 0, and ξ(q ,P) = nique is expected to achieve better performance compared 1 1 1 2 2 3 2 +0 = 2. 
3.3 Enhancement of the Signature Technique

In this subsection, we enhance the effectiveness of the signature technique by partitioning the objects on the same edge. We first introduce the motivation of the method, then a dynamic programming based method is proposed.

Figure 3: Example of Edge Partition. (a) Edge e with objects o1(t1, t3), o2(t2, t3), o3(t1), o4(t4), o5(t1, t4), cut into virtual edges e1 and e2; (b) signature of e; (c) signatures of e1 and e2.

Motivation. As shown in Fig. 3(a), suppose we have five objects o1(t1, t3), o2(t2, t3), o3(t1), o4(t4), and o5(t1, t4) lying on an edge e and the vocabulary V = {t1, t2, t3, t4, t5}. The signatures of the five keywords are shown in Fig. 3(b), where I(e, t1) = 1, I(e, t2) = 1, I(e, t3) = 1, I(e, t4) = 1, and I(e, t5) = 0. Given a query q with q.T = {t2, t4}, all objects will be loaded if e is accessed in Algorithm 3 since I(e, t2) = 1 and I(e, t4) = 1. However, none of the objects contains both t2 and t4. We say this is a false hit if an edge passes the signature test of a query but does not return any object satisfying the keyword constraint. Similarly, a query q with q.T = {t3, t4} may result in a false hit as well. Note that it is a true hit regarding q.T = {t1, t3} since the object o1 contains both t1 and t3. And q.T = {t1, t5} is also not a false hit since it fails the signature test. Intuitively, we can partition the edge e into two virtual edges e1 and e2 as shown in Fig. 3(a), and the signature of e can be refined as shown in Fig. 3(c). Then we can avoid loading objects resulting from the false hit when q.T = {t2, t4} since it fails the signature tests for both e1 and e2.

In this paper, a partition P consists of a set of virtual edges resulting from a number of cuts against an edge, and each virtual edge covers a set of objects along the edge. In Fig. 3(a), we have P = {e1, e2}. There are \binom{m-1}{b} possible partitions if there are m objects on edge e and b+1 virtual edges (i.e., b cuts). In this subsection, we propose a dynamic programming based technique to partition objects lying on an edge for a given number of cuts allowed such that the number of objects loaded due to false hits can be minimized.

Specifically, for a given SK query q, ξ(q, e) denotes the false hit cost of e (i.e., the number of objects loaded due to the false hits of the edge e). Similarly, we can define the false hit cost of a partition P, denoted by ξ(q, P), where

\[ \xi(q, P) = \sum_{e' \in P} \xi(q, e'). \tag{5} \]

Note that ξ(q, e) = 0 if i) e is not visited, ii) e fails the signature test, or iii) it is a true hit (i.e., find an object satisfying the keyword constraint). Regarding the example in Fig. 3, assume that Q = {q1, q2, q3} where q1.T = {t1, t3}, q2.T = {t2, t4} and q3.T = {t1, t2}; we have ξ(q1, e) = 0, ξ(q2, e) = 5, and ξ(q3, e) = 5. Similarly we have ξ(q1, P) = ξ(q1, e1) + ξ(q1, e2) = 0, ξ(q2, P) = 0, and ξ(q3, P) = 2 + 0 = 2. Let ξ(Q, P) denote the false hit cost of the partition P and Pr(q) denote the probability that a query q ∈ Q is issued; we have

\[ \xi(Q, P) = \sum_{q \in Q} \xi(q, P) \times \Pr(q) \tag{6} \]

Algorithm. Suppose an edge e contains m objects which are indexed by their visiting order along the edge (as shown in Fig. 3(a)). Given a number c, we aim to find a partition of e with c cuts which has the minimal false hit cost. The key idea is to use the dynamic programming technique to find the optimal solution based on the local optimal solutions with fewer cuts. Specifically, we use P(i, j, c) to denote all possible partitions each of which divides the objects between o_i and o_j (inclusive) into c+1 virtual edges. Let P*(i, j, c) denote a partition in P(i, j, c) with the minimal cost, denoted by P*_s(i, j, c). Note that we use P_s to denote the cost of a partition P. Initially, we have

\[ P^*_s(i, j, 0) = \xi(Q, P(i, j, 0)) \tag{7} \]

where P(i, j, 0) is the partition with one virtual edge (i.e., c = 0) which contains all objects between o_i and o_j.

The key observation is that, in order to compute P*(i, j, c), we assume that one of the cuts is located at the k-th object with i ≤ k < j. Let Q(i, j, k, c) denote the partitions in P(i, j, c) where one of the cuts is exactly located at the k-th object. We use Q*(i, j, k, c) to represent the partition in Q(i, j, k, c) with the lowest cost, denoted by Q*_s(i, j, k, c). Then, we can come up with Q*(i, j, k, c) by enumerating all possible combinations of the partitions regarding the two sides of the k-th object. Specifically, we have

\[ Q^*_s(i, j, k, c) = \min_{0 \le v \le c-1} \{ P^*_s(i, k, v) + P^*_s(k+1, j, c-v-1) \} \tag{8} \]

Note that we simply set P*_s(i, j, c) = ∞ if there are not enough cutting positions, i.e., j − i < c. By exhausting all possible fixed cutting positions, we have

\[ P^*_s(i, j, c) = \min_{i \le k \le j-1} \{ Q^*_s(i, j, k, c) \} \tag{9} \]

Algorithm 4 illustrates the outline of the dynamic programming based technique to identify a partition with c cuts such that its false hit cost is minimized. Lines 1-2 compute the cost for each possible simple partition (i.e., partition with one virtual edge) according to Equation 7. Lines 3-5 iteratively compute the optimal partitions with k cuts (1 ≤ k < c) for all possible partitions based on Equations 8 and 9. Then we come up with the final solution by identifying P*(1, m, c) based on the above intermediate solutions (Line 6).

Lines 3-5 contribute the dominant cost of Algorithm 4, which is O(c²m³). This is cost-prohibitive in practice, and hence we resort to the greedy heuristic in our implementation. Specifically, starting from the whole edge, i.e., a 0-cut partition which covers all objects, at each iteration we find a cutting position j (1 ≤ j ≤ m−1) such that the cost of the refined partition is minimized. In each iteration, it takes O(m × s_t) time to set up the signatures, where s_t denotes the average number of keywords appearing on the edge e. The partition cost calculation takes O(m × |Q| × q_t), where q_t is the average number of query keywords. Therefore, the total cost of the greedy algorithm is O(c × m × (s_t + |Q| × q_t)), where c is the number of cuts (partitions).
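The false hit cost and the recurrences of Equations 7-9 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: objects are modelled as keyword sets, every edge is assumed to be visited, cut positions are 0-based, and all function names are hypothetical.

```python
from functools import lru_cache
from math import inf


def false_hit_cost(query, segment):
    """xi(q, e'): objects loaded in vain from one (virtual) edge.

    `segment` is a sequence of keyword sets, one per object; the edge is
    assumed to be visited by the search.
    """
    signature = set().union(*segment)         # keywords whose signature bit is 1
    if not query <= signature:                # edge fails the signature test
        return 0
    if any(query <= obj for obj in segment):  # true hit: some object matches
        return 0
    return len(segment)                       # false hit: every object is loaded


def partition_cost(segments, queries, probs):
    """xi(Q, P) of Equation 6 for a partition given as a list of segments."""
    return sum(p * sum(false_hit_cost(q, seg) for seg in segments)
               for q, p in zip(queries, probs))


def optimal_partition(objects, queries, probs, cuts):
    """Minimal false hit cost with exactly `cuts` cuts (Equations 7-9).

    Returns (cost, cut_positions); a position k means "cut after the
    object with 0-based index k".
    """
    objects = tuple(objects)

    @lru_cache(maxsize=None)
    def best(i, j, c):
        if j - i < c:                         # not enough cutting positions
            return inf, ()
        if c == 0:                            # Equation 7: one virtual edge
            return partition_cost([objects[i:j + 1]], queries, probs), ()
        result = (inf, ())
        for k in range(i, j):                 # Equation 9: fix one cut after k
            for v in range(c):                # Equation 8: split remaining cuts
                left, right = best(i, k, v), best(k + 1, j, c - v - 1)
                cost = left[0] + right[0]
                if cost < result[0]:
                    result = (cost, left[1] + (k,) + right[1])
        return result

    return best(0, len(objects) - 1, cuts)
```

On the five objects of Fig. 3 with the query set Q = {q1, q2, q3} and equal weights, the single best cut returned by this sketch falls between o2 and o3, which corresponds to the partition P = {e1, e2} discussed in the text.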
Algorithm 4: Partition(e, c)
  Input : e: edge for partition, c: the number of cuts
  Output: P*(e): the optimal partition of e
  1  for 1 ≤ i ≤ j ≤ m do
  2      Compute P*_s(i, j, 0) based on Equation 7;
  3  for k = 1 to c − 1 do
  4      for 1 ≤ i < j ≤ m do
  5          Compute P*_s(i, j, k) based on Equations 8 and 9;
  6  Compute P*_s(1, m, c) based on Equation 9;
  7  return P*(1, m, c)

Remark 1. When the query log Q is not available, we can generate it on the fly based on the assumption that a highly frequent keyword is more likely to appear as a query keyword.

4. DIVERSIFIED SK SEARCH

In this section, we present a diversified SK search on road networks. Section 4.1 introduces the motivation of our diversified SK search algorithm. Section 4.2 presents an efficient algorithm to continuously maintain a diversification distance threshold for the pruning purpose. Section 4.3 develops an incremental diversified SK search algorithm.

4.1 Motivation

A straightforward implementation of the diversified SK query is to first retrieve a set R of objects based on Algorithm 3, i.e., objects satisfying the spatial keyword constraint, and then feed them to Algorithm 1 (Section 2.3). Since we do not rely on pre-computation of the road network or some specific restrictions, it may be expensive to compute pair-wise diversification distances for objects in R when |R| is large. Similarly, the I/O cost of loading objects grows with |R| in Algorithm 3. This motivates us to develop an effective pruning technique so that a significant number of objects may be pruned from R at a cheap cost.

For presentation simplicity, we assume k is an even number in this section. Let CP(R) denote the core pairs of the objects in R, which are the k/2 pairs of objects chosen in Algorithm 1; that is, the top k/2 pairs of objects with the longest diversification distances, where each object can contribute to at most one pair. The objects involved, denoted by CO, are core objects, which correspond to the k diversified objects in the candidate objects. In this paper, we aim to incrementally process the objects so that some objects can be safely pruned if they have no chance to be chosen as core objects. Based on the fact that any unvisited object in Algorithm 3 is not within network distance δ_T, we can employ a min-priority queue to maintain the marked objects and output them in non-decreasing order of their network distances from q. We use θ_T to record the shortest diversification distance in CP for objects seen so far. It is shown in Section 4.2 that θ_T grows monotonically. This implies that we can safely prune an object o from the diversified search if there is no other object o′ ∈ R such that θ(o, o′) ≥ θ_T. Another important issue is how to efficiently maintain θ_T, as it requires us to incrementally compute core pairs for objects seen so far. It is expensive to re-compute CP from scratch (i.e., invoke Algorithm 1) against the arrival of each new object. Section 4.2 presents an efficient algorithm to incrementally maintain θ_T. In Section 4.3, we propose an efficient incremental diversified SK search algorithm based on the pruning technique.

4.2 Incrementally Maintain the Threshold θ_T

In this subsection, we introduce how to incrementally maintain θ_T. Below, we first introduce some notations and a lemma.

For presentation simplicity, we assume the diversification distances are distinct values and core pairs in CP(R) are decreasingly ordered by their diversification distances, where ρ_i(R) denotes the objects in the i-th core pair. Regarding the example of Fig. 4, we have ρ_1(R) = {o1, o2} and ρ_2(R) = {o3, o4}. We say a new arriving object o is dominated by a core object o′ ∈ CO(R) if θ(o, o′) < θ(o′, o_y), where (o′, o_y) is a core pair. For instance, in Fig. 4 o4 dominates o if θ(o, o4) < θ(o3, o4). We use R_i to denote the set of the first i objects arrived. The following lemma implies that we do not need to consider a pair (o, o′) for the update of core pairs when o arrives if o is dominated by o′.

Figure 4: Update CP (a new object o arrives among the current core pairs).
Figure 5: Example of Pruning.

Lemma 1. Let o be the (i+1)-th arrived object; (o, o′) cannot become a core pair if o is dominated by a core object o′ ∈ CO(R_i).

Proof. The proof is by contradiction. Suppose (o′, o_y) is the j-th core pair of R_i, i.e., o′ ∈ ρ_j(R_i); we have ρ_y(R_i) = ρ_y(R_{i+1}) for any 1 ≤ y < j if (o, o′) becomes a core pair in R_{i+1}. This is because the distance of the y-th pair is longer than θ(o, o′) and o cannot contribute to the y-th pair for any 1 ≤ y < j. Then the object o_y is not chosen for the first j−1 core pairs; this conflicts with the fact that (o, o′) is a core pair of R_{i+1} while θ(o, o′) < θ(o′, o_y). □

Algorithm 5 illustrates how to efficiently update core pairs against the arrival of a new object, where CP and CO denote the current core pairs and core objects respectively. We assume there are at least k objects (i.e., |CP| = k/2) before the new object o arrives. For the object o, we use φ(o) = {o_y} to denote the objects where θ(o, o_y) > θ_T and o_y does not dominate o (Line 2). In Line 3, o′ represents the object in φ(o) with the longest distance to o. Then we have the following three cases:

i) φ(o) = ∅ (Line 15).
ii) o′ is not a core object (Lines 6-8).
iii) o′ is a core object (Lines 10-13).

In the first case, the algorithm immediately terminates. In the second case, Line 7 removes the k/2-th core pair from CP and then adds a new core pair (o, o′). For instance, we have CP = {(o1, o2), (o, o7), (o3, o4)} if o′ = o7 in the example of Fig. 4. Then θ_T is updated and the algorithm terminates. In the third case, we use (o, o′) to replace the core pair (o′, o_y), where (o′, o_y) is a core pair before the arrival of o. θ_T is updated and we repeat the while loop (Line 1) by treating o_y as the new arriving object o, which may contribute to core pairs again.
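The three cases of the incremental update can be sketched as follows. This is an illustrative reading of the procedure described in the text, not the authors' code: objects are plain hashable values, `theta` is any symmetric distance function supplied by the caller, distances are assumed distinct as in the paper, and the function and variable names are hypothetical.

```python
def update_core_pairs(o, seen, core_pairs, theta):
    """Update `core_pairs` in place when object `o` arrives.

    seen       -- objects that arrived before `o` (not modified)
    core_pairs -- list of k/2 disjoint pairs (tuples) over `seen`
    theta      -- symmetric diversification distance theta(a, b)
    Returns the new threshold theta_T.
    """
    def pair_dist(pair):
        return theta(pair[0], pair[1])

    def pair_of(x):
        for pair in core_pairs:
            if x in pair:
                return pair
        return None

    pool = list(seen) + [o]
    while True:
        theta_t = min(pair_dist(p) for p in core_pairs)
        # phi(o): objects far enough from o that do not dominate o.
        phi = []
        for y in pool:
            if y == o or theta(o, y) <= theta_t:
                continue
            pair = pair_of(y)
            if pair is not None and theta(o, y) < pair_dist(pair):
                continue                       # y dominates o
            phi.append(y)
        if not phi:                            # case i): nothing to update
            return theta_t
        o_prime = max(phi, key=lambda y: theta(o, y))
        old_pair = pair_of(o_prime)
        if old_pair is None:                   # case ii): o' is not a core object
            core_pairs.remove(min(core_pairs, key=pair_dist))
            core_pairs.append((o, o_prime))
            return min(pair_dist(p) for p in core_pairs)
        # case iii): o' is a core object; its old partner re-enters as `o`.
        partner = old_pair[0] if old_pair[1] == o_prime else old_pair[1]
        core_pairs.remove(old_pair)
        core_pairs.append((o, o_prime))
        o = partner
```

As a toy usage, take objects on a line with theta(a, b) = |a − b|, seen objects 0, 1, 4, 9 and core pairs (0, 9) and (1, 4). When 20 arrives, the sketch first replaces (0, 9) by (20, 0), then lets 9 re-enter and replace (1, 4) by (9, 1), and stops once 4 finds no partner above the threshold; the threshold grows from 3 to 8.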
Algorithm 5: Update Core Pairs and θ_T
  Input : o: the arriving new object
  Output: Updated core pairs and the distance threshold θ_T
  1  while true do
  2      φ(o) ← {o_y | θ(o, o_y) > θ_T and o_y does not dominate o};
  3      o′ ← the furthest object in φ(o) w.r.t. o;
  4      if φ(o) ≠ ∅ then
  5          if o′ ∉ core objects CO then
  6              (o1, o2) ← the k/2-th core pair in CP;
  7              CP := CP \ (o1, o2); CP := CP ∪ (o, o′);
  8              Update θ_T and terminate the algorithm;
  9          else
  10             o_y ← the object o_y where (o′, o_y) is a core pair;
  11             CP := CP \ (o′, o_y); CP := CP ∪ (o, o′);
  12             Update θ_T;
  13             o ← o_y;
  14     else
  15         Terminate the algorithm;

Time Complexity. Suppose there are at most n objects in R; it takes O(n) time to choose o′. In the algorithm correctness analysis below, we show that the while loop can repeat at most k/2 times. Consequently, Algorithm 5 takes O(kn) time in the worst case.

Algorithm Correctness. The correctness of Algorithm 5 is immediate for case i). In the second case (Lines 6-8), suppose there are j core pairs with distances longer than θ(o, o′); they will remain unchanged because their corresponding objects cannot contribute to any new core pair with the object o according to the selection criteria of o′; that is, o cannot affect the first j core pairs in CP. Then (o, o′) will serve as the (j+1)-th core pair in CP. This implies that o will not affect the selection of core pairs after the (j+1)-th pair since o′ is not a core object before the arrival of o, and hence we only need to remove the previous k/2-th core pair. Regarding the third case (Lines 10-13), we use (o′, o_y) to denote the j-th core pair before the arrival of o. With the same rationale, o cannot affect any core pair ranked before (o′, o_y), and (o, o′) will replace (o′, o_y) as the j-th core pair. Based on the facts that (o′, o_y) is the j-th core pair before the arrival of o, and o_y is kicked out from the core objects by o, o_y cannot contribute to the first j core pairs in the future. For instance, if (o3, o4) is replaced by (o, o4) in the example of Fig. 4, o3 cannot appear in the first two core pairs anymore. This implies that the while loop of Algorithm 5 can repeat at most k/2 times.

Monotonic Property of θ_T. Since θ_T will be updated when the k/2-th core pair is replaced in case ii) or case iii) of Algorithm 5 by another pair with longer distance, the monotonic property of θ_T is immediate, as stated in the following theorem.

Theorem 1. The diversification distance threshold θ_T grows monotonically against the arrival of the objects.

4.3 Incremental Diversified SK Search

In this subsection, we introduce the pruning technique based on the monotonic property of θ_T, as well as the details of the incremental diversified SK search algorithm.

Diversity Based Pruning Technique. Fig. 5 illustrates a motivating example of the diversity based pruning technique. We use γ to denote the maximal network distance among the objects visited so far, e.g., objects in the shaded area. For any unvisited object o1, we have δ(q, o1) ≥ γ since objects arrive according to the increasing order of their network distances. Moreover, we have δ(o1, o2) ≤ 2 × δ_max for two unvisited objects o1 and o2 since δ(q, o1) ≤ δ_max and δ(q, o2) ≤ δ_max. According to Equation 3, we can come up with the upper bound of θ(o1, o2), denoted by θ̄_u, based on γ and δ_max. Similarly, we can derive the upper bound of θ(o, o1) between a visited object o and an unvisited object o1, denoted by θ̄_u(o), based on the fact that δ(o, o1) ≥ δ(q, o) + γ for any visited object o. Clearly, we can safely prune all unvisited objects if θ̄_u < θ_T (i.e., upper bound for any two unvisited objects) and θ̄_u(o) < θ_T for any visited object o (i.e., upper bound between o and any unvisited object); that is, none of the unvisited objects can become a core object in the diversified search. With the same rationale, a visited object o can also be eliminated from future computation if θ̄_u(o) < θ_T and θ(o, o′) < θ_T for any other visited object o′.

Algorithm 6: Diversified SK Search(q, k, δ_max)
  Input : q: query object, k: number of objects requested,
          δ_max: maximal network distance
  Output: S: diversified k objects regarding SK query
  1  Compute CP and θ_T on the first k arrived objects;
  2  for each o arrived in order (from Algorithm 3) do
  3      Update core pairs CP and threshold θ_T (Algorithm 5);
  4      γ ← δ(q, o);
  5      o1, o2 ← two objects with δ(q, o1) := δ(q, o2) := γ
  6          and δ(o1, o2) := 2 × δ_max;
  7      if θ(o1, o2) < θ_T then
  8          F := true;
  9          for each object o_x visited so far do
  10             Let δ(o_x, o1) := 2 × γ;
  11             if θ(o_x, o1) > θ_T then
  12                 F := false; Go to Line 2;
  13             else if θ(o_x, o′) < θ_T for any visited object o′ then
  14                 Remove o_x from future computation;
  15         if F = true then
  16             Terminate the network expansion of Algorithm 3;
  17 return CP

Algorithm. Algorithm 6 illustrates the details of the incremental diversified SK search. Line 1 initializes core pairs and θ_T based on the first k arrived objects in Algorithm 3. Then Lines 2-16 incrementally maintain core pairs and θ_T against the arriving objects (i.e., objects incrementally output from Algorithm 3). Recall that Algorithm 3 can incrementally output the objects satisfying the spatial keyword constraint in increasing order of their network distances to q. For each new arriving object o, we can calculate the pair-wise diversification distances between o and other visited objects. Note that we may invoke Algorithm 3 to conduct the network distance computation, which terminates when all pair-wise network distances are calculated between o and other visited objects. Based on the diversity based pruning technique, we terminate the network expansion if all unvisited objects cannot contribute to the diversified k results (Line 16). Finally, Line 17 returns the objects in core pairs for the diversified SK search.

In each iteration of Algorithm 6, it takes O(n_v log(n_v) + n_e + n_o) time to compute the network distances for each incoming object against existing objects, where n_v and n_e denote the number of nodes and edges accessed in the road network respectively, and n_o is the number of objects accessed so far. Line 3 takes O(n_o × k) time to maintain core pairs and θ_T. The pruning procedure (Lines 4-16) takes O(n_o) time. Consequently, the total time cost of Algorithm 6 is O(n_o × (n_v log(n_v) + n_e + n_o × k)). This implies that it is essential to reduce n_o, i.e., the number of objects accessed in the diversified SK search.
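For comparison with the incremental scheme, the from-scratch selection of core pairs can be sketched as below. It assumes, following the description in Section 4.1, that core pairs are the top k/2 pairs with the longest diversification distances in which each object appears at most once; Algorithm 1 itself is not reproduced here, and the function name and the `theta` callback are illustrative.

```python
from itertools import combinations


def core_pairs_from_scratch(objects, theta, k):
    """Greedily pick k/2 disjoint pairs with the longest distances.

    Returns (core_pairs, theta_T), where theta_T is the shortest distance
    among the chosen pairs (None if no pair could be chosen).
    """
    # Enumerate all pairs, longest diversification distance first.
    ranked = sorted(combinations(objects, 2),
                    key=lambda pair: theta(*pair), reverse=True)
    used, chosen = set(), []
    for a, b in ranked:
        if len(chosen) == k // 2:
            break
        if a in used or b in used:   # each object joins at most one pair
            continue
        chosen.append((a, b))
        used.update((a, b))
    theta_t = min(theta(a, b) for a, b in chosen) if chosen else None
    return chosen, theta_t
```

Every call enumerates all pairs of the objects seen so far, which is why recomputing the core pairs for each arriving object is expensive and why the threshold is maintained incrementally instead.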
5. PERFORMANCE EVALUATION

In this section, we present results of a comprehensive performance study to evaluate the efficiency and scalability of the proposed techniques in the paper. In our implementation, the road network is organized by the CCAM structure as discussed in Section 2.2. We evaluate the effectiveness of the following indexing techniques for spatio-textual objects in the spatial keyword search (Algorithm 3).

• IR: Inverted R-tree [23], which is a natural extension of the spatial object indexing method in [16] (see footnote 2).
• IF: The inverted indexing technique described in Section 3.1. Note that the same indexing structure is employed in [17] for a different problem.
• SIF: The signature-based inverted indexing technique proposed in Section 3.1.
• SIF-P: Enhanced SIF by the partition technique proposed in Section 3.3. It is reported in our initial experiments that the greedy approach is up to two orders of magnitude faster than the dynamic programming based approach while they achieve similar performance in terms of I/O costs reduced. Therefore, in the experiment we use the greedy partition approach with the maximal number of cuts set to 3.

Footnote 2: We slightly change the find entity function in [16] to efficiently support spatial keyword search for the inverted R-tree index. Instead of searching entities on a global R-tree which organizes all objects, we only need to explore the inverted R-trees related to the query keywords.

We also evaluate the efficiency of the diversified spatial keyword search algorithm where the spatio-textual objects are organized by the SIF indexing structure.

• SEQ: A straightforward implementation of the diversified spatial keyword search algorithm discussed in Section 4.1. Specifically, we first retrieve all objects satisfying the spatial keyword constraint (Algorithm 3), and then feed the results into the greedy Algorithm 1 in Section 2.3.
• COM: The incremental diversified spatial keyword search algorithm proposed in Section 4 (Algorithm 6).

Datasets. Performance of various algorithms is evaluated on both real and synthetic datasets. The following four datasets are deployed in the experiments. The road network of NA is obtained from the North America Road Network (http://www.cs.utah.edu/~lifeifei/SpatialDataset.htm) with 175,812 nodes and 179,178 road segments. The spatio-textual objects are obtained from the US Board on Geographic Names (http://geonames.usgs.gov), in which each object is associated with a geographic location and a short text description. Similarly, we generate dataset SF by obtaining the spatial locations from corresponding spatial datasets from Rtree-Portal (http://www.rtreeportal.org) and randomly geo-tagging these objects with user-generated textual contents from 20 Newsgroups (http://people.csail.mit.edu/jrennie/20Newsgroups). The road network of SF is obtained from the San Francisco Road Network (http://www.cs.utah.edu/~lifeifei/SpatialDataset.htm), where there are 174,955 nodes and 223,000 edges. The dataset TW contains 11.5 million tweets with geo-locations [14] from May 2012 to August 2012, and the road network is obtained from the San Francisco Bay Area (http://www.dis.uniroma1.it/challenge9/download.shtml) with 321,270 nodes and 800,172 edges. To investigate the scalability of the algorithms, we also include the synthetic dataset SYN. The locations of the objects are randomly chosen from the SF dataset, and their corresponding keywords are obtained from a vocabulary whose term frequencies follow the Zipf distribution, where the parameter z varies from 0.9 to 1.3 with default value 1.1. By default, the number of objects (n_o), the vocabulary size (n_v) and the number of keywords per object (n_t) are set to 1 million, 100 thousand and 15 respectively. Note that we move an object to its closest road segment if it does not lie on any edge in the road network. In the experiments, the locations of all datasets (spatio-textual objects and road networks) are scaled to the 2-dimensional space [0, 10000]^2. Table 2 summarizes the important statistics of the four datasets.

Property           SYN     NA      TW      SF
# objects          1M      2.2M    11.5M   2.25M
vocabulary size    100K    208K    1.6M    81K
avg. # keywords    15      6.8     10.8    26
# nodes            17K     179K    321K    175K
# edges            223K    179K    800K    223K

Table 2: Dataset Statistics

Workload. A workload for the SK query and diversified SK query consists of 500 queries. The average query response time, the average number of disk accesses and the average number of candidate objects are employed to evaluate the performance of the algorithms. The query locations are randomly selected from the locations of the underlying objects. On the other hand, the likelihood of a keyword t being chosen as a query keyword is freq(t) / Σ_{t_i ∈ V} freq(t_i), where freq(t) is the term frequency of t in the dataset. The number of query keywords (l) varies from 1 to 4 with default value 3. By default, the maximal search distance (δ_max) is set to 500 × l. In addition, the number of results (k) grows from 5 to 20 and λ varies from 0.5 to 0.9 for diversified spatial keyword search. By default, k and λ are set to 10 and 0.8 respectively.

All algorithms in the experiments are implemented in Java, and important data structures (e.g., R-tree and B+ tree) are obtained from the source code package released by [21]. Experiments are run on a PC with Intel Xeon 2.40GHz dual CPU and 4G memory running Debian Linux. For the construction of SIF-P, we only consider the edges whose number of objects ranks in the top 10% compared with other edges, and the number of cuts is set to 3. We use an LRU memory buffer whose size is set to 2% of the network dataset size. All index structures are disk resident, and the page size is fixed to 4096 bytes.

5.1 Evaluating SK Search

In this subsection, we evaluate the performance of spatial keyword search against four indexing structures: IR, IF, SIF and SIF-P.

Evaluation on different datasets. We investigate the query response time, index construction time and index size of four indexing structures against four datasets NA, SF, SYN and TW, where other parameters are set to default values. Fig. 6(a) reports the response time of four algorithms.
The performance of IR is nearly 4 times slower than the other three indexing techniques. This is because the construction of IR is independent of the underlying road network structure [...] SIF and SIF-P can further significantly reduce the response time by utilizing the signature technique. As expected, it is reported that the refined signature in SIF-P can achieve better performance. As shown in Fig. 6(b), compared with IF and SIF, SIF-P takes the longest time for the index construction, as the partition of the edges may be time consuming. Due to the compactness of the signatures, Fig. 6(c) shows that SIF and SIF-P only take slightly more space to keep the signatures compared with IF, which only keeps the inverted files.

Figure 6: SK search on Diff. Datasets (NA, SF, SYN, TW; methods IR, IF, SIF, SIF-P). (a) Response Time: Processing Time (s); (b) Index Construction Time: Processing Time (min); (c) Index Size (G).

Effect of the number of query keywords (l). We evaluate the effect of the number of query keywords in Fig. 7 on dataset NA, where l varies from 1 to 4. Fig. 7 reports that the performance of three algorithms in terms of the query response time and the number of I/O accesses degrades against the growth of l. As expected, SIF significantly outperforms IF since the signature-based index can reduce the number of I/Os invoked by false hits. SIF-P achieves better performance compared with SIF because the advanced partition technique can further improve the effectiveness of the signature technique.

Figure 7: Diff. number of query keywords (l) (methods IF, SIF, SIF-P). (a) Time: Response Time (s); (b) I/O cost: # I/O.

Effect of the search range (δ_max). Fig. 8(a) illustrates the query response time of the algorithms as a function of δ_max on dataset NA. It is shown that IF is much more sensitive to the growth of δ_max in comparison with SIF and SIF-P. This is because the number of false hits grows quickly against δ_max and IF cannot avoid the unnecessary I/O accesses incurred by the false hits. As expected, Fig. 8(b) shows that the number of candidate objects on four datasets increases where δ_max grows from 250 to 1,500.

Figure 8: Diff. search range (δ_max). (a) Time: Response Time (s) for IF, SIF, SIF-P; (b) # Candidates for NA, SF, SYN, TW.

Evaluation on space cost-effectiveness. To evaluate the space cost-effectiveness of the advanced signature-based indexing technique (SIF-P), Fig. 9 reports the number of false hits on SF where the number of maximal cuts allowed grows from 2 to 32. As expected, the performance of SIF-P improves when the number of maximal cuts (i.e., available index space) grows. Meanwhile, we also evaluate another simple and intuitive approach, namely the group-based indexing technique (SIF-G). [...] and only the edges containing an object with both terms t1 and t2 are kept in its corresponding inverted list. For each number of maximal cuts in Fig. 9, we choose the top x most frequent terms so that the space occupied by their pairwise combined terms is 10 times larger than the signature file of the SIF-P index structure. For instance, the size of the signature file of the SIF-P index is 53M when the number of maximal cuts goes to 32. For the corresponding SIF-G, we consider the pairwise combination of the top 25 most frequent terms, which leads to 530M extra index space compared with the SIF technique. As shown in Fig. 9, the advanced signature-based indexing technique is more space cost-effective because the size of the signature files is much smaller than that of inverted files.

Figure 9: Diff. cuts (# False Hits for SIF-P and SIF-G; maximal cuts from 2 to 32).
Figure 10: Diff. logs (Processing Time (s) for SIF-P-Real, SIF-P-Freq and SIF-P-Rand on NA and TW).

Evaluation on query logs. In Section 3.3, we use the [...] techniques are evaluated in the experiment on two real datasets, NA and TW respectively. SIF-P-Real, SIF-P-Freq and SIF-P-Rand are advanced signature-based indexing structures constructed based on different query logs. Specifically, the query load is used as the query log in SIF-P-Real. For SIF-P-Freq, we generate the query log based on the frequency of the keywords on each edge, which is the default setting in the experiment. The query log for SIF-P-Rand is generated by randomly choosing keywords on each edge. As expected, Fig. 10 shows that SIF-P-Real achieves the best performance since we use the query load to build the index. SIF-P-Freq has similar performance because we assume that a query keyword is chosen based on its corresponding