k-Nearest Neighbors on Road Networks: A Journey in Experimentation and In-Memory Implementation

Technical Report

Tenindra Abeywickrama, Muhammad Aamir Cheema, David Taniar
Faculty of Information Technology, Monash University, Australia
{tenindra.abeywickrama,aamir.cheema,david.taniar}@monash.edu

arXiv:1601.01549v2 [cs.DS] 10 Aug 2016

ABSTRACT

A k nearest neighbor (kNN) query on road networks retrieves the k closest points of interest (POIs) by their network distances from a given location. Today, in the era of ubiquitous mobile computing, this is a highly pertinent query. While Euclidean distance has been used as a heuristic to search for the closest POIs by their road network distance, its efficacy has not been thoroughly investigated. The most recent methods have shown significant improvement in query performance. Earlier studies, which proposed disk-based indexes, were compared to the current state-of-the-art in main memory. However, recent studies have shown that main memory comparisons can be challenging and require careful adaptation. This paper presents an extensive experimental investigation in main memory to settle these and several other issues. We use efficient and fair memory-resident implementations of each method to reproduce past experiments and conduct additional comparisons for several overlooked evaluations. Notably, we revisit a previously discarded technique (IER), showing that, through a simple improvement, it is often the best performing technique.

1. INTRODUCTION

Cisco reports that more than half a billion mobile devices were activated in 2013 alone, and 77% of those devices were smartphones. Due to the surge in adoption of smartphones and other GPS-enabled devices, and cheap wireless network bandwidth, map-based services have become ubiquitous. For instance, the GlobalWebIndex reported that Google Maps was the most used smartphone app in 2013 with 54% of smartphone users having used it [1]. Finding nearby facilities (e.g., restaurants, ATMs) is among the most popular queries issued on maps. Due to their popularity and importance, k nearest neighbor (kNN) queries, which find the k closest points of interest (objects) to a given query location, have been extensively studied in the past.

While related to the shortest path problem in many ways, the kNN problem on road networks introduces new challenges. Since the total number of objects is usually much larger than k, it is not efficient to compute the shortest paths (or network distances) to all objects to determine which are kNNs. The challenge is to not only ignore the objects that cannot be kNNs but also the road network vertices that are not associated with objects. Recently, there has been a large body of work to answer kNN queries on road networks. Some of the most notable algorithms include Incremental Network Expansion (INE) [23], Incremental Euclidean Restriction (IER) [23], Distance Browsing [25], Route Overlay and Association Directory (ROAD) [20,21], and G-tree [30,31]. In this paper, we conduct a thorough experimental evaluation of these algorithms. This is the extended technical report of a conference paper [6].

1.1 Motivation

1. Neglected Competitor. IER [23] was among the first kNN algorithms on road networks. It has often been the worst performing method and as a result is no longer included in comparisons. The basic idea of IER is to compute shortest path distances using Dijkstra's algorithm to the closest objects in terms of Euclidean distance. Although many significantly faster shortest path algorithms have been proposed in recent years, surprisingly, IER has never been compared against other kNN methods using any algorithm other than Dijkstra. To ascertain the true performance of IER it must be integrated with state-of-the-art shortest path algorithms.

2. Discrepancies in Existing Results. We note several discrepancies in the experimental results reported in some of the most notable papers on this topic. ROAD is seen to perform significantly worse than Distance Browsing and INE in [30]. But according to [20], ROAD is experimentally superior to both Distance Browsing and INE. The results in both [20] and [30] show Distance Browsing has worse performance than INE. In contrast, Distance Browsing is shown to be more efficient than INE in [25]. These contradictions identify the need for reproducibility.

3. Implementation Does Matter. Similar to a recent study [28], we observe that simple implementation choices can significantly affect algorithm performance. For example, G-tree utilizes distance matrices that can be implemented using either hash-tables or arrays and, on the surface, both seem reasonable choices. However, the array implementation in fact performs more than an order of magnitude faster than the hash-table implementation. We show that this is due to data locality in G-tree's index and its impact on cache performance. In short, seemingly innocuous choices can drastically change experimental outcomes. We also believe the discrepancies reported above may well be due to different choices made by the implementers. Thus it is critical to provide a fair comparison of existing kNN algorithms using careful in-memory implementations.

4. Overlooked Evaluation Measures/Settings. All methods studied in this paper decouple the road network index from that of the set of objects, i.e., one index is created for the road network and another to store the set of objects. Although existing studies evaluate the road network indexes, no study evaluates the behaviour of each individual object index. The construction time and storage cost for these object indexes may be critical information for developers when choosing methods, especially for object sets that change regularly. Additionally, kNN queries have not been investigated for travel time graphs (only travel distance), which is also a common scenario in practice. Finally, the more recent techniques (G-tree and ROAD) did not include comparisons for real-world POIs.
1.2 Contributions

Below we summarize the contributions we make in this paper.

1. Revived IER: We investigate IER with several efficient shortest path techniques for the first time (see Section 5). We show that the performance of IER is significantly improved when better shortest path algorithms are used. This occurs to the point that IER is the best performing method in most settings, including travel time road networks where Euclidean distance is a less effective lower bound.

2. Highly Optimised Algorithms Open-Sourced: We present efficient implementations of five of the most notable methods (IER, INE, Distance Browsing, ROAD and G-tree). Firstly, we have carefully implemented each method for efficient performance in main memory as described in Section 6. Secondly, we thoroughly checked each algorithm and made various improvements that are applicable in any setting, as documented in Appendix A. The source code and scripts to run experiments have been released as open-source [2], making our best effort to ensure it is modular and re-usable.

3. Reproducibility Study: With efficient implementations of each algorithm, we repeat many experiments from past studies on many of the same datasets in Section 7. Our results provide a deeper understanding of the state-of-the-art with new insights into the weaknesses and strengths of each technique. We also show that there is room to improve kNN search heuristics by demonstrating that G-tree can be made more efficient by using Euclidean distances.

4. Extended Experiments and Analysis: Our comprehensive experimental study in Section 7 extends beyond past studies by: 1) comparing object indexes for the first time; 2) revealing new trends by comparing G-tree with another advanced method (ROAD) on larger datasets for the first time; 3) evaluating all methods (including ROAD and G-tree) on real-world POIs; and 4) evaluating applicable methods on travel time road networks.

5. Guidance on Main-Memory Implementations: In Section 6 we also demonstrate how simple choices can severely impact algorithm performance. We share an in-depth case study to give insights into the relationship between algorithms and in-memory performance with respect to data locality and cache efficiency. Additionally, we highlight the main choices involved and illustrate them through examples and experimental results, to provide hints to future implementers. Significantly, these insights are potentially applicable to any problem, not just those we study here.

2. BACKGROUND

2.1 Problem Definition

We represent a road network as a connected undirected graph G = (V,E) where V is the set of vertices and E is the set of edges. For two adjacent vertices u,v ∈ V, we define the edge between them as e(u,v), with weight w(u,v) representing any positive measure such as distance or travel time. We define the shortest path distance, hereafter network distance, between any two vertices u,v ∈ V as d(u,v), the minimum sum of weights connecting u and v. For conceptual simplicity, similar to the existing studies [25,30], we assume that each object (i.e., POI) and query is located on some vertex in V. Given a query vertex q and a set of object vertices O, a kNN query retrieves the k closest objects in O based on their network distances from q.

2.2 Scope

We separate existing kNN techniques into two broad categories based on the indexing they use: 1) blended indexing; and 2) decoupled indexing. Techniques that use blended indexing [11,15,19] create a single index to store the objects as well as the road network. For example, VN3 [19] is a notable technique that uses a network Voronoi diagram based on the set of objects to partition the network. In contrast, decoupled indexing techniques [20,23,25,30] use two separate indexes for the object set and road network, which is more practical and has several advantages as explained below.

Firstly, a real-world kNN query may be applied to one of many object sets, e.g., return the k closest restaurants or locate the nearest parking space. Blended indexing must repeatedly index the road network for each type of object, entailing huge space and pre-processing time overheads. But decoupled indexing requires only one road network index regardless of the number of object sets, resulting in lower storage and pre-processing cost. Secondly, if there is any change in an object set, blended indexing must update the whole index and reprocess the entire road network, whereas decoupled techniques need only update the object index. For example, the network-based Voronoi diagram must be updated, resulting in expensive re-computations [19]. Conversely, in decoupled indexing, the object indexes (e.g., R-tree) are typically much cheaper to update. The problem is more serious for object sets that change often, e.g., if the objects are the nearest available parking spaces.

Due to these advantages, all recent kNN techniques use decoupled indexing. In this paper, we focus on the most notable kNN algorithms that employ decoupled indexing. These algorithms either employ an expansion-based method or a heuristic best-first search (BFS). The expansion-based methods encounter kNNs in network distance order. Heuristic BFS methods instead employ heuristics to evaluate the most promising kNN candidates, not necessarily in network distance order, potentially terminating sooner. We study the five most notable methods, which include two expansion-based methods, INE [23] and ROAD [20], and three heuristic BFS methods, IER [23], Distance Browsing (DisBrw) [25] and G-tree [30].

Given the rapid growth in smartphones and the corresponding widespread use of map-based services, applications must employ fast in-memory query processing to meet the high query workload. In-memory processing has become viable due to the increases in main-memory capacities and its affordability. Thus, we limit our study to in-memory query processing. However, we remark that disk-based settings are also important but are beyond the scope of this paper, mainly due to the space limitation.
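To make the problem definition of Section 2.1 concrete, the following is a minimal reference sketch, not one of the methods evaluated in this report. It answers a kNN query by computing d(q, o) for every object with Dijkstra's algorithm, which is exactly the exhaustive approach the methods studied here try to avoid. The graph, the vertex names and the two object sets are hypothetical; keeping the object sets separate from the graph mirrors the decoupled setting described above.

```python
import heapq

def network_distances(graph, source):
    """Dijkstra: network distance d(source, v) for every reachable vertex v."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue  # stale queue entry, u was already settled with a smaller distance
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return dist

def knn_brute_force(graph, objects, q, k):
    """Reference answer: rank every object by d(q, o) and keep the k closest."""
    dist = network_distances(graph, q)
    ranked = sorted((dist[o], o) for o in objects if o in dist)
    return ranked[:k]

# Hypothetical undirected road network: vertex -> [(neighbor, weight), ...]
GRAPH = {
    "a": [("b", 2.0), ("c", 5.0)],
    "b": [("a", 2.0), ("c", 1.0), ("d", 4.0)],
    "c": [("a", 5.0), ("b", 1.0), ("e", 3.0)],
    "d": [("b", 4.0), ("e", 1.0)],
    "e": [("c", 3.0), ("d", 1.0)],
}

# One graph, several independent object sets (hypothetical POI types).
OBJECT_SETS = {"restaurants": {"c", "e"}, "parking": {"d"}}

nearest_restaurant = knn_brute_force(GRAPH, OBJECT_SETS["restaurants"], "a", 1)
nearest_parking = knn_brute_force(GRAPH, OBJECT_SETS["parking"], "a", 1)
```

Changing an object set here touches only `OBJECT_SETS`; the graph structure is reused unchanged, which is the practical benefit attributed to decoupled indexing.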
3. METHODS

We now describe the main ideas behind each method evaluated by our study. Some methods propose a road network index and a kNN query algorithm to use it. In some cases, such as G-tree, we refer to both the index and kNN algorithm by the same name.

3.1 Incremental Network Expansion

Incremental Network Expansion (INE) [23] is a method derived from Dijkstra's algorithm. As in Dijkstra, INE maintains a priority queue of the vertices seen so far (initialised with the query vertex q). The search is expanded to the nearest of these vertices v. If v ∈ O then it is added to the result set as one of the kNNs, and if v is the kth object then the search is terminated. Otherwise the edges of v are used to relax the distances to its neighbors and the expansion continues. As in Dijkstra's algorithm, relaxation involves updating the minimum network distances to the neighbors of v using the network distance through v. The disadvantage of INE is that it visits all nodes closer to q than the kth object, which may be considerable if this object is far from q.

3.2 Incremental Euclidean Restriction

Incremental Euclidean Restriction (IER) [23] uses Euclidean distance as a heuristic to retrieve candidates from O, as it is a lower bound on network distance for road networks with travel distance edges. Firstly, IER retrieves the Euclidean kNNs, e.g., using an R-tree [24]. It then computes the network distance to each of these k objects and sorts them in this order. This set becomes the candidate kNNs, and the network distance to the furthest candidate (denoted as Dk) is an upper bound on the distance to the true kth nearest neighbor. Now, IER retrieves the next nearest Euclidean neighbor p. If the Euclidean distance to p is dE(q,p) and dE(q,p) ≥ Dk, then p cannot be a better candidate by network distance than any current candidate. Moreover, since p is the nearest remaining Euclidean neighbor, no later Euclidean neighbor can be better either, so the search can be terminated. However, if dE(q,p) < Dk then p may be a better candidate. In this case, IER computes the network distance d(q,p). If d(q,p) < Dk, p is inserted into the candidate set (removing the furthest candidate and updating Dk). This continues until the search is terminated or there are no more Euclidean NNs.
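The two procedures above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation evaluated in this report: the incremental Euclidean NN retrieval is replaced by a sorted list (a stand-in for an R-tree), the network distance oracle passed to `ier` is plain Dijkstra (here reusing `ine` with a single target), and the six-vertex graph is hypothetical. Edge weights equal Euclidean edge lengths so that Euclidean distance is a valid lower bound.

```python
import heapq
import math

def ine(graph, objects, q, k):
    """INE: Dijkstra-style expansion from q that stops once k objects are settled."""
    dist = {q: 0.0}
    settled = set()
    queue = [(0.0, q)]
    result = []
    while queue and len(result) < k:
        d, v = heapq.heappop(queue)
        if v in settled:
            continue
        settled.add(v)
        if v in objects:
            result.append((d, v))  # objects are encountered in network distance order
        for u, w in graph[v]:
            nd = d + w
            if nd < dist.get(u, math.inf):
                dist[u] = nd
                heapq.heappush(queue, (nd, u))
    return result

def dijkstra_distance(graph, s, t):
    """Assumed distance oracle: point-to-point Dijkstra expressed via ine()."""
    found = ine(graph, {t}, s, 1)
    return found[0][0] if found else math.inf

def ier(graph, coords, objects, q, k, network_distance):
    """IER: Euclidean NNs are candidates, verified by network distance."""
    def euclid(p):
        return math.dist(coords[q], coords[p])
    # Stand-in for incremental R-tree retrieval: all objects by Euclidean distance.
    stream = sorted(objects, key=euclid)
    candidates = sorted((network_distance(graph, q, p), p) for p in stream[:k])
    for p in stream[k:]:
        d_k = candidates[-1][0]          # Dk: distance to the furthest candidate
        if euclid(p) >= d_k:
            break                        # no remaining object can improve the result
        d = network_distance(graph, q, p)
        if d < d_k:
            candidates[-1] = (d, p)      # replace the furthest candidate
            candidates.sort()
    return candidates

# Hypothetical network: a U-shaped road, so vertex 5 is close to 0 only "as the crow flies".
COORDS = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (2, 1), 4: (1, 1), 5: (0, 1)}
ROADS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
GRAPH = {v: [] for v in COORDS}
for u, v in ROADS:
    w = math.dist(COORDS[u], COORDS[v])
    GRAPH[u].append((v, w))
    GRAPH[v].append((u, w))
OBJECTS = {2, 5}

by_ine = ine(GRAPH, OBJECTS, 0, 1)
by_ier = ier(GRAPH, COORDS, OBJECTS, 0, 1, dijkstra_distance)
```

In this example vertex 5 is the Euclidean nearest neighbor of vertex 0 but a "false hit" by network distance; IER discards it after verifying vertex 2. Passing a faster oracle as `network_distance` changes nothing else in `ier`, which is the property Section 5 relies on.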
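3.3 Distance Browsing

Distance Browsing (DisBrw) [25] uses the Spatially Induced Linkage Cognizance (SILC) index proposed in [26] to answer kNN queries. [26] proposed an incremental kNN algorithm, which DisBrw improves upon by making fewer priority queue insertions.

SILC Index. We first introduce the SILC index used by DisBrw. For a vertex s ∈ V, SILC pre-computes the shortest paths from s to all other vertices. SILC assigns each adjacent vertex of s a unique color. Then, each vertex u ∈ V is assigned the same color as the adjacent vertex v that is passed through in the shortest path from s to u. Figure 1 shows the coloring of the vertices for the vertex s = v6, where each adjacent vertex of v6 is assigned a unique color and the other vertices are colored accordingly. For example, the vertices v9 to v12 have the same color as v8 (blue vertical stripes) because the shortest path from v6 to each of these vertices passes through v8 (for this example assume unit edge weights).

Figure 1: SILC: Coloring Scheme and Quadtree for v6

Observe that the vertices close to each other have the same color, resulting in several contiguous regions of the same color. These regions are indexed by a region quadtree [24] to reduce the storage space. The color of a vertex can be determined by locating the region in the quadtree that contains it. SILC applies the coloring scheme and creates a quadtree for each vertex of the road network. This requires O(|V|^1.5) space in total and, due to the all-pairs shortest path computation, O(|V|^2 log|V|) pre-processing time.

To compute the shortest path from s to t, SILC uses the quadtree of s to identify the color of t. The color of t determines the first vertex v on the shortest path from s to t. To determine the next vertex on the shortest path, this procedure is repeated on the quadtree of v. For example, in Figure 1, the first vertex on the shortest path from v6 to v12 is v8 because v12 has the same color as v8. The color of v12 is found by locating the quadtree block containing v12. The shortest path can be computed in O(m log|V|) where m is the number of edges on the shortest path [25].

kNN Algorithm. To enable kNN search, DisBrw stores additional information in each quadtree. For each vertex v contained in a quadtree block b, it computes the ratio of the Euclidean and network distances between the quadtree owner s and v. It then stores the minimum and maximum ratios, λ− and λ+ respectively, with b. Now, given any vertex t, DisBrw computes a distance interval [δ−,δ+] by multiplying the Euclidean distance from s to t by the λ− and λ+ values of the block containing t. This interval defines a lower and upper bound on the network distance from s to t and can be used to prune objects that cannot be kNNs. The interval is refined by obtaining the next vertex u in the shortest path from s to t (as described earlier), computing an interval for u to t, and then adding the known distance from s to u to the new interval. By refining the interval, it eventually converges to the network distance.

DisBrw used an Object Hierarchy in [25] to avoid computing distance intervals for all objects. The basic idea was to compute distance intervals for regions containing objects, then visit the most promising regions (and recursively sub-regions) first. We found this method did not use the SILC index to its full potential. Instead we retrieve Euclidean NNs as candidate objects for which intervals are then computed. Otherwise, the DisBrw kNN algorithm proceeds exactly as in [25]. We refer the readers to Appendix A.1.1 for full details and experimental comparisons with the original method.

3.4 Route Overlay & Association Directory

The search space of INE can be considerably large depending on the distance to the kth object. Route Overlay and Association Directory (ROAD) [20,21] attempts to remedy this by bypassing regions that do not contain objects by using search space pruning.

An Rnet is a partition of the road network G = (V,E), with every edge in E belonging to at least one Rnet. Thus, an Rnet R represents a set of edges E_R ⊆ E. V_R is the set of vertices that are associated with edges in E_R. To create Rnets, ROAD partitions the road network G into f ≥ 2 Rnets, recursively partitioning resulting Rnets until a hierarchy of l > 1 levels is formed (with G being the root at level 0). Figure 2 shows Rnets (for l = 2) for the graph in our running example. The enclosing boxes and ovals represent the set V_R of each Rnet. Specifically, R1 = {v1, ..., v7} and R2 = {v6, ..., v12} are the child Rnets of the root G. Each of R1 and R2 is further divided into Rnets, e.g., R1 is divided into R1a = {v1,v2,v3,v4} and R1b = {v4,v5,v6,v7}.

Figure 2: ROAD

For an Rnet R, a vertex b ∈ V_R with an adjacent edge e(b,v) ∉ E_R is defined as a border of R. For instance, v4 is a border of R1b but v5 is not. These borders form the set B_R ⊆ V_R, e.g., the border set of R1b consists of v4, v6 and v7. ROAD computes the network distance between every pair of borders bi,bj ∈ B_R in each Rnet and stores each as the shortcut S(bi,bj). Now any shortest path between two vertices s,t ∉ V_R involving a vertex u ∈ V_R must enter R through a border b ∈ B_R and leave through a border b′ ∈ B_R. So if a search reaches a border b ∈ B_R, the shortcuts associated with b, S(b,b′) ∀b′ ∈ B_R, can be traversed to bypass the Rnet R while preserving network distances. For example, in Figure 2, the borders of R1b are v4, v6 and v7 (the colored vertices) and ROAD precomputes the shortcuts between all these borders. Suppose the query vertex is v1 and the search has reached the vertex v4. If it is known that R1b does not contain any object, the algorithm can bypass R1b by quickly expanding the search to other borders of R1b without the need to access any non-border vertex of R1b. E.g., using the shortcut between v4 and v7, the algorithm can compute the distance from v1 to v7 without exploring any vertex in R1b.

Since child Rnets are contained by their parent Rnet, a border b of an Rnet must be a border of some child Rnet at each lower level. For example, v6 in Figure 2 is a border for R1b and its parent R1. This allows the shortcuts to be computed in a bottom-up manner, where shortcuts at level i are computed using those of level i+1, greatly reducing pre-computation cost. Only leaf Rnets require a Dijkstra's search on the original graph G.

ROAD uses a Route Overlay index and an Association Directory to efficiently compute kNNs. Recall that a vertex v may be a border of more than one Rnet. The Route Overlay index stores, for each vertex v, the Rnets for which it is a border along with the shortcut trees of v. The Association Directory provides a means to check whether a given Rnet or vertex contains an object or not. The kNN algorithm proceeds incrementally from the query vertex q in a similar fashion to INE. However, when ROAD expands to a new vertex v, instead of inspecting its neighbors, it consults the Route Overlay and Association Directory to find the highest level Rnet associated with it that does not contain any object. ROAD then relaxes all the shortcuts in this Rnet in a similar way to edges in INE, to bypass it. Of course, when v is not a border of any Rnet or if all Rnets associated with v contain an object, it relaxes the edges of v exactly as in INE. The search terminates when k objects have been found or there are no further vertices to expand.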
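The border and shortcut definitions of Section 3.4 can be illustrated with a small sketch. Everything below is hypothetical: the edge list is not the graph of Figure 2 (its edges are not recoverable from the text) but a toy graph chosen so that the Rnet over vertices 4 to 7 has the same border set as R1b in the example (4, 6 and 7, but not 5). The sketch also assumes that a shortcut is the shortest distance using only the Rnet's own edges, and that each edge is written with the same orientation everywhere.

```python
import heapq
import math

def borders(all_edges, rnet_edges):
    """Vertices of the Rnet that also have an adjacent edge outside the Rnet."""
    inside = set(rnet_edges)
    v_r = {x for edge in rnet_edges for x in edge}
    found = set()
    for u, v in all_edges:
        if (u, v) not in inside:
            found.update(x for x in (u, v) if x in v_r)
    return found

def shortcuts(all_edges, rnet_edges, border_set):
    """Border-to-border distances, restricted to the Rnet's edges (an assumption)."""
    adj = {}
    for u, v in rnet_edges:
        w = all_edges[(u, v)]
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    table = {}
    for b in border_set:
        dist = {b: 0.0}
        queue = [(0.0, b)]
        while queue:
            d, u = heapq.heappop(queue)
            if d > dist[u]:
                continue  # stale entry
            for v, w in adj[u]:
                if d + w < dist.get(v, math.inf):
                    dist[v] = d + w
                    heapq.heappush(queue, (d + w, v))
        for other in border_set:
            if other != b and other in dist:
                table[(b, other)] = dist[other]
    return table

# Hypothetical undirected graph with unit weights: (u, v) -> w(u, v)
EDGES = {(1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0, (4, 5): 1.0,
         (5, 6): 1.0, (5, 7): 1.0, (6, 8): 1.0, (7, 9): 1.0}
R1B = [(4, 5), (5, 6), (5, 7)]   # edges of the hypothetical Rnet

R1B_BORDERS = borders(EDGES, R1B)
R1B_SHORTCUTS = shortcuts(EDGES, R1B, R1B_BORDERS)
```

A search that reaches vertex 4 and knows this Rnet holds no object could relax `R1B_SHORTCUTS[(4, 6)]` and `R1B_SHORTCUTS[(4, 7)]` directly instead of expanding vertex 5.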
3.5 G-tree

G-tree [30,31] also employs graph partitioning to create a tree index that can be used to efficiently compute network distances through a hierarchy of subgraphs. The partitioning occurs in a similar way to that of ROAD, where the input graph G is partitioned into f ≥ 2 subgraphs. Each subgraph is recursively partitioned until it contains no more than τ ≥ 1 vertices. For any subgraph Gi, Vi ⊆ V is defined as the set of road network vertices contained within it. Any vertex b ∈ Vi with an edge e(b,v) where v ∉ Vi is defined as a border of Gi, and all such vertices form the set of borders Bi. Figure 3 shows an example where the colored vertices v5 and v6 are borders for the subgraph G1 = {v1, ..., v6}.

Figure 3: G-tree

The partitioned subgraphs naturally form a tree hierarchy with each node in the G-tree associated with one subgraph. Note that we use node to refer to the G-tree node while vertex refers to road network vertices. Notably, a non-leaf node Gi does not need to store subgraph vertices, but only the set of borders Bi and a distance matrix. For non-leaf nodes, the distance matrix stores the network distance from each child node border to all other child node borders. For leaf nodes, it stores the network distance between each of its borders and the vertices contained in it.

Similar to the bottom-up computation of shortcuts in ROAD, the distance matrix of nodes at tree level i can be efficiently computed by reducing the graph to only consist of borders at level i+1 using the distance matrices of that level. Only leaf nodes require a Dijkstra's search on the original graph. Given a planar graph and optimal partitioning method, G-tree is a height-balanced tree with a space complexity of O(|V| log|V|). The similarities with ROAD are clear. One major difference is that G-tree uses its border-to-border distance matrices to "assemble" shortest path distances by the path through the G-tree hierarchy. We refer the reader to the original paper [30] for the details of the assembly method.

Another key difference is the kNN algorithm. To support efficient kNN queries, G-tree introduces the Occurrence List. Given an object set O, the Occurrence List of a G-tree node Gi lists its children that contain objects, allowing empty nodes to be pruned. The kNN algorithm begins from the leaf node that contains q, using a Dijkstra-like search to retrieve leaf objects. However, we found this leaf search could be further optimised and detail our improved leaf search algorithm in Appendix A.2.1. The algorithm then incrementally traverses the G-tree hierarchy from the source leaf. Elements (nodes or objects) are inserted into a priority queue using their network distances from q. The network distance to a G-tree node is computed using the assembly method by finding its nearest border to q. Queue elements are dequeued in a loop. If the dequeued element is a node, its Occurrence List is used to insert its children (nodes or object vertices) back into the priority queue. If the dequeued element is a vertex, it is guaranteed to be the next nearest object. The search terminates when k objects are dequeued.

A useful property of assembling distances is that, given a path through the G-tree hierarchy, distances can be materialized for already visited G-tree nodes. For example, given a query vertex q and two kNN objects in the same leaf node, after locating one of them, the distances to the borders of this leaf need not be recomputed.
4. DATASETS

Here we describe the datasets used to supply the road network G = (V,E) and set of object vertices O ⊆ V for kNN querying.

4.1 Real Road Networks

We study kNN queries on 10 real-world road network graphs as listed in Table 1. These were created for the 9th DIMACS Challenge [3] from data publicly released by the US Census Bureau. Each network covers all types of roads, including local roads, and contains real edge weights for travel distances and travel times (both are used in our experiments). We also conduct in-depth studies for the United States (US) and North-West US (NW) road networks. The US dataset, covering the entire continental United States, is the largest with 24 million vertices. The NW road network (with 1 million vertices), covering Oregon and Washington, represents queries limited to a smaller region or country. Notably, this is the first time DisBrw has been evaluated on a network with more than 500,000 vertices, previously not possible due to its high pre-processing cost (in terms of both space and time).

Name   Region                 #Vertices     #Edges
DE     Delaware               48,812        119,004
VT     Vermont                95,672        209,288
ME     Maine                  187,315       412,352
CO     Colorado               435,666       1,042,400
NW     North-West US          1,089,933     2,545,844
CA     California & Nevada    1,890,815     4,630,444
E      Eastern US             3,598,623     8,708,058
W      Western US             6,262,104     15,119,284
C      Central US             14,081,816    33,866,826
US     United States          23,947,347    57,708,624

Table 1: Road Network Datasets

4.2 Real and Synthetic Object Sets

We create object sets based on both real-world points of interest (POIs) and synthetic methods as described below.

Real-World POI Sets. We created 8 real-world object sets (listed in Table 2) using data extracted from OpenStreetMap (OSM) [4] for locations of real-world POIs in the United States. Each object set is associated with a particular type of POI, e.g., all fast food outlets. POIs were mapped to road network vertices on both the US and NW road networks using their coordinates. While real POIs can be obtained freely from OSM, it is not a proprietary system. As a result the data quality can vary, e.g., the largest object sets in OSM may not be representative of the true largest object sets and the completeness of POI data may vary between regions. So, in addition to real-world object sets, we generate synthetic sets to make generalizable and repeatable observations for all road networks.

                United States         North-West US
Object Set      Size       Density    Size     Density
Schools         160,525    0.007      4,441    0.004
Parks           69,338     0.003      5,098    0.005
Fast Food       25,069     0.001      1,328    0.001
Post Offices    21,319     0.0009     1,403    0.001
Hospitals       11,417     0.0005     258      0.0002
Hotels          8,742      0.0004     460      0.0004
Universities    3,954      0.0002     95       0.00009
Courthouses     2,161      0.00009    49       0.00005

Table 2: Real-World Object Sets

Uniform Object Sets. A uniform object set is generated by selecting uniformly random vertices from the road network. As these objects are randomly selected road network vertices, they are likely to simulate real POIs, e.g., areas with more vertices have more POIs (e.g., cities) while those with fewer roads have fewer (e.g., rural areas). The density d of an object set is varied from 0.0001 to 1, where d is the ratio of the number of objects |O| to the number of vertices |V| in the road network. High densities can simulate larger object sets which are common occurrences, e.g., ATMs, parking spaces. Low densities correspond to sparsely located POIs, e.g., post offices or restaurants in a particular chain. By decreasing the density we can simulate more difficult queries, as fewer objects imply longer distances and therefore larger search spaces. Uniform objects were used to evaluate G-tree in [30,31].

Clustered Object Sets. While some POIs may be uniformly distributed, other types, such as fast food outlets, occur in clusters. To create such clustered object sets, given a number of clusters |C|, we select |C| central vertices uniformly at random (as above). For each central vertex, we select several vertices (up to a maximum cluster size C_max) in its vicinity, by expanding outwards from it. This distribution was used to evaluate ROAD in [20].

Minimum Object Distance Sets. The worst-case kNN query occurs when the query location is remote. To simulate this we create minimum distance object sets as follows. We choose an approximate centre vertex vc by using the nearest vertex to the Euclidean centre of the road network. We find the furthest vertex vf from vc and set D_max as the network distance from vc to vf. For an object set Ri, i ∈ [1,m], we choose |O| objects such that the network distance from vc to each object in Ri is at least D_max / 2^(m−i+1). For example, for m = 5, the set R1 contains objects within the range (D_max/32, D_max]. Thus we investigate the effect of increasing minimum object distance by comparing query time on Ri with increasing i.
5. IER REVISITED

Network distance computation is a critical part of IER. However, to the best of our knowledge, all existing studies [20,21,23,25] employ Dijkstra's algorithm to compute network distances. Dijkstra's algorithm is not only slow but it must also revisit the same vertices for subsequent network distance computations. Even if Dijkstra's algorithm is suspended and resumed for subsequent Euclidean NNs, this is necessarily no better than INE, which uses Dijkstra-like expansion until kNNs are found.

To understand the true potential of IER, we combined it with several fast techniques. Pruned Highway Labelling (PHL) [7] is amongst the fastest techniques. It boasts fast construction times despite being a labelling method, but has similarly large index sizes. The G-tree assembly-based method mentioned earlier can also compute network distances. Notably, in a similar manner to G-tree's kNN search, the "materialization" property can be used to optimise repeated network distance queries from the same source (as in IER). The Dijkstra-like leaf-search can also be suspended and resumed. This is doubly advantageous for IER, as it becomes more robust to "false hits" (Euclidean NNs that are not real kNNs), especially if they are in the vicinity of a real kNN. We refer to this version of G-tree as MGtree. Finally, we combined IER with Contraction Hierarchies (CH) [13] and Transit Node Routing (TNR) [8] using implementations made available by a recent experimental paper [29]. We use a grid size of 128 for TNR as in [29].

Figure 4: IER Variants (NW, d=0.001, k=10, uniform objects). (a) Varying k; (b) Varying Density. Query time (µs) for Dijk, MGtree, PHL, TNR and CH.

We compare the performance of IER using each method in Figure 4. PHL is the consistent winner, being 4 orders of magnitude faster than Dijkstra and an order of magnitude better than the next fastest method at its peak. G-tree, assisted by materialization, is the next best method. All methods converge with increasing density, as the search space becomes smaller. Note that CH is the technique used to answer local queries in TNR, which explains why TNR and CH are so similar for high densities, as the distances are too small to use transit nodes. At lower densities, transit nodes are used more often, leading to a larger speedup. Given these results, in our main experiments, we include the two fastest versions of IER, i.e., PHL and MGtree. Note that the superiority of PHL and MGtree is also observed for other road networks and object sets.
The worst-case kNN query occurswhenthequerylocationisremote.Tosimulatethiswecre- Giventheaffordabilityofmemory,thecapacitiesavailableand ateminimumdistanceobjectsetsasfollows.Wechooseanapprox- the demand for high performance map-based services, memory- imatecentrevertexv byusingthenearestvertextotheEuclidean residentqueryprocessingisarealisticandoftennecessaryrequire- c centreoftheroadnetwork. Wefindthefurthestvertexv fromv ment. However, we have seen in-memory implementation effi- f c andsetD asthenetworkdistancefromv tov .Foranobject ciency can affect performance to the point that algorithmic effi- max c f set R , i ∈ [1,m], we choose |O| objects such that the network ciencybecomesirrelevant[28]. Firstly, thisidentifiestheneedto i 5 105 104 G G G G Chained Hashing 1A 1B 1 2 Quad. Probing A v2 v2 v3 v4 v5 v6 v v5 v6 v7 v8 µQuery Time (s)111000234 Chained Hashing µQuery Time (s)110023 Array G1 v3 G1 5 101 Quad. ProAbrrinagy 101 v 1 5 10 25 50 0.0001 0.001 0.01 0.1 1 v 6 k Density 4 v (a) Varyingk (b) VaryingDensity GB1 v5 G2 7 Figure6:DistanceMatrixVariants(NW,d=0.001,k=10) v6 v8 toleafborderdistances).Nowwecomputethedistancetoeachbor- (a) G (b) G 1 0 derofG fromv ,byfindingtheminimumdistancethroughone 1 1 Figure5:DistanceMatrices ofG ’sborders.Todothis,foreachofG ’sborders,weiterate 1A 1A overG ’sborders, retrievingdistancematrixvaluesforeachpair understandhowthiscanhappensothatguidelinesforefficientim- 1 (updatingtheminimumwhenasmallerdistanceisfound). Thisis plementation may be developed. Secondly, it implies that some shownbytheshadedcellsinFigure5(a).SimilarlyG anditssib- algorithmsmaypossessintrinsicqualitiesthatmakethemsuperior 1 lingG arethenextnodesinthetreepath,andweagainretrieve in-memory. The utility of the latter cannot be ignored. 
We first 2 distancematrixvaluesbyiteratingovertwolistsofborders.These illustratebothaforementionedpointsusingacasestudyandthen valuesareretrievedfromthematrixoftheLCAnode,G ,andthe outlinetypicalchoicesandourapproachtosettlethem. 0 valuesaccessedareshadedinFigure5(b). 6.1 CaseStudy: G-treeDistanceMatrices As we are iterating over lists (i.e., arrays) of borders, the dis- tancematrixdoesnotneedtobeaccessedinanarbitraryorder,as G-tree’sdistancematricesstorecertainpre-computedgraphdis- weobservedintheG-treeauthors’implementation. Thisismade tances (between borders of sub-graphs), allowing “assembly” of possiblebygroupingthebordersofchildnodesasshowninFigure longerdistancesinapiece-wisemanner.WefirstlydescribetheG- 5andstoringthestartingindexforeachchild’sborders. Addition- treeassemblymethodbelow,thenshowhowtheimplementationof allywecreateanoffsetarrayindicatingthepositionofthenodes’ distancematricescansignificantlyimpactitsperformance. ownbordersinitsdistancematrix.Forexample,theoffsetarrayfor EveryG-treenodehasasetofborders.Fromourrunningexam- G indicatesitsborders(v andv )areatthe3rdand4thindexof pleinFigure3,v andv arebordersofG . Eachnon-leafnode 1 5 6 5 6 1 eachrowinitsdistancematrixshowninFigure5(a).WhileFigure alsohasasetofchildren,forexampleG andG arethechil- 1A 1B 5showsthedistancematrixasa2Darray,itisbestimplemented drenofG . Theseinturnhavetheirownborders,whichwerefer 1 asa1Darray. Thisandthepreviouslydescribedaccessedmethod, toas“childborders”ofG . Adistancematrixstoresthedistances 1 allowallshadedvaluestobeaccessedfromsequentialmemorylo- fromeverychildbordertoeveryotherchildborder.Forexamplefor cations,thusdisplayingexcellentspatiallocality. Thisisshownin G ,itschildbordersarev ,v ,v ,v ,v ,anditsdistancematrixis 1 2 3 4 5 6 Figure5astheshadedcellsareeithercontiguousorverycloseto showninFigure5(a).ButrecallthataborderofaG-treenodemust beingso. 
Spatiallocalitymakesthecodecache-friendly,allowing necessarilybeaborderofachildnode,e.g.,thebordersofG ,v 1 5 the CPU to easily predict and pre-fetch data into cache that will andv ,arealsobordersofG .Thismeansthedistancematrixof 6 1B bereadnext. Otherwisethedatawouldneedtoberetrievedfrom G repeatedlystoressomeborder-to-borderdistancesalreadyinthe 1 memory, which is 20−200× slower than CPU cache (depending distancematrixofG ,aredundancythatcanbecomequitelarge 1B onthelevel). Thiseffectisamplifiedinrealroadnetworksasthey forbiggergraphs. Toavoidthisrepetitionandutilise, ingeneral, containsignificantlylargernumbersofborderspernode. O(1) random retrievals, a practitioner may choose to implement Wecomparethreeimplementationsofdistancematrices,includ- thedistancematrixasahash-table. Thishastheaddedbenefitof ing the 1D array described above and two types of hash-tables: beingabletoretrievedistancesforanytwoarbitraryborders. chainedhashing[12](STLunordered map);andquadraticprob- Givenasourcevertexsandtargett,G-tree’sassemblymethod ing[12](Googledense hash map). InFigure6,chainedhash- firstlydeterminesthetreepaththroughtheG-treehierarchy. This ingisastaggering30timesslowerthanthearray. Whilequadratic isasequenceofG-treenodesstartingfromtheleafnodecontain- probingisanimprovement,itisstillanorderofmagnitudeslower. ingsthroughitsimmediateparentandeachsuccessiveparentnode Hadweusedeitherofthehash-tabletypes,wewouldhaveunfairly uptotheleast-commonancestor(LCA)node. FromtheLCA,the concludedthatG-treewastheworstperformingalgorithm. pathtracesthroughsuccessivechildnodesuntilreachingtheleaf node containing t. The assembly method then computes the dis- CacheMisses(Data) tances from all borders from the ith node in the path, G , to all DistanceMatrix INS i L1 L2 L3 bordersini+1thnode,G . 
Thesetwonodesarenecessarilyei- i+1 ChainedHashing 953B 28.8B 20.5B 13B therbothchildrenoftheLCAorhaveaparent-childrelationship.In QuadraticProbing 1482B 11.2B 7.5B 5.3B eithercasetheparentnode’sdistancematrixcontainsvaluesforall Array 151B 1.5B 0.4B 0.3B border-to-border distances. Assuming we have computed all dis- tancesfromstothebordersofGi,wecomputethedistancestothe Table3:HardwareProfiling:250,000QueriesonNWDataset bordersofG byiteratingovereachborderofG andcomputing i+1 i theminimumdistancethroughthemtoeachborderofG . We investigate the cache efficiency of each implementation in i+1 FromourrunningexampleinFigure3,letv bethesourceand CPUcachemissesateachlevelinbillionsinTable3(alsoshowing 1 v bethetarget.Inthiscasethebeginningofthetreepathwillcon- INS,no.ofinstructionsinbillions)usingperfhardwareprofiling 12 tainthechildnodeG andthenitsparentnodeG . Assumewe of250,000variedqueriesonNW.Chainedhashingusesindirection 1A 1 havecomputedthedistancestothebordersofG (easilydoneby toaccessdata,resultinginpoorlocalityandthehighestnumberof 1A usingthedistancematrixofleafnodeG ,whichstoresleafvertex cachemisses. Quadraticprobingimproveslocalityattheexpense 1A 6 105 105 Parameter Values 1st Cut Settled µQuery Time (s)111000234 1st Cut Settled µQuery Time (s)111100001234 PQueue Graph RSyoDRnaedtehnaNeslktiePitcytOwP(IodOsr)kIss DuEn,iVfoTr1,m,M0,E.c1lR,1,uCe,0sft5.Oee0,rr,1e1Nt,d0o0,W,.Tm20a,50ibCn,1l.5,eA00o2,.b0Ej0.,0dW1is,tCan,cUeS 101 PQueue Graph 100 1 5 10 25 50 0.0001 0.001 0.01 0.1 1 Table4:Parameters(DefaultsinBold) k Density (a) Varyingk (b) VaryingDensity consecutively in this order. The vertices array stores the starting Figure7:INEImprovement(NW,d=0.001,k=10) indexofeachvertex’sadjacencylistinedges,alsoinorder. Now foranyvertexuwecanfindthebeginningofitsadjacencylistin ofmorecostlycollisionresolution,henceitusesmoreinstructions edgesusingvertices[u]anditsendusingvertices[u+1]. Thiscon- than chained hashing. 
However, it cannot achieve better locality tiguityincreasesthelikelihoodofacachehitduringexpansion.We thanstoringdatainanarraysortedintheorderitwillbeaccessed. similarly store ROAD’s shortcuts in a global shortcut array, with Thisorderingmeansthenextvalueweretrievefromthearrayisfar eachshortcuttreenodestoringanoffsettothisarray.Theprinciple morelikelytobeinsomelevelofcache. Unsurprisingly,itsuffers demonstratedhereisthatrecommendeddatastructuresinpaststud- fromthefewestcachemisses.ThisisauniquestrengthofG-tree’s iescannotbeusedverbatim. ItisnecessarytoreplaceIO-oriented distancematricesandshows,whilein-memoryimplementationis datastructurese.g.,wereplacedtheB+-trees,recommendedinthe challenging,itisstillpossibletodesignalgorithmsthatworkwell. originallydisk-basedDisBrwandROAD,withsortedarrays. 6.2 GuidelinesforImplementationChoices 4.Language.C++presentlyallowsmorelow-leveltuning,suchas specifyingthelayoutofdatainmemoryforcachebenefits,making In-memoryimplementationrequirescarefulconsideration,orex- itpreferableinhighperformanceapplications. Implementersmay perimental outcomes can be drastically affected as seen with G- considerotherlanguagessuchasJavaforitsportabilityanddesign tree’s distance matrices and in [28]. Many choices are actually features. ButwhenweimplementedINEwithallaforementioned quitesimple,buttheirsimplicitycanleadtothembeingoverlooked. improvements in Java (Oracle JDK 7), we found it was at least Hereweoutlineseveralchoicesandoptionstodealwiththemtoas- 2×slowerthantheequivalentC++implementation. Onepossible sistfutureimplementers. Toillustratetheimpactofthesechoices reason is that Java does not guarantee contiguity in memory for weprogressivelyimproveafirst-cutin-memoryimplementationof collectionsofobjects. Also,thesameobjectstakeupmorespace INE.EachplotlineinFigure7showstheeffectofoneimproved in Java. Both factors lead to lower cache utilisation, which may choice. 
Eachroughlyhalvesthequerytime,withthefinalimple- penalisemethodsthatarebetterabletoexploitit. mentationofINEbeing6−7×faster. 1. Priority Queues. All methods in our study employ priority 7. EXPERIMENTS queues. Inparticular,INEandROADinvolvemanyqueueopera- tionsandthusrelyontheirefficientimplementation. Binaryheaps 7.1 ExperimentalSetting aremostcommonlyused,butwemustchoosewhethertoallowdu- plicateverticesinthequeueornot. Withoutduplicates,thequeue Environment. We conducted experiments on a 3.2GHz Intel issmallerandqueueoperationsinvolvelesswork. Butthismeans Corei5-4570CPUand32GBRAMrunning64-bitLinux(kernel theheapindexofeachvertexmustbelookeduptoupdatekeyse.g., 4.2). Ourprogramwascompiledwithg++5.2usingtheO3flag, throughahash-table.Ondegree-boundedgraphs,suchasroadnet- and all query algorithms use a single thread. To ensure fairness, works, the number of duplicates is small, and removing them is we used the same subroutines for common tasks between the al- simply not worth the lost locality and increased processing time gorithms whenever possible. We implemented INE, IER, G-tree incurredwithhash-tables. Asaresult,weseea2×improvement andROADfromscratch. WeobtainedtheauthorscodeforG-tree, whenINEisimplementedwithoutdecreasingkeys(seePQueuein whichweusedtofurtherimproveourimplementation,e.g.,byse- Figure7).Notethatweusethisbinaryheapforallmethods. lectingthebetteroptionwhenourchoicesdisagreedwiththeau- 2. SettledVertexContainer. RecallINEandROADmusttrack thors’choiceofdatastructures. ForDistanceBrowsing,wepartly vertices that have been dequeued from their priority queues (i.e., based our SILC index on open-source code from [29], but being settled). Thescalablechoiceistostoreverticesinahash-tableas a shortest path study this implementation did not support kNN theyaresettled. Howeverweobserveanalmost2×improvement, queries. As a result, we implemented the kNN algorithms our- asshownbySettledinFigure7byusingabit-arrayinstead. 
This selves from scratch, modifying the index to support them, taking is despite the need to allocate memory for |V| vertices for each the opportunity to make significant improvements (as discussed query. Thebit-arrayhastheaddedbenefitofoccupying32×less in Section 6 and Appendix A). We used a highly efficient open- space than an integer array, thus fitting more data in cache lines. source implementation of PHL made available by its authors [7]. This does add a constant pre-allocation overhead for each query, Allsourcecodeandscriptstogeneratedatasets,runexperiments, whichisproportionallyhigherforsmallsearchspaces(i.e,forhigh anddrawfigureshavebeenreleasedasopen-source[2]forreaders density). Butthetrade-offisworthitduetothesignificantbenefit toreproduceourresultsorre-useinfuturestudies. onlargersearchspaces(i.e.,lowdensity). IndexParameters. TheperformanceoftheG-treeandROAD 3. GraphRepresentation. Adisk-optimisedgraphdatastructure indexesarehighlydependentonthechoiceofleafcapacityτ (G- wasproposedforINEin[23].Inmainmemory,wemaychooseto tree), hierarchylevelsl (ROAD)andfanoutf (both)[20,21,30]. replaceitwithanarrayofnodeobjects,witheachobjectcontaining Weexperimentallyconfirmedtrendsobservedinthosestudiesand anadjacencylistarray. Howeverbycombiningalladjacencylists computed parameters for new datasets. As such, we use fanout intoasinglearrayweareabletoobtainanother2×speed-up(refer f=4forbothmethods. ForG-treewesetτ to64(DE),128(VT, toGraphinFigure7). Firstly,weassignnumberstoverticesfrom ME,CO),256(NW,CA,E),and512(W,C,US).ForROAD,we 0to|V|−1.Anedgesarraystorestheadjacencylistofeachvertex setlto7(DE),8(VT,ME),9(CO,NW),10(CA,E)and11(W,C, 7 105 104 106 105 Index Size (MB)111111000000012345 RGO1tIr0NAe5DEe DisPBH1rw0L6 107 Construction Time (s)111111000000-101234 105 RGOtrAeDe106 DisPBHrwL 107 µQuery Time (s)111100000123 105 RGOtIrNAeDEe106 IEDIREi-sRPB-HGrwLt 107 Path Cost110045 R1O0A5DIGE VR-tre-eGrett. 
US). We chose values of l for ROAD in accordance with the results reported in [21], which show that query performance of ROAD improves for larger l. Specifically, for each dataset, we increased l until either the query performance did not improve or further partitioning was not possible due to too few vertices in the leaf levels.

Query Variables. Table 4 shows the range of each variable used in our experiments (defaults in bold). Similar to past studies [30], we vary k from 1 to 50 with a default of 10. We used 8 real-world object sets as discussed in Section 4. We vary uniform object set density d from 0.0001 to 1, where d=|O|/|V|, with a default value of 0.001. We choose this default density as it closely matches the typical density for real-world object sets as shown in Table 2. Furthermore, this density creates a large enough search space to reveal interesting performance trends for methods. We vary over 10 real road networks (listed in Table 1) with the median-sized NW and the largest US road networks as defaults. We use distance edge weights in Sections 7.2 and 7.3 for comparison with past studies, and because IER and DisBrw were developed for such graphs. But we repeat experiments on travel times later in Section 7.5.

Query and Object Sets. All query times are averaged over 10,000 queries. For real-world object sets, we tested each set with 10,000 random query vertices. For uniform and clustered object sets, we generate 50 different sets for each density and number of clusters (resp.) combined with 200 random query vertices. For minimum distance object sets (described in Section 4.2), we generated 50 sets for each distance set R_i with i ∈ [1,m]. We also chose 200 random query vertices with distances from the centre vertex in range [0, D_max/2^m) (i.e., vertices closer than R_1) for use with all sets. We use m=6 for NW and m=8 for US to ensure there were enough objects in each set to satisfy the default density 0.001.

7.2 Road Network Index Pre-Processing Cost

Here we measure the construction time and size of the index used by each technique for all road networks in Table 1.

Figure 8: Effect of Road Network Size |V|. (a) Index Size (b) Construction Time

Index Size. Figure 8(a) shows the index size for each algorithm. INE only uses the original graph data structure, so its size can be seen as the lower bound on space. DisBrw could only be built for the first 5 road networks before exceeding our memory capacity. This is not surprising given the O(|V|^1.5) storage complexity. However, in our implementation, we were able to build DisBrw for an index with 1 million vertices (NW) consuming 17GB. PHL also exhibits large indexes, however it can still be built for all but the 2 largest datasets. We note that PHL experiences larger indexes on travel distance graphs because they do not exhibit the prominent hierarchies needed for effective pruning (on travel time graphs we were able to build PHL for all indexes). G-tree consumed less space than ROAD, e.g., for the US dataset G-tree used 2.9GB compared to ROAD's 4.4GB. As explained in past studies [30], ROAD's Route Overlay contains significant redundancy as multiple shortcut trees repeatedly store a subset of the Rnet hierarchy.

Construction Time. Figure 8(b) compares the construction time of each index for increasing network sizes. DisBrw again stands out as its index (SILC [26]) requires an all-pairs shortest path computation. However, the computation of each SILC quadtree is independent and can be easily parallelized. We observed a speed-up factor of very close to 4× with our quad-core CPU using OpenMP. Note that other methods cannot be so easily parallelized. Despite this, DisBrw still required 9 hours on NW: while parallelization is useful, it does not change the asymptotic behaviour. PHL takes longer than G-tree and ROAD but, surprisingly, not significantly so, thanks to pruned labelling [7]. IER's index performance depends on the network distance method it employs (i.e., G-tree or PHL).

Recall that both ROAD and G-tree must partition the road network. Since the network partitioning problem is known to be NP-complete, ROAD and G-tree both employ heuristic algorithms. As both methods require the same type of partitioning we use the same algorithm, the multilevel graph partitioning algorithm [18] used in G-tree. This method uses a much faster variant of the Kernighan-Lin algorithm recommended in ROAD [20]. Consequently, we are able to evaluate ROAD for much larger datasets for the first time, with ROAD being constructed in less than one hour for even the largest dataset (US) containing 24 million vertices. The construction time of ROAD is comparable to G-tree, because both use the same partitioning method, and employ bottom-up methods to compute shortcuts and distance matrices, respectively.

We remark that, while most existing studies have focused on improving query processing time, there is a need to develop algorithms and indexes providing comparable efficiency with a focus on reducing memory usage and construction time.

7.3 Query Performance

We investigated kNN query performance over several variables: road network size, k, density, object distance, clusters, and real-world POIs. Implementations have been optimized according to Section 6. We have applied numerous improvements to each algorithm, as detailed in Appendix A. IER network distances are computed using both PHL [7] (when its index fits in memory) and G-tree with materialization (shown as IER-PHL and IER-Gt, resp.).

7.3.1 Varying Network Size

Figure 9: Effect of Road Network Size |V| (d=0.001, k=10). (a) Query Time (b) G-tree & ROAD Stats (path cost for G-tree and IER-Gt, vertices bypassed for ROAD)

Figure 9(a) shows query times with increasing numbers of road network vertices |V| for all 10 road networks in Table 1 on uniform objects. We observe the consistent superiority of IER-based methods. Figure 9(a) clearly shows the reduced applicability of DisBrw. Even though its performance is close to ROAD, its large index size makes it applicable on only the first 5 datasets.

Surprisingly, G-tree's advantage over ROAD decreases with increasing network size |V|. Recall that ROAD can be seen as an optimisation on INE, where the expansion can bypass object-less regions (i.e., Rnets). Thus ROAD's relative improvement over INE depends on the time saved bypassing Rnets versus additional time spent descending shortcut trees. In general, given the same density, we can expect a similar sized region to contain the same number of objects irrespective of the network size |V|. This explains why INE remains relatively unaffected by |V|. It also means that regions without objects are similarly sized. Although Rnets may grow, the size of the Rnets we do bypass also grows, so ROAD bypasses similar numbers of vertices. So the time saved bypassing regions does not increase greatly. Thus ROAD's query time with increasing |V| mainly depends on the depth of shortcut trees. But the depth is bounded by l, which we know does not increase greatly, and as a result ROAD scales extremely well with increasing |V|.

G-tree's non-materialized distance computation cost is a function of the number of borders of G-tree nodes (i.e., subgraphs) involved in the tree path to another node or object. With increasing network size, a G-tree node at the same depth has more borders and the path cost is consequently higher. Thus, we see G-tree's query time "catch up" to ROAD's on the US dataset. These trends are demonstrated in Figure 9(b). G-tree's path cost (in border-to-border computations) increases while the number of vertices ROAD bypasses remains stable with increasing |V| (note these are not directly comparable).

7.3.2 Varying k

Figure 10: Effect of k (d=0.001). (a) NW Dataset (b) US Dataset

Figures 10(a) and 10(b) show the results for varying k for the NW and US datasets, respectively, on uniform objects. Significantly, IER-PHL is 5× faster than any other method on NW. While PHL could not be constructed for the US dataset for travel distances, IER-Gt takes its place as the fastest method, being twice as fast as G-tree. Interestingly, this is despite both using the same index, also materializing intermediate results, and IER-Gt having the additional overhead of retrieving Euclidean NNs. So this is really an examination of the heuristics used by G-tree. Essentially G-tree visits the closest subgraph (i.e., by one of its borders) while IER-Gt visits the subgraph with the next Euclidean NN. IER-Gt can perform better because its heuristic incorporates an estimate on distances to objects within subgraphs while G-tree does not. Each time G-tree visits a subgraph not containing a kNN it pays a penalty in the cost of non-materialized distance computations. We have seen this cost increases with network size, which explains why the improvement of IER-Gt is greater on the US than on NW. This is verified in Figure 9(b), which shows IER-Gt involves fewer computations than G-tree and the gap increases with network size.

We observe that G-tree outperforms ROAD, DisBrw and INE on NW, with a trend similar to previous studies [30]. INE is the slowest as it visits many vertices. For k = 1 the ROAD, DisBrw and G-tree methods are indistinguishable as a small area is likely to contain the NN. ROAD and DisBrw scale very similarly with k. G-tree scales better than ROAD and DisBrw. As more objects are located, more paths through the G-tree hierarchy are visited, allowing greater numbers of subsequent traversals to be materialized. As explained in Section 7.3.1, we again see G-tree's relative improvement over ROAD decrease in Figure 10(b) for the larger US dataset.

7.3.3 Varying Density

Figure 11: Effect of Density (k=10). (a) NW Dataset (b) US Dataset

We evaluate performance for varying uniform object densities in Figure 11. With increasing density the average distance between objects decreases and in general query times are lower. The rate of improvement for heuristic-based methods (DisBrw, G-tree, IER) is slower because they are less able to distinguish better candidates. For IER this means more false hits, explaining why IER-PHL's query times increase (slightly), as it has no means to re-use previous computations like IER-Gt does. The rate of improvement is higher for expansion-based methods as their search spaces become smaller. ROAD falls behind INE beyond density 0.01, indicating the tipping point at which the time spent traversing shortcut trees exceeds the time saved bypassing Rnets (if any). The query times plateau at high densities on the US dataset for ROAD and INE because they are dominated by the bit-array initialization cost (refer to Section 6.2). G-tree performs well at high densities as more kNNs are found in the source leaf node. In this case it reverts to a Dijkstra-like search (which we improved as in Appendix A.2.1), providing comparable performance to INE and ROAD on NW. G-tree exceeds them on the US as a bit-array is not required due to G-tree's leaf search being limited to at most τ vertices.

7.3.4 Varying Clusters

In this section we evaluate performance on clustered object sets proposed in Section 4.2. Figure 12 shows the query time with increasing numbers of clusters and varying k. In both cases cluster size is at most 5. Figure 12(b) uses an object density of 0.001. As the number of clusters increases the average distance between objects decreases, leading to faster queries. This is analogous to increasing density, thus showing the same trend as for uniform objects. IER-PHL's superiority is again apparent. One difference to uniform objects is that IER-based methods find it more difficult to differentiate between candidates as the number of clusters increases, and query times increase (but not significantly). Similarly in Figure 12(b), as k increases, IER-PHL visits more clusters, causing its performance lead to be slightly smaller than for uniform objects. IER-Gt on the other hand is more robust to this, as it is able to materialize most results. G-tree again performs better than DisBrw and ROAD. Due to clustering, objects in the same cluster will likely be located in the same G-tree leaf node. After finding the first object, G-tree can quickly retrieve other objects without recomputing distances to the leaf node, thus remaining relatively constant.

Figure 12: Effect of Clustered Objects (NW, |C|=0.001, k=10). (a) Varying No. of Clusters (b) Varying k

7.3.5 Varying Minimum Object Distance

Figure 14: Varying Min. Obj. Distance (d=0.001, k=10). (a) NW Dataset (b) US Dataset

Each set R_i in Figure 14 represents an exponentially increasing network distance to the closest object with increasing i, as described in Section 4.2. For the smallest sets, objects still tend to be found further away, as there are fewer closer vertices. However, as distance increases further, we see the effect of "remoteness". INE scales badly due to the increasing search space. IER-based methods scale poorly as the Euclidean lower bound becomes less accurate with increasing network distance. This is particularly noticeable in Figure 14(b) as G-tree eventually overtakes IER-Gt on the US. But IER-PHL still outperforms all methods on NW. DisBrw performs poorly for a similar reason, making many interval refinements. G-tree scales extremely well in both cases: as more paths are visited through the G-tree hierarchy, more computations can be materialized for subsequent traversals.

7.3.6 Real-World Object Sets

Figure 13: Varying Real-World Object Sets (Defaults: k=10). (a) NW Road Network (b) US Road Network. Bar charts of query time (µs) for Court, University, Hospital, Hotel, Fast Food, Post, Park and School object sets.

Varying Object Sets. In Figure 13, we show query times of each technique on typical real-world object sets from Table 2. These are ordered by decreasing size, which is analogous to decreasing density, showing the same trend as in Figure 11. Schools represent the largest object set and all methods are extremely fast, as seen for high density. More typical POIs, like hospitals, are less numerous and show the differences between methods more clearly. Regardless, IER-PHL on NW and IER-Gt on US consistently and significantly outperform other methods on most real-world object sets. Also note query times for G-tree are higher on US than NW for the same sets, confirming our observations in Section 7.3.1.

Figure 15: Varying k for Real-World Objects (NW, k=10). (a) Hospitals (b) Fast Food

Varying k. Figure 15 shows the behaviour of two typically searched POIs, fast food outlets and hospitals, on the NW dataset. Hospitals display a trend similar to that of uniform objects for increasing k, as they tend to be sparse. IER-PHL is again significantly faster than G-tree. Although still fastest, IER-PHL has slightly lower performance for fast food outlets as these tend to appear in clusters where Euclidean distance is less able to distinguish better candidates, similar to synthetic clusters in Figure 12(b). Thus trends observed for equivalent synthetic object sets in previous experiments are also observed for real-world POIs.

7.3.7 Original Settings

A recent experimental comparison [30] used a higher default density of d=0.01. While we choose a more typical default density, we reproduce results using d=0.01 in Figure 16 for varying k and network size. Note that we use the smaller Colorado dataset in Figure 16(a) for direct comparison with [30]. Firstly, all methods compared in [30] now answer queries in less than 1ms. While our CPU is faster, it cannot account for such a large difference. This suggests our implementations are indeed efficient. Secondly, most methods are difficult to differentiate, as such a high density implies a very small search space (i.e., queries are "easy" for all methods).

Figure 16: kNN Queries (CO, d=0.01, k=10). (a) Varying k (b) Varying |V|

7.4 Object Set Index Pre-Processing Cost

The original ROAD paper [20] included pre-processing of a fixed object set in its road network index statistics. But there may be many object sets (e.g., one for each type of restaurant) or objects may need frequent updating (e.g., hotels with vacancies). So we are interested in the performance of individual object indexes over varying size (i.e., density). We evaluate 3 object indexes on the US dataset, namely: R-trees used by IER, Association Directories used by ROAD, and Occurrence Lists used by G-tree. [...] injected at query time. We investigate the index sizes (in KB) in Figure 18(a) to gauge what effect each density has on the total size. The size of the input object set used by INE is the lower bound storage cost. ROAD's object index is smaller than G-tree's because it
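As an illustration of the access pattern discussed in Section 6.1, one assembly step over a distance matrix stored as a 1D array might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names FlatDistanceMatrix and assembleStep are hypothetical, the matrix is assumed to be stored row-major, and the borders of the current and next path nodes are assumed to occupy contiguous index ranges (which is what grouping the child borders is meant to achieve).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical flat distance matrix of one G-tree node with n child borders.
// Entry (i, j) is stored at data[i * n + j] (row-major, one contiguous array).
struct FlatDistanceMatrix {
    std::size_t n;
    std::vector<long long> data;
};

// One assembly step. distToFrom[x] is the already known distance from the
// source to the border stored at row fromBegin + x. The result holds, for each
// border in columns [toBegin, toEnd), the minimum distance reachable through
// any border in rows [fromBegin, fromEnd).
std::vector<long long> assembleStep(const FlatDistanceMatrix& m,
                                    std::size_t fromBegin, std::size_t fromEnd,
                                    std::size_t toBegin, std::size_t toEnd,
                                    const std::vector<long long>& distToFrom) {
    assert(distToFrom.size() == fromEnd - fromBegin);
    const long long kInf = std::numeric_limits<long long>::max();
    std::vector<long long> distToNext(toEnd - toBegin, kInf);
    for (std::size_t i = fromBegin; i < fromEnd; ++i) {
        const long long base = distToFrom[i - fromBegin];
        // Each row segment is read left to right from adjacent memory cells.
        const long long* row = m.data.data() + i * m.n;
        for (std::size_t j = toBegin; j < toEnd; ++j) {
            distToNext[j - toBegin] =
                std::min(distToNext[j - toBegin], base + row[j]);
        }
    }
    return distToNext;
}
```

With a 4×4 matrix whose rows 0-1 belong to the previous node's borders and whose columns 2-3 belong to the next node's borders, calling assembleStep(m, 0, 2, 2, 4, {3, 4}) reads two short contiguous row segments and nothing else. A hash-table keyed on border pairs would return the same minima, but each lookup could land on an unrelated cache line.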

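The first three guidelines of Section 6.2 can be combined in one short sketch of an INE-style expansion. It is an illustration only, under these assumptions: objects are located on vertices, edge weights are non-negative integers, and the names CsrGraph, buildCsr and ineKnn are hypothetical rather than taken from the released code. The graph uses the vertices/edges layout described in guideline 3, the queue allows duplicate entries instead of decreasing keys (guideline 1), and settled vertices are tracked in std::vector<bool>, which is typically bit-packed (guideline 2).

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int target; int weight; };
struct InputEdge { int u, v, w; };  // undirected input edge

// Adjacency list of vertex u is edges[vertices[u]] .. edges[vertices[u+1]-1].
struct CsrGraph {
    std::vector<int> vertices;  // size |V| + 1
    std::vector<Edge> edges;    // all adjacency lists, stored consecutively
    int numVertices() const { return static_cast<int>(vertices.size()) - 1; }
};

CsrGraph buildCsr(int n, const std::vector<InputEdge>& input) {
    CsrGraph g;
    g.vertices.assign(n + 1, 0);
    for (const InputEdge& e : input) { ++g.vertices[e.u + 1]; ++g.vertices[e.v + 1]; }
    for (int u = 0; u < n; ++u) g.vertices[u + 1] += g.vertices[u];  // prefix sums
    g.edges.resize(g.vertices[n]);
    std::vector<int> next(g.vertices.begin(), g.vertices.end() - 1);
    for (const InputEdge& e : input) {
        g.edges[next[e.u]++] = Edge{e.v, e.w};
        g.edges[next[e.v]++] = Edge{e.u, e.w};
    }
    return g;
}

// Returns up to k (vertex, network distance) pairs in increasing distance.
std::vector<std::pair<int, int>> ineKnn(const CsrGraph& g,
                                        const std::vector<bool>& isObject,
                                        int source, int k) {
    typedef std::pair<int, int> Item;  // (distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> queue;
    std::vector<bool> settled(g.numVertices(), false);  // allocated per query
    std::vector<std::pair<int, int>> result;
    queue.push(Item(0, source));
    while (!queue.empty() && static_cast<int>(result.size()) < k) {
        const Item top = queue.top();
        queue.pop();
        const int dist = top.first;
        const int u = top.second;
        if (settled[u]) continue;  // stale duplicate entry, skip it
        settled[u] = true;
        if (isObject[u]) result.push_back(std::make_pair(u, dist));
        for (int e = g.vertices[u]; e < g.vertices[u + 1]; ++e) {
            const Edge& edge = g.edges[e];
            if (!settled[edge.target]) {
                queue.push(Item(dist + edge.weight, edge.target));  // no decrease-key
            }
        }
    }
    return result;
}
```

The per-query allocation of settled is the constant overhead that the text attributes to the bit-array. A production implementation would likely reuse the buffer across queries, which is a design choice beyond what the text specifies.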