PI: a Parallel in-memory skip list based Index

Zhongle Xie*, Qingchao Cai*, H. V. Jagadish+, Beng Chin Ooi*, Weng-Fai Wong*
*National University of Singapore   +University of Michigan
*{xiezhongle, caiqc, ooibc, wongwf}@comp.nus.edu.sg   [email protected]

ABSTRACT
Due to the coarse granularity of data accesses and the heavy use of latches, indices in the B-tree family are not efficient for in-memory databases, especially in the context of today's multi-core architecture. In this paper, we present PI, a Parallel in-memory skip list based Index that lends itself naturally to the parallel and concurrent environment, particularly with non-uniform memory access. In PI, incoming queries are collected and disjointly distributed among multiple threads for processing to avoid the use of latches. For each query, PI traverses the index in a Breadth-First-Search (BFS) manner to find the list node with the matching key, exploiting SIMD processing to speed up the search process. In order for query processing to be latch-free, PI employs a light-weight communication protocol that enables threads to re-distribute the query workload among themselves such that each list node that will be modified as a result of query processing is accessed by exactly one thread. We conducted extensive experiments, and the results show that PI can be up to three times as fast as Masstree, a state-of-the-art B-tree based index.

Keywords
Database index, skip list, B-tree, parallelization

1. INTRODUCTION
DRAM has orders of magnitude higher bandwidth and lower latency than hard disks, or even flash memory for that matter. With exponentially increasing memory sizes and falling prices, it is now frequently possible to accommodate the entire database and its associated indices in memory, thereby completely eliminating the significant overheads of slow disk accesses [15, 19, 20, 36, 37]. Non-volatile memory such as phase-change memory, looming on the horizon, is destined to push the envelope further. Traditional database indices, e.g., B+-trees [12], that were mainly optimized for disk accesses are no longer suitable for in-memory databases, since they may suffer from poor cache utilization due to their hierarchical structure, coarse granularity of data access, and poor parallelism.
The integration of multiple cores into a single CPU chip makes many-core computation a norm today. Ideally, an in-memory database index should scale with the number of on-chip cores to fully unleash the computing power. The B+-tree index, however, is ill-suited for such a parallel environment. Suppose a thread is about to modify an intermediate node in a B-tree index. It must first prevent other concurrent threads from descending into the sub-tree rooted at that node in order to guarantee correctness, thereby forcing serialization among these threads. Worse, if the root node is being updated, none of the other threads can proceed to process queries. Consequently, the traditional B-tree index does not provide the parallelism required to make effective use of the concurrency offered by a many-core environment.
On the other hand, single instruction multiple data (SIMD) processing is now supported by almost all modern processors. It enables performing the same computation, e.g., arithmetic operations and comparisons, on multiple data items simultaneously, holding the promise of significantly reducing the time complexity of computation. However, operating on indices using SIMD requires a non-trivial rethink, since SIMD operations require their operands to be contiguously stored in memory.
The aforementioned issues highlight the need for a new parallelizable in-memory index, and motivate us to re-examine the skip list [32] as a candidate base indexing structure in place of the B+-tree (or B-tree). A skip list employs a probabilistic model to build multiple linked lists such that each linked list consists of nodes selected, according to this model, from the list at the next lower level. As in B+-trees, the search for a query key in a skip list follows a breadth-first traversal in the sense that it starts from the list at the top level and moves forward along this level until a node with a larger key is encountered. Thereupon, the search moves to the next level, and proceeds as it did in the previous level. However, since only one node is touched by the search process at any time, its predecessors at the same level can be accessed by another search thread, which means data access in the skip list has a finer granularity than in the B+-tree, where an intermediate tree node contains multiple keys and must be accessed in its entirety. Moreover, with a relaxation of the structure hierarchy, a skip list can be divided vertically into disjoint partitions, each of which can be individually processed on multi-core systems. Hence, we can expect the skip list to be an efficient and much more suitable indexing technique for concurrent settings.
It is natural to use latches to implement concurrent accesses over a given skip list. Latches, however, are costly. In fact, merely inspecting a latch may require it to be flushed from other caches, and is therefore expensive. Latch modification is even more expensive, as it invalidates the latch replicas located in the caches of other cores and forces the threads running on those cores to re-read the latch for inspection, incurring significant bandwidth cost, especially in a Non-Uniform Memory Access (NUMA) architecture where accessing the memory of remote NUMA nodes can be an order of magnitude slower than accessing that of a local NUMA node.
In this paper, we propose a highly parallel in-memory database index based on a latch-free skip list. We call it PI, a Parallel in-memory skip list based Index. In PI, we employ a fine-grained processing strategy to avoid using latches. Queries are organized into batches, and each batch is processed simultaneously by multiple threads. Given a query, PI traverses the index in a breadth-first manner to find the corresponding list node, using SIMD instructions to accelerate the search process.
To handle the case in which two threads find the same list node for some keys, PI adjusts the query workload among the execution threads to ensure that each list node that will be modified as a result of query processing is accessed by exactly one thread, thereby eliminating the need for latches.
Our main contributions include:
• We propose a latch-free skip list index that shows high performance and scalability. It serves as an alternative to tree-like indices for in-memory databases and suits the many-core concurrent environment due to its high degree of parallelism.
• We use SIMD instructions to accelerate query processing of the index.
• We employ a set of optimization techniques in the proposed index to enhance its performance in a many-core environment.
• We conduct an extensive performance study of PI as well as a comparison between PI and Masstree [29], a state-of-the-art index used in SILO [37]. The results show that PI is able to perform up to more than 3x better than Masstree in terms of query throughput.
The remainder of this paper is structured as follows. Section 2 presents related work. We describe the design and implementation of PI in Sections 3 and 4, respectively, and develop a mathematical model to analyze PI's query-processing performance in Section 5. Section 6 presents the performance study of PI. Section 7 concludes this work.

2. RELATED WORK
2.1 B+-tree
The B+-tree [12] is perhaps the most widely used index in database systems. However, it has two fundamental problems which render it inappropriate for in-memory databases. First, its hierarchical structure leads to poor cache utilization, which in turn seriously restricts its query performance. Second, it does not suit the concurrent environment well due to its coarse granularity of data access and heavy use of latches. To address these drawbacks, exploiting the cache and removing latches have become the core directions for implementing in-memory B+-trees.

2.1.1 Cache Exploitation
Better cache utilization can substantially enhance the query performance of B+-trees, since reading a cache line from the cache is much faster than reading it from memory. Software prefetching is used in [10] to hide the slow memory access. Rao et al. [33] present a cache-sensitive search tree (CSS-tree), where nodes are stored in a contiguous memory area such that the address of each node can be arithmetically computed, eliminating the use of child pointers in each node. This idea is further applied to B+-trees, and the resultant structure, called the CSB+-tree, is able to achieve cache consciousness while still supporting efficient updates. The relationship between node size and cache performance of the CSB+-tree is analyzed in [16]. Masstree [29] is a trie of B+-trees that efficiently handles keys of arbitrary length. With all the optimizations presented in [10, 33, 34] enabled, Masstree can achieve a high query throughput. However, its performance is still restricted by the locks upon which it relies to update records. In addition, Masstree is NUMA-agnostic, as NUMA-aware techniques have been shown to provide little performance gain for it [29]. As a result, expensive remote memory accesses are incurred during query processing.

2.1.2 Latch and Parallelizability
There have been many works trying to improve the performance of the B+-tree by avoiding latches or enhancing parallelizability. The Blink-tree [24] is an early attempt at enhancing the parallelizability of B+-trees: it adds to each node, except the rightmost ones, an additional pointer to its right sibling so that a node being modified does not prevent itself from being read by other concurrent threads. However, Lomet [28] points out that the deletion of nodes in Blink-trees can incur a decrease in performance and concurrency, and addresses this problem through the use of additional state information and latch coupling. Braginsky and Petrank present a lock-free B+-tree implemented with single-word CAS instructions [8]. To achieve a dynamic structure, nodes being split or merged are frozen, but search operations are still allowed to proceed against such frozen nodes.
Due to the overhead incurred by latches, latch-free B+-trees have also attracted attention. Sewall et al. propose a latch-free concurrent B+-tree, named PALM [35], which adopts the bulk synchronous parallel (BSP) model to process queries in batches. Queries in a batch are disjointly distributed among threads to eliminate synchronization among them and thus the use of latches. FAST [21] uses the same model and achieves twice the query throughput of PALM, at the cost of not being able to make updates to the index tree. The Bw-tree [26], developed for Hekaton [15, 25], is another latch-free B-tree; it manages its memory layout in a log-structured manner and is well suited for flash solid state disks (SSDs), where random writes are costlier than sequential ones.
2.2 CAS Instruction and Skip List
Compare-and-swap (CAS) instructions are atomic operations introduced in the concurrent environment to ease the implementation of synchronization primitives such as semaphores and mutexes. Herlihy proposes a sophisticated model to show that CAS instructions can be used to implement wait-free data structures [17]. These instructions are not the only means of realizing concurrent data structures; Brown et al. also present a new set of primitive operations for the same purpose [9].
Skip lists [32] are considered an alternative to B+-trees. Compared with B+-trees, a skip list has approximately the same average search performance but requires much less effort to implement. In particular, even a latch-free implementation, which is notoriously difficult for B+-trees, can be easily achieved for skip lists by using CAS instructions [18]. Crain et al. propose new skip list algorithms [14] to avoid contention on hot spots. Abraham et al. combine skip lists and B-trees for efficient query processing [2]. In addition, skip lists can also be integrated into distributed settings: Aspnes and Shah present the Skip Graph [4] for peer-to-peer systems, a novel data structure leveraging skip lists to support fault tolerance.
As argued earlier, skip lists are more parallelizable than B+-trees because of their fine-grained data access and relaxed structure hierarchy. However, a naive linked-list-based implementation of skip lists has poor cache utilization due to the nature of linked lists. In PI, we address this problem by separating the Index Layer from the Storage Layer such that the layout of the Index Layer is optimized for cache utilization and hence enables an efficient search of keys.
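The latch-free bottom level of a skip list mentioned above is typically built around a CAS loop on the next pointers. The following is a minimal C++ sketch of such an insertion into a sorted linked list using std::atomic; the node layout and helper names are illustrative rather than taken from any of the cited implementations, and deletion handling (e.g., marked pointers) is deliberately omitted.

#include <atomic>

struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k, Node* n = nullptr) : key(k), next(n) {}
};

// Insert `key` into the sorted list headed by `head`, without latches.
// Concurrent inserts are serialized only on the single CAS below.
bool insert(std::atomic<Node*>& head, int key) {
    Node* fresh = new Node(key);
    while (true) {
        // Locate the insertion window (prev.key < key <= curr.key).
        std::atomic<Node*>* prev = &head;
        Node* curr = prev->load();
        while (curr != nullptr && curr->key < key) {
            prev = &curr->next;
            curr = prev->load();
        }
        if (curr != nullptr && curr->key == key) { delete fresh; return false; }
        fresh->next.store(curr);
        // Publish the node; retry from scratch if a concurrent insert won the race.
        if (prev->compare_exchange_weak(curr, fresh)) return true;
    }
}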
2.3 Single Instruction Multiple Data
Single Instruction Multiple Data (SIMD) processing has been extensively used in database research to boost the performance of database operations. Chhugani et al. [11] show how the performance of merge sort, a classical sort algorithm, can be improved when equipped with SIMD. Similarly, it has been shown in [38, 6, 5] that SIMD-based implementations of many database operators, including scan, aggregation, indexing and join, perform much better than their non-SIMD counterparts.
Recently, tree-based in-memory indices leveraging SIMD have been proposed to speed up query processing [21, 35]. FAST [21], a read-only binary tree, can achieve an extremely high query-processing throughput as a consequence of SIMD processing and the enhanced cache consciousness enabled by a carefully designed memory layout of tree nodes. Another representative of SIMD-enabled B+-trees is PALM [35], which overcomes FAST's limitation of not being able to support updates, at the expense of decreased query throughput.

2.4 NUMA-awareness
The NUMA architecture opens up opportunities for optimization in terms of cache coherence and memory access, which can significantly hinder performance if not taken into account in the design [27, 7]. Since accessing the memory affiliated with remote (non-local) NUMA nodes is substantially costlier than accessing local memory, the major direction for NUMA-aware optimization is to reduce accesses to remote memory while keeping the load balanced among NUMA nodes [31]. Many systems have been proposed with NUMA-aware optimizations. ERIS [22] is an in-memory storage engine which employs an adaptive partitioning mechanism to reflect the NUMA topology and hence reduce remote memory accesses. ATraPos [30] further avoids commutative synchronizations among NUMA nodes during transaction processing.
There are also many efforts devoted to optimizing database operations with NUMA-awareness. Albutiu et al. propose a NUMA-aware sort-merge join approach, and leverage prefetching to further enhance its performance [3]. Lang et al. explore various implementation techniques for NUMA-aware hash joins [23]. Li et al. study data shuffling algorithms in the context of the NUMA architecture [7].

3. INDEX DESCRIPTION
In a traditional skip list, since nodes with different heights are dynamically allocated, they do not reside within a contiguous memory area. Non-contiguous storage of nodes causes cache misses during key search and limits the exploitation of SIMD processing, which requires the operands to be stored within a contiguous memory area. We shall elaborate on how PI overcomes these two limitations and meanwhile achieves latch-free query processing.

3.1 Structure
Like a typical skip list, PI consists of multiple levels of sorted linked lists. The bottommost level is a list of data nodes, whose definition is given in Definition 1; an upper-level list is composed of the keys randomly selected with a fixed probability from those contained in the linked list of the next lower level. For ease of exposition, these composing linked lists are logically separated into two layers: the storage layer and the index layer. The storage layer is merely the linked list of the bottommost level, and the index layer is made up of the remaining linked lists.

Definition 1. A data node α is a triplet (κ, p, Γ), where κ is a key, p is the pointer to the value associated with κ, and Γ is the height of κ, representing the number of linked lists in which key κ is present. We say a key κ ∈ S if there exists a data node α in the storage layer of the index, S, such that α.κ = κ.

The difference between a traditional skip list and PI lies in the index layer. In a traditional skip list, a node of a composing linked list of the index layer contains only one key. In contrast, a fixed number of keys are included in a list node of PI's index layer, which we shall call an entry hereafter. The reason for this arrangement of keys is that it enables SIMD processing, which requires operands to be contiguously stored. An instance of PI with four keys per entry is shown in Figure 1, where three different operations are collectively processed by two threads, represented by the purple and green arrows, respectively.
Given an initial dataset, the index layer can be constructed in a bottom-up manner. We only need to scan the storage layer once to fill the high-level entries as well as the associated data structure, i.e., the routing table, which will be discussed in detail in the next section. This construction process is O(n), where n is the number of records at the storage level, and typically takes less than 0.2s for 16M records when running on a 2.0 GHz CPU core. Furthermore, the construction process can be parallelized, and can hence be sped up with more computing resources.
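As a rough illustration of this two-layer organization, the following C++ sketch shows one plausible in-memory layout: data nodes form a linked list, while each index-layer entry packs a SIMD vector's worth of keys and is stored in a per-level contiguous array. All type and field names are assumptions made for illustration, not PI's actual definitions.

#include <cstdint>
#include <vector>

constexpr int M = 4;              // keys per entry: one 128-bit SIMD vector of 32-bit floats

// Storage layer: a singly linked list of data nodes (Definition 1).
struct DataNode {
    float     key;                // κ
    void*     value;              // p, pointer to the associated tuple
    int       height;             // Γ, number of levels in which the key appears
    DataNode* next;               // next node in the storage layer
    bool      deleted;            // tombstone flag used by delete queries
};

// Index layer: entries of one level live in a contiguous array for cache efficiency.
struct Entry {
    float       keys[M];          // M sorted keys, loadable with a single SIMD load
    const void* route[M + 1];     // routing table: destination for each comparison outcome
                                  // (0..M keys <= query key); points to an Entry on a lower
                                  // level, or to a DataNode when leaving the index layer
};

struct PIIndex {
    std::vector<std::vector<Entry>> levels;  // levels[0] is the topmost index level
    DataNode* storage_head = nullptr;        // head of the storage layer
};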
Figure 1: An instance of PI. Three queries (Query 1: Search(8), Query 2: Delete(28), Query 3: Insert(102)) are processed over a four-level index whose entries hold four keys each; the index layer and storage layer are partitioned across two NUMA nodes, and the arrows mark the paths of the two executing threads.

3.2 Queries and Algorithms
PI, like most indexing structures, supports three types of queries, namely search, insert and delete. Their detailed descriptions are as follows, and an abstraction of them is given in Definition 2.
• Search(κ): if there exists a data node α in the index with α.κ = κ, α.p is returned, and null otherwise.
• Insert(κ, p): if there exists a data node α in the index with α.κ = κ, update this data node by replacing α.p with p; otherwise insert a new data node (κ, p, Γ) into the storage layer, where Γ is drawn from a geometric distribution with a specified probability parameter.
• Delete(κ): if there exists a data node α in the index with α.κ = κ, remove this data node from the storage layer and return 1; otherwise return null.

Definition 2. A query, denoted by q, is a triplet (t, κ, [p]), where t and κ are the type and key of q, respectively, and, if t is insert, p provides the pointer to the new value associated with key κ.

We now define the query set Q in Definition 3. There are two points worth mentioning in this definition. First, the queries in a query set are in non-decreasing order of the query key κ; the reason for this ordering is elaborated in Section 3.2.4. Second, a query set Q only contains point queries; we show how such a query set can be constructed and leveraged to answer range queries in Section 3.2.5.

Definition 3. A query set Q is given by Q = {q_i | 1 ≤ i ≤ N}, where N is the number of queries in Q, q_i is a query as defined in Definition 2, and q_i.κ ≤ q_j.κ iff i < j.

Definition 4. For a query q, we define the corresponding interception, I_q, as the data node with the largest key among those in {α | α.Γ > 1, α.κ ≤ q.κ}.

PI accepts a query set as input, and employs a batching technique to process the queries in the input. Generally, batch processing may increase the latency of query processing, as queries may need to be buffered before being processed. However, since batch processing can significantly improve the throughput of query processing, as shown in Section 6.2, the average processing time of queries is not much affected.

Algorithm 1: Query processing
Input: S, PI index; Q, query set; t_1, ..., t_NT, NT threads
Output: R, result set
1  R = ∅;
2  for i = 1 → NT do
3      Q_i = partition(Q, N);
4  /* traverse the index layer to get interceptions */
5  foreach thread t_i do
6      Π_i = traverse(Q_i, S);
7  waitTillAllDone();
8  /* redistribute query workload */
9  for i = 1 → NT do
10     redistribute(Π_i, Q_i, i);
11 /* query execution */
12 foreach thread t_i do
13     R_i = execute(Π_i, Q_i);
14 waitTillAllDone();
15 R = ∪ R_i;
16 return R;

The detailed query processing of PI is given in Algorithm 1. First, the query set Q is evenly partitioned into disjoint subsets according to the number of threads, and the i-th subset is allocated to thread i for processing (line 3). The ordered set of queries allocated to a thread is also a query set as defined in Definition 3, and we call it a query batch in order to differentiate it from the input query set. Each thread traverses the index layer and generates, for each query in its query batch, an interception as defined in Definition 4 (lines 5 and 6). After this search process, the resultant interceptions are leveraged to adjust the query batches among the execution threads such that each thread is able to safely execute all the queries assigned to it after the adjustment (lines 9 and 10). Finally, each thread individually executes the queries in its query batch (lines 12 and 13).
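A minimal threading skeleton for this batch pipeline might look as follows. It assumes a sorted query batch and hypothetical traverse/redistribute/execute phase functions, and it uses plain std::thread joins as the barriers between the three phases; PI's actual synchronization and per-thread message passing (Algorithm 3) are not reproduced here.

#include <thread>
#include <vector>
#include <functional>

struct Query { int type; float key; void* value; };   // t, κ, [p] (Definition 2)

// One phase run by every worker; a Phase sees the worker id and its contiguous sub-batch.
using Phase = std::function<void(int worker, std::vector<Query>& batch)>;

// Mirrors Algorithm 1: partition the sorted query set, then run the traverse,
// redistribute and execute phases with a join (barrier) after each parallel phase.
void process_batch(std::vector<Query>& queries, int num_threads,
                   const Phase& traverse, const Phase& redistribute, const Phase& execute) {
    // Line 3: evenly partition Q into per-thread query batches (queries are pre-sorted).
    std::vector<std::vector<Query>> batches(num_threads);
    size_t chunk = (queries.size() + num_threads - 1) / num_threads;
    for (size_t i = 0; i < queries.size(); ++i)
        batches[i / chunk].push_back(queries[i]);

    auto run_parallel = [&](const Phase& phase) {
        std::vector<std::thread> workers;
        for (int w = 0; w < num_threads; ++w)
            workers.emplace_back(phase, w, std::ref(batches[w]));
        for (auto& t : workers) t.join();              // waitTillAllDone()
    };
    run_parallel(traverse);      // lines 5-7: find interceptions
    run_parallel(redistribute);  // lines 9-10: hand boundary queries to neighbours
    run_parallel(execute);       // lines 12-14: latch-free execution
}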
The whole procedure is exemplified in Figure 1, where three queries making up a query set are collectively processed by two threads. Following the purple arrows, thread 1 traverses downwards to fetch the data node with key 8 and to delete the data node with key 28 in the storage layer, while thread 2 moves along the green arrows to insert the data node with key 102.

3.2.1 Traversing the Index Layer
Algorithm 2 shows how the index layer is traversed to find the interceptions for queries. For each query key, the traversal starts from the top level of the index layer and moves forward along this level until an entry containing a larger key is encountered, upon which it moves on to the next level and proceeds as it did in the previous level. The traversal terminates when it is about to leave the bottom level of the index layer, and records the first data node that will be encountered in the storage layer as the interception for the current query.

Algorithm 2: Traversing the index layer
Input: S, PI index; Q, query batch
Output: Π, interception set
1  Π = ∅;
2  foreach q ∈ Q do
3      vb = _simd_load(q.κ);
4      e_next = getTopEntry(S);
5      while isStorageLayer(e_next) == false do
6          va = _simd_load(e_next);
7          mask = _simd_compare(va, vb);
8          /* R_next is the routing table of e_next */
9          e_next = findNextEntry(e_next, mask, R_next);
10     Π = Π ∪ {e_next};
11 return Π;

We exploit Single Instruction, Multiple Data (SIMD) processing to accelerate the traversal. In particular, multiple keys within an entry can be simultaneously compared with the query key using a single simd_compare instruction, which significantly reduces the number of comparisons; this is the main reason we put multiple keys in each entry. In our implementation, the keys in an entry exactly occupy a whole SIMD vector, and can be loaded into a SIMD register using a single simd_load instruction.
Since each simd_compare instruction compares multiple keys in an entry, the comparison result can take several forms, and hence an efficient way to determine the next entry to visit for each comparison result is needed. To this end, we generate a routing table for each entry during the construction of PI. For each possible result of the SIMD comparison, the routing table contains the address of the next entry to visit. Figure 2(a) gives an example of a SIMD comparison: each comparison of the entry's four 32-bit keys a_1..a_4 against the query key yields a 4-bit mask (m_i = (a_i >= b_i) ? 1 : 0), which can take five possible values, indicated by the red arrows in Figure 2(c). This mask is then used to index into the routing table, shown in Figure 2(b), to find the next entry for comparison.

Figure 2: Querying with a routing table. (a) SIMD comparison of an entry's four 32-bit keys against the query key, producing a 4-bit mask; (b) the routing table for Entry 1; (c) the routing process for Entry 1, where each of the five possible masks (0000, 1000, 1100, 1110, 1111) leads to a different next entry.
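One plausible realization of this SIMD comparison plus routing-table lookup, using SSE intrinsics on four 32-bit float keys, is sketched below. The Entry layout and the orientation of the mask bits are assumptions rather than PI's exact encoding; the point is that a single compare plus movemask replaces four scalar comparisons and directly indexes the routing table.

#include <immintrin.h>

struct Entry {
    alignas(16) float keys[4];   // sorted keys, filling one 128-bit SIMD vector
    const void* route[16];       // next entry (or data node) per 4-bit mask; only 5 masks occur
};

// Returns the routing-table destination to follow for `query_key` within entry `e`.
inline const void* next_hop(const Entry& e, float query_key) {
    __m128 keys  = _mm_load_ps(e.keys);          // load the 4 entry keys
    __m128 probe = _mm_set1_ps(query_key);       // broadcast the query key
    // Per-lane "entry key <= query key": because the keys ascend, the lanes that pass
    // form a contiguous run, so only 5 of the 16 mask values can actually occur.
    __m128 cmp   = _mm_cmple_ps(keys, probe);
    int mask     = _mm_movemask_ps(cmp);         // 4-bit result, one bit per lane
    return e.route[mask];
}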
3.2.2 Redistribute Query Workload
Given the interception set output by Algorithm 2, a thread can find, for each allocated query q, the data node with the largest key that is less than or equal to q.κ by walking along the storage layer, starting from I_q. However, it is possible that two queries allocated to two adjacent threads have the same interception, leading to contention between the two threads. To handle this case, we slightly adjust the query workload among the execution threads such that each interception is exclusively accessed by a single thread, and so are the data nodes between two adjacent interceptions. To this end, each thread iterates backward over its interception set until it finds an interception different from the first interception of the next thread, and hands over the queries corresponding to the iterated interceptions to the next thread. The details of this process are summarized in Algorithm 3 and exemplified in Figure 3. After the adjustment, a thread can individually execute the allocated query workload without contending for data nodes with other threads.

Algorithm 3: Redistribute query workload
Input: Q = {q_1, q_2, ...}, query batch; Π = {I_q1, I_q2, ...}, interception set; T_last, last execution thread; T_next, next execution thread
1  /* exchange the key of the first interception */
2  sendKey(T_last, I_q1.κ);
3  κ = recvKey(T_next);
4  /* wait for the adjustment from the last thread */
5  Q' = recvQuery(T_last), Q = Q' ∪ Q;
6  foreach q ∈ Q' do
7      /* I_q is the same as I_q1 */
8      Π = I_q1 ∪ Π;
9  Q' = ∅;
10 for i = |Q| → 1 do
11     if I_qi.κ = κ then
12         Q' = Q' ∪ {q_i}, Q = Q \ {q_i}, Π = Π \ {I_qi};
13     else
14         break;
15 sendQuery(T_next, Q');

Figure 3: Interception adjustment. Thread i holds queries S(90), S(92), I(99), S(100) with interceptions at storage-layer nodes 90, 90, 97 and 100, and thread j holds I(102), D(102), S(112), S(123) with interceptions 100, 100, 112 and 112; the query S(100), whose interception equals the first interception of thread j, is handed over to thread j (S: search, I: insert, D: delete).

3.2.3 Query Execution
The query execution process at each thread is demonstrated in Algorithm 4. For each query q, an execution thread iterates over the storage layer, starting from the corresponding interception, and executes the query against the data node with the largest key that is less than or equal to q.κ. If the query type is delete, we do not remove the data node immediately from the storage layer, but merely set a flag F_del instead, which is necessary for latch-free query processing, as we shall explain in Section 3.2.4.

Algorithm 4: Query execution
Input: Q = {q_1, q_2, ...}, adjusted query batch; Π = {I_q1, I_q2, ...}, adjusted interception set
Output: R, result set
1  R = ∅;
2  for i = 1 → |Q| do
3      /* walk along the storage layer, */
4      /* starting from the corresponding interception */
5      node = getNode(q_i, I_qi);
6      r_i = null;
7      if q_i.t == "Search" then
8          if node.κ == q_i.κ && !node.F_del then
9              r_i = node.p;
10     else if q_i.t == "Insert" then
11         if node.κ == q_i.κ then
12             node.p = q_i.p;
13         else
14             node = insertNode(node, q_i.κ, q_i.p, Γ);
15         node.F_del = false;
16     else
17         if node.κ == q_i.κ then
18             node.F_del = true;
19     R = R ∪ {r_i};
20 return R;
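In C++, the per-node execution step of Algorithm 4 could look roughly like the sketch below; the DataNode fields, the geometric draw for Γ and the in-place linking of a new node are illustrative assumptions, not PI's actual code.

#include <random>

struct DataNode {
    float key; void* value; int height; DataNode* next; bool deleted;
};

enum class Op { Search, Insert, Delete };
struct Query { Op type; float key; void* value; };

// Draw Γ for a new node: height h occurs with probability P^(h-1) * (1-P).
int random_height(std::mt19937& rng, double p = 0.25) {
    std::geometric_distribution<int> g(1.0 - p);
    return 1 + g(rng);
}

// `node` is the data node with the largest key <= q.key, reached from the interception.
// Returns the value pointer for a successful search, nullptr otherwise.
void* execute_one(const Query& q, DataNode* node, std::mt19937& rng) {
    switch (q.type) {
    case Op::Search:
        return (node->key == q.key && !node->deleted) ? node->value : nullptr;
    case Op::Insert:
        if (node->key == q.key) {                 // key exists: overwrite the value
            node->value = q.value;
        } else {                                   // otherwise link a fresh node after it
            node->next = new DataNode{q.key, q.value, random_height(rng),
                                      node->next, false};
            node = node->next;
        }
        node->deleted = false;                     // an insert revives a tombstoned key
        return nullptr;
    case Op::Delete:
        if (node->key == q.key) node->deleted = true;   // tombstone only; no unlink
        return nullptr;
    }
    return nullptr;
}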
For a search query, the F_del flag of the resultant data node is checked to decide the validity of its pointer. For an insert query, a new data node is inserted into the storage layer if the query key does not match that of the resultant data node. Unlike a typical skip list, PI only allocates a random height for the new node, but does not update the index layer immediately. With more and more updates made to the storage layer, the index layer should be updated accordingly to guarantee the performance of query processing. Currently, a background process monitors the updates to the storage layer, and asynchronously rebuilds the whole index layer when the number of updates exceeds a certain threshold. The details are given in Section 4.3.5.

3.2.4 Naturally Latch-free Processing
It is easy to see that query processing in PI is latch-free. In the traversal of the index layer, since the access to each entry is read-only, the use of latches can be avoided at this stage. For the adjustment of query workload, each thread communicates with its adjacent threads via messages and thus does not rely on latches. In addition, each query q allocated to thread i after the adjustment of query workload satisfies I^i_q1.κ ≤ q.κ < I^(i+1)_q1.κ, where I^i_q1 and I^(i+1)_q1 are the first elements in the interception sets of threads i and i+1, respectively. Consequently, thread i can individually execute without latches all its queries except those which require reading I^(i+1)_q1 or inserting a new node directly before I^(i+1)_q1, since the data nodes that will be accessed during the execution of these queries will never be accessed by other threads. The remaining queries can still be executed without latches, as the first interception of each thread will never be deleted, as described in Section 3.2.3.
In our algorithm, a query set Q is ordered mainly for two reasons. First, cache utilization can be improved by processing ordered queries in Algorithm 1, since the entries and/or data nodes to be accessed for a query may have already been loaded into the cache during the processing of previous queries. Second, a sorted query set leads to sorted interception sets (ordered by query key), which is necessary for interception adjustment and query execution to work as expected.

3.2.5 Range Query
Range queries are supported in PI. Given a set of range queries (a normal point query as defined in Definition 2 is also a range query whose upper and lower bounds of the key range are the same), we first sort them according to the lower bound of their key range and then construct a query set as defined in Definition 3 using these lower bounds. This query set is then distributed among a set of threads to find the corresponding interceptions, as in the case of point queries. The redistribution of query workload, however, is slightly different. Denote by I^(i+1)_q1 the first element in the interception set of thread i+1, i.e., the interception corresponding to the first query in the query batch of thread i+1. For each allocated query with a key range of [κ_s, κ_e], where κ_e ≥ I^(i+1)_q1.κ, thread i partitions it into two queries with the key ranges [κ_s, I^(i+1)_q1.κ) and [I^(i+1)_q1.κ, κ_e], respectively, and hands over the second query to thread i+1. As in Algorithm 3, this redistribution process must be performed in the order of thread id, in order to handle the case where the key range of a range query is only covered by the union of the interception sets of three or more threads. After the redistribution of query workload, each thread then executes the allocated queries one by one, which is quite straightforward. Starting from the corresponding interception, PI iterates over the storage layer to find the first data node within the key range, and then executes the query upon it and each following data node until the upper bound of the key range is encountered. The final result of an original range query can be acquired by combining the results of the corresponding partitioned queries.
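The per-thread splitting rule for range queries can be written compactly. Below is a small hedged C++ sketch in which split_key stands for I^(i+1)_q1.κ of the next thread; the RangeQuery type is invented for illustration and uses half-open ranges for simplicity.

#include <optional>

struct RangeQuery { float lo; float hi; };   // half-open range [lo, hi)

// If the query reaches into the next thread's territory (hi > split_key),
// keep [lo, split_key) locally and return [split_key, hi) to be handed over.
std::optional<RangeQuery> split_for_next_thread(RangeQuery& q, float split_key) {
    if (q.hi <= split_key) return std::nullopt;      // fully local, nothing to hand over
    RangeQuery handover{split_key, q.hi};
    q.hi = split_key;                                // local part becomes [lo, split_key)
    return handover;
}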
4. IMPLEMENTATION
4.1 Storage Layout
As we have mentioned, PI logically consists of the index layer and the storage layer. The index layer further comprises multiple levels, each containing the keys appearing at that level. Keys at the same level are organized into entries to exploit SIMD processing, and each entry is associated with a routing table to guide the traversal of the index layer. For better cache utilization, entries of the same level are stored in a contiguous memory area. The storage layer of PI is implemented as a typical linked list of data nodes to support efficient insertions.
Since the entries contained in each level of the index layer are stored compactly in a contiguous memory area, PI cannot immediately update the index layer upon the insertion or deletion of a data node with height h > 1. Currently, we implement a simple strategy to realize these deferred updates. Once the number of insertions and deletions exceeds a certain threshold, the entire index layer is rebuilt from the storage layer in a bottom-up manner. Although this strategy may seem time-consuming, it is highly parallelizable, and the rebuilding process can thus be shortened by using more threads. Specifically, each thread can be assigned a disjoint portion of the storage layer and made responsible for constructing the corresponding part of the index layer. The final index layer can then be obtained by simply concatenating these parts level by level.

4.2 Parallelization and Serializability
We focus on two kinds of parallelization in the implementation, i.e., data-level parallelization and scale-up parallelization. It is also worth noting that serializability must be guaranteed in the parallel process.
Data-level parallelization is realized through the use of SIMD instructions during the traversal of the index layer. As mentioned before, in our implementation each entry contains multiple keys, and their comparison with the query key can be done using a single SIMD comparison instruction, which substantially accelerates the search process for interceptions. Moreover, SIMD instructions can also be used in sorting the query set Q to improve performance. In our implementation, we use the Intel Intrinsics library to implement the SIMD-related functions.
We exploit the scale-up parallelization provided by multi-core systems by distributing the query workload among different cores such that the execution thread running at each core can independently process the queries assigned to it. Serializability issues may arise as a result of the coexistence of search and update queries in one batch, and we completely eliminate these issues by ensuring that only one thread takes charge of each data node at any time.
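A hedged sketch of the bottom-up, parallel rebuild described above: each worker scans a disjoint slice of the storage layer and collects, for every level, the keys tall enough to appear there; the per-worker pieces are then concatenated level by level. The flat snapshot of node pointers and the reduction to bare keys (rather than full entries with routing tables) are simplifications.

#include <algorithm>
#include <thread>
#include <vector>

struct DataNode { float key; void* value; int height; DataNode* next; bool deleted; };

// levels[l] holds the keys of index-layer level l (level 0 sits just above the storage layer).
using IndexKeys = std::vector<std::vector<float>>;

IndexKeys rebuild_index_keys(const std::vector<DataNode*>& nodes,  // snapshot of the storage layer
                             int max_height, int num_threads) {
    std::vector<IndexKeys> parts(num_threads, IndexKeys(max_height - 1));
    std::vector<std::thread> workers;
    size_t chunk = (nodes.size() + num_threads - 1) / num_threads;

    for (int w = 0; w < num_threads; ++w) {
        workers.emplace_back([&, w] {
            size_t begin = w * chunk, end = std::min(nodes.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i) {
                const DataNode* n = nodes[i];
                if (n->deleted) continue;                       // skip tombstoned nodes
                for (int lvl = 0; lvl < n->height - 1; ++lvl)   // a key appears in Γ-1 index levels
                    parts[w][lvl].push_back(n->key);
            }
        });
    }
    for (auto& t : workers) t.join();

    IndexKeys merged(max_height - 1);
    for (int lvl = 0; lvl < max_height - 1; ++lvl)              // concatenate the parts level by level
        for (int w = 0; w < num_threads; ++w)
            merged[lvl].insert(merged[lvl].end(), parts[w][lvl].begin(), parts[w][lvl].end());
    return merged;
}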
4.3 Optimization
4.3.1 NUMA-aware Optimization
The hierarchical structure of a modern memory system, such as multiple cache levels and the NUMA (Non-Uniform Memory Access) architecture, should be taken into consideration during query processing. In our implementation, we organize incoming queries into batches, and process the queries batch by batch. The queries within one batch are sorted before processing. In this manner, cache locality can be effectively exploited, as the search process for a query key is likely to traverse the entries/data nodes that have just been accessed during the search for the previous query and have thus already been loaded into the cache.
A NUMA architecture is commonly used to enable the performance of a multi-core system to scale with the number of processors/cores. In a NUMA architecture, there are multiple interconnected NUMA nodes, each with several processors and its own memory. For each processor, accessing local memory residing in the same NUMA node is much faster than accessing the remote memory of other NUMA nodes. It is thus extremely important for a system running over a NUMA architecture to reduce or even eliminate remote memory accesses for better performance.
PI evenly distributes data nodes among the available NUMA nodes such that the key ranges corresponding to the data nodes allocated to each NUMA node are disjoint. One or more threads are then spawned at each NUMA node to build the index from the data nodes of the same NUMA node, during which only local memory accesses are incurred. As a result, each NUMA node holds a separate index, which can be used to independently answer the queries falling into the range of its key set. Each incoming query is routed to the corresponding NUMA node, and processed by the threads spawned in that node. In this way, there is no remote memory access during query processing, which translates into significantly enhanced query throughput, as shown in the experiments. Moreover, the indices at different NUMA nodes also collectively improve the degree of parallelism, since they can be used to independently answer queries. This idea of parallelism is illustrated in Figure 1, where two threads are spawned in the first NUMA node, and the other nodes can have multiple threads running in parallel as well.

4.3.2 Load Balancing
It can be inferred from Algorithm 3 that PI ensures that the data nodes between two adjacent interceptions are handled by the same thread. As long as the queried keys of a batch are not limited to a very short range, PI is able to distribute the query workload evenly among multiple threads (within a NUMA node), owing to the fact that the queries of a batch are sorted and to the way we partition the queries. However, it is more likely that the load among multiple NUMA nodes is unbalanced, which occurs when most incoming queries should be answered by the data nodes located at a single NUMA node. To address this problem, we use a self-adjusted threading mechanism. In particular, when a query batch has been partitioned and each partitioned sub-query set has been assigned to the corresponding NUMA node, we spawn threads in each NUMA node to process its allocated queries such that the number of threads in each NUMA node is proportional to the number of queries assigned to it. Consequently, a NUMA node that has been allocated more queries will also spawn more threads to process them. An example and a further explanation of the threading mechanism are given in Section 4.3.3.
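The proportional thread allocation could be computed roughly as below; the capping at a per-node hardware-thread budget anticipates the overflow case discussed in Section 4.3.3, and all names are illustrative.

#include <algorithm>
#include <vector>

// Decide how many worker threads each NUMA node gets for the current batch,
// proportionally to the number of queries routed to it, within a global budget.
std::vector<int> threads_per_node(const std::vector<size_t>& queries_per_node,
                                  int total_threads, int hw_threads_per_node) {
    size_t total_queries = 0;
    for (size_t q : queries_per_node) total_queries += q;

    std::vector<int> alloc(queries_per_node.size(), 0);
    int assigned = 0;
    for (size_t i = 0; i < queries_per_node.size(); ++i) {
        if (queries_per_node[i] == 0) continue;
        int share = static_cast<int>((queries_per_node[i] * total_threads) /
                                     std::max<size_t>(total_queries, 1));
        alloc[i] = std::clamp(share, 1, hw_threads_per_node);   // at least one thread, at most the node's cores
        assigned += alloc[i];
    }
    // Hand any remaining budget to the busiest nodes that still have spare hardware threads.
    while (assigned < total_threads) {
        size_t best = alloc.size();
        for (size_t i = 0; i < alloc.size(); ++i)
            if (alloc[i] < hw_threads_per_node && queries_per_node[i] > 0 &&
                (best == alloc.size() || queries_per_node[i] > queries_per_node[best]))
                best = i;
        if (best == alloc.size()) break;    // every active node is saturated
        ++alloc[best];
        ++assigned;
    }
    return alloc;
}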
4.3.3 Self-adjusted Threading
In this section, we elaborate on our self-adjusted threading mechanism, which allocates threads among NUMA nodes for query processing such that query performance can be maximized within a given budget of computing resources. As mentioned in Section 4.3.2, if there are several NUMA nodes available, PI allocates the data nodes among them, builds a separate index in each NUMA node from its allocated data nodes, and routes arriving queries to the NUMA node with the matching key range so as not to incur expensive remote memory accesses. Given the query workload at each NUMA node, our mechanism dynamically allocates execution threads among the NUMA nodes such that the number of threads running on each NUMA node is proportional to its query workload, i.e., the number of queries routed to it. In the case where a NUMA node has used up its hardware threads, our mechanism offloads part of its query workload to other NUMA nodes with available computing resources.

Figure 4: Self-adjusted threading. (a) Under a uniform query workload, the threads are spread evenly over the two NUMA nodes; (b) under a skewed workload, the node receiving more queries spawns more threads.

4.3.4 Group Query Processing and Prefetching
Since entries at the same level of the index are stored in a contiguous memory area, it is more convenient (due to fewer translation lookaside buffer misses) to fetch entries of the same level than to fetch entries located at different levels. In order to exploit this locality, instead of processing queries one by one, PI organizes the queries of a batch into multiple query groups, and processes all queries in a group simultaneously. At each level of the index layer, PI traverses along this level to find, for each query in the group, the first entry at the next level to compare with, and then moves down to the next lower level to repeat this process.
Moreover, this way of group query processing naturally calls for the use of prefetching. When the entry to be compared with at the next level has been located for a query, PI issues a prefetch instruction for this entry before turning to the next query. Therefore, the requested entries at the next level may have already been loaded into the L1 cache before the comparison between them and the queried keys, thereby hiding the slow memory latency.

4.3.5 Background Updating
We use a daemon thread running in the background to update the index layer. If the number of update operations meets a predefined threshold, the daemon thread starts rebuilding the index layer. The rebuilding of the index layer is fairly straightforward: the daemon thread traverses the storage layer, puts the key of each valid data node with height > 1 encountered into an array sequentially, and updates the associated routing table with the address of this data node. The new index layer is put into use after all running threads complete the processing of the current query batch, and the old index layer is then discarded.
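A hedged sketch of group processing with software prefetching: every query in the group advances one hop and prefetches its next target before the group moves on, so the memory latency of one query's fetch overlaps the work done for the others. The Entry type and the next_hop and reached_storage_layer helpers are the illustrative pieces from the earlier sketches, assumed here rather than defined.

#include <immintrin.h>
#include <vector>

struct Entry { float keys[4]; const void* route[16]; };

// From the Section 3.2.1 sketch: SIMD-compare the entry's keys against `key`
// and return the routing-table destination (next entry, or a data node).
const void* next_hop(const Entry& e, float key);
bool reached_storage_layer(const void* p);          // assumed helper

// Advance a group of queries through the index layer in lock step, prefetching
// each query's next target before turning to the next query in the group.
std::vector<const void*> traverse_group(const std::vector<float>& keys, const Entry* top) {
    std::vector<const void*> cursor(keys.size(), top);
    bool all_done = false;
    while (!all_done) {
        all_done = true;
        for (size_t i = 0; i < keys.size(); ++i) {
            if (reached_storage_layer(cursor[i])) continue;      // this query already has its interception
            cursor[i] = next_hop(*static_cast<const Entry*>(cursor[i]), keys[i]);
            // Pull the next target towards L1 while the other queries are processed.
            _mm_prefetch(static_cast<const char*>(cursor[i]), _MM_HINT_T0);
            all_done = false;
        }
    }
    return cursor;   // interceptions into the storage layer
}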
5. PERFORMANCE MODELING
In this section, we develop a performance model for the query processing of PI. The symbols used in our analysis are summarized in Table 1.

Table 1: Notations for Analysis
Symbol | Description
H      | index height
P      | probability parameter
M      | number of keys contained in an entry
L      | memory access latency
N      | number of the initial data nodes
R      | ratio of insert queries
S_e    | entry size
S_n    | data node size
S_l    | size of a cache line
S_c    | size of the last level cache
T_c    | time to read a cache line from the last level cache

5.1 Key Search
The key search process locates the data node with the matching key in the storage layer. The time spent in this process is dominated by the number of entries and data nodes that need to be read for key comparison.
Given the number of initial data nodes, N, the height of PI, H, is about ⌈−log_P N⌉. At each level of PI, the number of keys between two adjacent keys that also appear at a higher level follows a geometric distribution, and has an expected value of 1/P. Therefore, the average number of entries that need to be compared with at each level of the index layer is approximately ⌈(1+P)/(2PM)⌉, where (1+P)/(2P) is the average number of keys that need to be compared with. The number of cache lines that need to be read at each level is thus ⌈S_e/S_l⌉⌈(1+P)/(2PM)⌉. Consequently, the total number of cache lines read during the traversal of the index layer is (H−1)⌈S_e/S_l⌉⌈(1+P)/(2PM)⌉. This is, however, a slight over-estimate, since the top levels of the index layer may already have been read into the L1 cache.
When the interception has been located in the storage layer for a given query, the search process proceeds by iterating over the data nodes, from the one contained in the interception until the node with the matching key is encountered. The number of data nodes scanned during this phase is about half of the number of data nodes contained between two adjacent nodes whose keys appear in the index layer, which is given by ⌈(1+P)/(2P)⌉, and the number of cache lines read during this stage is ⌈S_n/S_l⌉⌈(1+P)/(2P)⌉.
The number of data nodes read during the scan of the storage layer is not affected by delete queries because of PI's query processing strategy. However, insert queries do impact this number. Assume insert queries are uniformly distributed over the storage layer, and that the aggregate number of insert and delete queries does not exceed the threshold upon which the rebuilding of the index layer is triggered. The number of data nodes scanned during key search then becomes (1 + iR/N)(1+P)/(2P), where i is the number of queries that have been processed, and R is the ratio of insert queries among the processed i queries.
In our implementation, the probability of key elevation, P, is 0.25 as suggested in [32], and each entry contains 4 keys, each being a 32-bit float number. Therefore, the size of each entry, S_e, is 4 * (4 + 8) = 48 bytes, where the additional 8 bytes per key are required by the routing table to determine the next entry to compare. Each data node occupies 20 bytes: 4 bytes are used for the key, and the other 16 bytes compose two pointers, one pointing to the value, e.g., a tuple in a database table, and the other to the next data node. Consider an instance of PI with 512K keys. Its size can be computed as 20 * 512K + 1/3 * 48 * 512K = 18M, and the whole index can be kept in the last level cache of most servers (e.g., the one we use for the experiments). Therefore, the processing of each search/delete query fetches about 12 cache lines, 9 for the index layer and 3 for the storage layer. The total time cost is thus 12T_c, where T_c is the time to read a cache line from the last level cache.
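Plugging the default parameters into the model reproduces the 12-cache-line estimate; the following LaTeX snippet spells out that arithmetic (H follows from N = 512K and P = 0.25).

\begin{align*}
H &\approx \lceil -\log_{P} N \rceil = \lceil \log_{4} 2^{19} \rceil = \lceil 9.5 \rceil = 10,\\
\text{index layer} &: (H-1)\left\lceil \tfrac{S_e}{S_l} \right\rceil \left\lceil \tfrac{1+P}{2PM} \right\rceil
  = 9 \cdot \left\lceil \tfrac{48}{64} \right\rceil \cdot \left\lceil \tfrac{1.25}{2} \right\rceil = 9 \text{ cache lines},\\
\text{storage layer} &: \left\lceil \tfrac{S_n}{S_l} \right\rceil \left\lceil \tfrac{1+P}{2P} \right\rceil
  = \left\lceil \tfrac{20}{64} \right\rceil \cdot \left\lceil \tfrac{1.25}{0.5} \right\rceil = 3 \text{ cache lines},\\
\text{total} &: 9 + 3 = 12 \text{ cache lines} \;\Rightarrow\; \text{cost} \approx 12\,T_c.
\end{align*}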
5.2 Rebuilding the Index Layer
For the sake of simplicity, we only focus on indices with a large number of data nodes, in which case the data nodes are unlikely to be cached and hence have to be read from memory during the rebuilding of the index layer. In addition, a data node occupies 48 bytes, as mentioned in the last section, and hence can be fetched within a single memory access, which normally brings a 64-byte cache line into the cache. Therefore, for an index with N data nodes, scanning the storage layer costs a time of NL. In addition, there are NP/(1−P) entries and routing tables that need to be written back into memory, which costs another 2NP/(1−P) memory accesses. Therefore, the total time of rebuilding the index layer can be approximated by (1+P)NL/(1−P). However, with more threads participating in the rebuilding process, this time can be reduced almost linearly before becoming bounded by memory bandwidth.

6. PERFORMANCE EVALUATION
We evaluate the performance of PI on a platform with 512 GB of memory evenly distributed among four NUMA nodes. Each NUMA node is equipped with an Intel Xeon 7540 processor, which supports 128-bit wide SIMD processing, and has an L3 cache of 18 MB and six on-chip cores, each running at 2 GHz. The CPU architecture is described in Figure 5, where QPI stands for Intel QuickPath Interconnect. The operating system installed on the experimental platform is Ubuntu 12.04 with kernel version 3.8.0-37.

Figure 5: CPU architecture for the experiments. Four NUMA nodes are connected by QPI and an input/output hub; each node has six 2 GHz cores (32 KB L1 and 256 KB L2 caches per core), a shared 18 MB L3 cache and 128 GB of local memory.

The performance of PI is extensively evaluated from various perspectives. First, we show the adaptivity of PI's query-processing performance by varying the size of the dataset and of the batch, and then adjust the number of execution threads to investigate the scalability of PI. Afterwards, we study how PI performs in the presence of mixed and skewed query workloads, and finally examine PI's performance with respect to range queries. For comparison, the result of Masstree [29] under the same experimental setting is also given whenever possible. We choose Masstree as our baseline mainly due to its high performance and maturity, which is evidenced by its adoption in a widely recognized system, namely SILO [37]. The Masstree code we use is retrieved from github [1], and its returned results are consistent with (or even better than) those presented in the original paper [29], as shown in the following sections.
If not otherwise specified, we use the following default settings for the experiments. The key length is four bytes. There are eight execution threads running on the four NUMA nodes, and the number of threads running on each node is proportional to the query workload for this node, as described in Section 4.3.3. Three datasets, each with a different number of keys, are used. The small and medium datasets have 2M and 16M keys, respectively, and the large dataset has 128M keys. The index built from the dataset is evenly distributed among the four NUMA nodes; each NUMA node holds a separate index accounting for approximately 1/4 of the keys in the dataset. All the parameters for the experiments are summarized in Table 2, with the default values stated in the text where applicable.

Table 2: Parameter table for experiments
Parameter           | Values
Dataset size (M)    | 2, 4, 8, 16, 32, 64, 128, 256
Batch size          | 2048, 4096, 8192, 16384, 32768
Number of threads   | 1, 2, 4, 8, 16, 32
Write ratio (%)     | 0, 20, 40, 60, 80, 100
Zipfian parameter θ | 0, 0.5, 0.9

The query workload is generated from the Yahoo! Cloud Serving Benchmark (YCSB) [13], and the keys queried in the workload follow a zipfian distribution with parameter θ = 0, i.e., a uniform distribution. A query batch, i.e., the set of queries allocated to a thread after partitioning in Algorithm 1, contains 8192 queries; a discussion on how to tune this parameter is given in Section 6.2. The whole index layer is asynchronously rebuilt after a fixed number (15% of the original dataset size) of data nodes have been inserted into or deleted from the index.

6.1 Dataset Size
Figure 6 shows the processing throughput of PI and Masstree for search and insert queries. For this experiment, we vary the number of keys in the dataset from 2M to 256M, and examine the query throughput for each dataset size. The whole index and query workload is evenly distributed among the four NUMA nodes, with two threads running on each node to process the queries there. From Figure 6, one can see that the throughput of both search and insert queries decreases relatively fast as the dataset size increases from 2M to 64M, and the decrease becomes moderate for larger dataset sizes. This decrease in the throughput is natural.

Figure 6: Query throughput vs dataset size (PI_search, PI_insert, Masstree_search, Masstree_insert; throughput in M queries/sec over dataset sizes 2M-256M).
For the dataset with 2M keys, as we have explained in Section 5, the entire index can be accommodated in the last level caches of the four NUMA nodes, and the throughput is hence mainly determined by the latency of fetching the entries and data nodes from the cache. As the dataset size increases, more and more entries and data nodes can no longer reside in the cache and can hence only be accessed from memory, resulting in higher latency and lower query throughput.
The throughput of insert queries in PI is not as high as that of search queries. The reason is two-fold. First, as insert queries are processed, more and more data nodes are inserted into the storage layer of the index, resulting in an increase in the time to iterate over the storage layer. Second, the creation of data nodes leads to the eviction of entries and data nodes from the cache, and hence a reduced cache hit rate. This also explains why the performance gap between insert and search queries gradually shrinks with the size of the dataset.
As can be observed from Figure 6, the throughput of both search and insert queries in Masstree is much lower than that of PI. In particular, PI is able to perform 34M search queries or 29M insert queries per second when the index can be accommodated in the cache, which is respectively five and four times more than the corresponding throughput Masstree achieves for the same dataset size. For larger datasets, PI still consistently performs at least 1.5x and 1x better than Masstree in terms of search throughput and insert throughput, respectively.

6.2 Batch Size
We now examine the effect of the size of query batches on the throughput of PI. For this experiment, the three default datasets, i.e., the small, medium and large datasets with 2M, 16M and 128M keys, respectively, are used.

Figure 7: Query throughput vs batch size (search and insert throughput for the 2M, 16M and 128M datasets over batch sizes 1024-32768).

Figure 7 shows the query throughput with respect to the query batch size for the three datasets. It can be seen that the size of query batches indeed affects query throughput.
In particular, as the size of the query batch increases, the throughput first undergoes a moderate increase. This is due to the fact that the queries contained in a batch are sorted on the key value, and a larger batch size implies better utilization of the CPU caches. In addition, there is an interception adjustment stage between key search and query execution in the processing of each query batch, whose cost depends only on the number of running threads and is thus similar across different batch sizes. Consequently, with more queries in a single batch, the number of interception adjustments can be reduced, which in turn translates into an increase in query throughput.
Figure 7 also demonstrates that the effect of batch size on query throughput is more significant for smaller datasets than for larger datasets. The reasons are as follows. For query batches of the same size, the processing time increases with the size of the dataset, as we have already shown in Figure 6. Therefore, for smaller datasets, the additional time spent in interception adjustment and in warming up the cache, which is similar across the three datasets, plays a more important role than for larger datasets. As a result, smaller datasets benefit more from the increase in query batch size than larger datasets do. It can be seen from Figure 7 that PI performs reasonably well under the default setting of 8192 for the batch size, but there still remains room for performance improvement for small datasets. Hence, there is no one-size-fits-all optimal setting for the batch size, and we leave the automatic determination of the optimal batch size as future work.

6.3 Scalability
Figure 8 shows how the query throughput of PI and of Masstree varies with the number of execution threads. The threads and the index are both evenly distributed among the four NUMA nodes. For the case in which there are n < 4 execution threads, the threads and the whole index are evenly distributed among the same number of NUMA nodes, which are randomly selected from the available four. We use three datasets, with 2M, 16M and 256M keys respectively, for this experiment.
It can be seen from Figure 8 that both PI and Masstree get their query throughput to increase significantly with more computing resources, but there exists a substantial gap in the rate of improvement between PI and Masstree, especially in the cases with a small number of threads. In PI, when the number of threads changes from 1 to 4, the throughput undergoes a super-linear increase. This is because when the number of threads is no larger than 4, the whole index is evenly distributed among the same number of NUMA nodes. Consequently, the index size, i.e., the number of entries and data nodes, halves when the number of threads doubles. With a smaller index size, the cache can be utilized more effectively, leading to an increased single-thread throughput, and hence a super-linear increase in aggregate throughput. Since smaller indices are more cache-sensitive than larger ones, they benefit more from the increase in the number of threads in terms of query-processing throughput, as can be observed from Figure 8.
When the number of threads continues to increase, the rate of increase in the query throughput of PI gradually slows down, but is still always better than or equal to that of Masstree. This flattening can be attributed to the adjustment of interceptions, which takes place when there is more than one thread servicing the queries in a NUMA node. And since the cost of interception adjustment only depends on the number