Concurrent Hash Tables: Fast and General(?)!

Tobias Maier¹, Peter Sanders¹, and Roman Dementiev²
¹ Karlsruhe Institute of Technology, Karlsruhe, Germany, {t.maier,sanders}@kit.edu
² Intel Deutschland GmbH, [email protected]

Abstract

Concurrent hash tables are one of the most important concurrent data structures, used in numerous applications. Since hash table accesses can dominate the execution time of whole applications, we need implementations that achieve good speedup even in these cases. Unfortunately, currently available concurrent hashing libraries turn out to be far away from this requirement, in particular when adaptively sized tables are necessary or contention on some elements occurs.

Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is, however, limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables.

We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.

Category: [D.1.3] Programming Techniques – Concurrent Programming; [E.1] Data Structures – Tables; [E.2] Data Storage Representation – Hash-table representations

Terms: Performance, Experimentation, Measurement, Design, Algorithms

Keywords: Concurrency, dynamic data structures, experimental analysis, hash table, lock-freedom, transactional memory

1 Introduction

A hash table is a dynamic data structure which stores a set of elements that are accessible by their key. It supports insertion, deletion, find, and update in constant expected time. In a concurrent hash table, multiple threads have access to the same table. This allows threads to share information in a flexible and efficient way. Therefore, concurrent hash tables are one of the most important concurrent data structures. See Section 4 for a more detailed discussion of concurrent hash table functionality.

To show the ubiquity of hash tables we give a short list of example applications: A very simple use case is storing sparse sets of precomputed solutions (e.g., [27], [3]). A more complicated one is aggregation as it is frequently used in analytical database queries of the form SELECT FROM...COUNT...GROUP BY x [25]. Such a query selects rows from one or several relations and counts for every key x how many rows have been found (similar queries work with SUM, MIN, or MAX). Hashing can also be used for a database join [5]. Another group of examples is the exploration of a large combinatorial search space where a hash table is used to remember the already explored elements (e.g., in dynamic programming [36], itemset mining [28], a chess program, or when exploring an implicitly defined graph in model checking [37]). Similarly, a hash table can maintain a set of cached objects to save I/Os [26]. Further examples are duplicate removal, storing the edge set of a sparse graph in order to support edge queries [23], maintaining the set of nonempty cells in a grid data structure used in geometry processing (e.g., [7]), or maintaining the children in tree data structures such as van Emde Boas search trees [6] or suffix trees [21].
Many of these applications have in common that – even in the sequential version of the program – hash table accesses constitute a significant fraction of the running time. Thus, it is essential to have highly scalable concurrent hash tables that actually deliver significant speedups in order to parallelize these applications. Unfortunately, currently available general purpose concurrent hash tables do not offer the needed scalability (see Section 8 for concrete numbers). On the other hand, it seems to be folklore that a lock-free linear probing hash table where keys and values are machine words, which is preallocated to a bounded size, and which supports no true deletion operation can be implemented using atomic compare-and-swap (CAS) instructions [36]. Find-operations can even proceed naively and without any write operations. In Section 4 we explain our own implementation (folklore) in detail, after elaborating on some related work and introducing the necessary notation (in Sections 2 and 3, respectively).

To see the potential big performance differences, consider an exemplary situation with mostly read-only access to the table and heavy contention for a small number of elements that are accessed again and again by all threads. folklore actually profits from this situation because the contended elements are likely to be replicated into local caches. On the other hand, any implementation that needs locks or CAS instructions for find-operations will become much slower than the sequential code on current machines. The purpose of our paper is to document and explain performance differences, and, more importantly, to explore to what extent we can make folklore more general with an acceptable deterioration in performance.

These generalizations are discussed in Section 5. We explain how to grow (and shrink) such a table, and how to support deletions and more general data types. In Section 6 we explain how hardware transactional memory can be used to speed up insertions and updates and how it may help to handle more general data types. After describing implementation details in Section 7, Section 8 experimentally compares our hash tables with six of the most widely used concurrent hash tables for microbenchmarks including insertion, finding, and aggregating data. We look at both uniformly distributed and skewed input distributions. Section 9 summarizes the results and discusses possible lines of future research.

2 Related Work

This publication follows up on our previous findings about generalizing fast concurrent hash tables [18]. In addition to describing how to generalize a fast linear probing hash table, we offer an extensive experimental analysis comparing many concurrent hash tables from several libraries.

There has been extensive previous work on concurrent hashing. The widely used textbook "The Art of Multiprocessor Programming" [12] by Herlihy and Shavit devotes an entire chapter to concurrent hashing and gives an overview of previous work. However, it seems to us that a lot of previous work focuses more on concepts and correctness but surprisingly little on scalability. For example, most of the discussed growing mechanisms assume that the size of the hash table is known exactly, without a discussion that this introduces a performance bottleneck limiting the speedup to a constant. Similarly, the actual migration is often done sequentially.

Stivala et al. [36] describe a bounded concurrent linear probing hash table specialized for dynamic programming that only supports insert and find. Their insert operation starts from scratch when the CAS fails, which seems suboptimal in the presence of contention. An interesting point is that they need only word-size CAS instructions at the price of reserving a special empty value. This technique could also be adapted to port our code to machines without 128-bit CAS.

Kim and Kim [14] compare this table with a cache-optimized lockless implementation of hashing with chaining and with hopscotch hashing [13].
The experiments use only uniformly distributed keys, i.e., there is little contention. Both linear probing and hashing with chaining perform well in that case. The evaluation of find-performance is a bit inconclusive: chaining wins but uses more space than linear probing. Moreover, it is not specified whether this is for successful (use keys of inserted elements) or mostly unsuccessful (generate fresh keys) accesses. We suspect that varying these parameters could reverse the result.

Gao et al. [10] present a theoretical dynamic linear probing hash table that is lock-free. The main contribution is a formal correctness proof. Not all details of the algorithm or even an implementation are given. There is also no analysis of the complexity of the growing procedure.

Shun and Blelloch [34] propose phase concurrent hash tables which are allowed to use only a single operation within a globally synchronized phase. They show how phase concurrency helps to implement some operations more efficiently and even deterministically in a linear probing context. For example, deletions can adapt the approach from [15] and rearrange elements. This is not possible in a general hash table since this might cause find-operations to report false negatives. They also outline an elegant growing mechanism, albeit without implementing it and without filling in all the details like how to initialize newly allocated tables. They propose to trigger a growing operation when any operation has to scan more than k log n elements, where k is a tuning parameter. This approach is tempting since it is somewhat faster than the approximate size estimator we use. We actually tried that but found that this trigger has a very high variance – sometimes it triggers late, making operations rather slow; sometimes it triggers early, wasting a lot of space.
We also have theoretical concerns, since the bound k log n on the length of the longest probe sequence implies strong assumptions on certain properties of the hash function. Shun and Blelloch make extensive experiments including applications from the problem based benchmark suite [35].

Li et al. [17] use the bucket cuckoo-hashing method by Dietzfelbinger and Weidling [8] and develop a concurrent implementation. They exploit that, using a BFS-based insertion algorithm, the number of element moves for an insertion is very small. They use fine-grained locks which can sometimes be avoided using transactional memory (Intel TSX). As a result of their work, they implemented the small open source library libcuckoo, which we measure against (which does not use TSX). This approach has the potential to achieve very good space efficiency. However, our measurements indicate that the performance penalty is high.

The practical importance of concurrent hash tables also leads to new and innovative implementations outside of the scientific community. A good example of this is the Junction library, which was published by Preshing [31] in the beginning of 2016, shortly after our initial publication [19].

3 Preliminaries

We assume that each application thread has its own designated hardware thread or processing core and denote the number of these threads with p. A data structure is non-blocking if no blocked thread currently accessing this data structure can block an operation on the data structure by another thread. A data structure is lock-free if it is non-blocking and guarantees global progress, i.e., there must always be at least one thread finishing its operation in a finite number of steps.

Hash tables store a set of ⟨Key, Value⟩ pairs (elements).¹ A hash function h maps each key to a cell of a table (an array). The number of elements in the hash table is denoted n and the number of operations is m. For the purpose of algorithm analysis, we assume that n and m are ≫ p² – this allows us to simplify algorithm complexities by hiding O(p) terms that are independent of n and m in the overall cost. Sequential hash tables support the insertion of elements, and finding, updating, or deleting an element with a given key – all of this in constant expected time. Further operations compute n (size), build a table with a given number of initial elements, and iterate over all elements (forall).

¹ Much of what is said here can be generalized to the case when elements are black boxes from which keys are extracted by an accessor function.

Linear Probing  is one of the most popular sequential hash table schemes used in practice. An element ⟨x,a⟩ is stored at the first free table entry following position h(x) (wrapping around when the end of the table is reached). Linear probing is at the same time simple and efficient – if the table is not too full, a single cache line access will be enough most of the time. Deletion can be implemented by rearranging the elements locally [15] to avoid holes violating the invariant mentioned above. When the table becomes too full or too empty, the elements can be migrated to a larger or smaller table, respectively. The migration cost can be charged to insertions and deletions, causing amortized constant overhead.
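To make the scheme concrete, here is a minimal sequential linear probing sketch in C++ – an illustration under our own simplifying assumptions (key 0 reserved as empty, a placeholder multiplicative hash, and a table that never becomes completely full), not the implementation evaluated in this paper:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal sequential linear probing (illustration only): an element <k,d>
    // is stored at the first free cell at or after h(k), wrapping around.
    struct SeqLinearProbing {
        static constexpr uint64_t empty_key = 0;   // assumes key 0 is never used
        struct Cell { uint64_t key = empty_key; uint64_t data = 0; };
        std::vector<Cell> table;

        explicit SeqLinearProbing(size_t capacity) : table(capacity) {}

        // Any hash function works here; this multiplicative one is a placeholder.
        size_t h(uint64_t k) const { return (k * 0x9e3779b97f4a7c15ull) % table.size(); }

        bool insert(uint64_t k, uint64_t d) {
            for (size_t i = h(k);; i = (i + 1) % table.size()) {   // wrap around
                if (table[i].key == k) return false;               // already present
                if (table[i].key == empty_key) {                   // first free entry
                    table[i] = Cell{k, d};
                    return true;
                }
            }
        }

        bool find(uint64_t k, uint64_t& out) const {
            for (size_t i = h(k);; i = (i + 1) % table.size()) {
                if (table[i].key == k) { out = table[i].data; return true; }
                if (table[i].key == empty_key) return false;       // hole ends the probe
            }
        }
    };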
4 Concurrent Hash Table Interface and Folklore Implementation

Although it seems quite clear what a hash table is and how this generalizes to concurrent hash tables, there is a surprising number of details to consider. Therefore, we will quickly go over some of our interface decisions and detail how this interface can be implemented in a simple, fast, lock-free concurrent linear probing hash table.

This hash table will have a bounded capacity c that has to be specified when the table is constructed. It is the basis for all other hash table variants presented in this publication. We call this table the folklore solution, because variations of it are used in many publications and it is not clear to us by whom it was first published.

The most important requirement for concurrent data structures is that they should be linearizable, i.e., it must be possible to order the hash table operations in some sequence – without reordering two operations of the same thread – so that executing them sequentially in that order yields the same results as the concurrent processing. For a hash table data structure, this basically means that all operations should be executed atomically some time between their invocation and their return. For example, it has to be avoided that a find returns an inconsistent state, e.g., a half-updated data field that was never actually stored at the corresponding key.

Our variant of the folklore solution ensures the atomicity of operations using 2-word atomic CAS operations for all changes of the table. As long as the key and the value each only use one machine word, we can use 2-word CAS operations to atomically manipulate a stored key together with the corresponding value. There are other variants that avoid 2-word compare-and-swap operations, but they often need a designated empty value (see [31]). Since the corresponding machine instructions are widely available on modern hardware, using them should not be a problem. If the target architecture does not support the needed instructions, the implementation can easily be switched to a variant of the folklore solution which does not use 2-word CAS. As it can easily be deduced from the context, we will usually omit the "2-word" prefix and use the abbreviation CAS for both single and double word CAS operations.

Initialization  The constructor allocates an array of size c consisting of 128-bit aligned cells whose keys are initialized to the empty value.

Modifications  We propose to categorize all changes to the hash table content into one of the following three functions, which can be implemented very similarly (this does not cover deletions).

insert(k,d): Returns false if an element with the specified key is already present. Only one operation should succeed if multiple threads are inserting the same key at the same time.

update(k,d,up): Returns false if there is no value stored at the specified key; otherwise this function atomically updates the stored value to new = up(current,d). Notice that the resulting value can depend on both the current value and the input parameter d.

insertOrUpdate(k,d,up): This operation updates the current value if one is present; otherwise the given data element is inserted as the new value. The function returns true if insertOrUpdate performed an insert (the key was not present), and false if an update was executed.

We chose this interface for two main reasons. It allows applications to quickly differentiate between inserting and changing an element – this is especially useful since the thread which first inserted a key can be identified uniquely. Additionally, it allows transparent, lockless updates that can be more complex than just replacing the current value (think of CAS or fetch-and-add).

The update interface using an update function deserves some special attention, as it is a novel approach compared to most interfaces we encountered during our research. Most implementations fall into one of two categories: They return mutable references to table elements – forcing the user to implement atomic operations on the data type; or they offer an update function which usually replaces the current value with a new one – making it very hard to implement atomic changes like a simple counter (find + increment + overwrite is not necessarily atomic).
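As an illustration of how this interface composes, consider counting key occurrences. This is a sketch with hypothetical names matching the signatures described above, not the actual library API:

    #include <cstdint>

    // Update function: computes the new value from the key, the currently
    // stored value, and the parameter d passed along with the operation.
    uint64_t add(uint64_t /*key*/, uint64_t current, uint64_t d) {
        return current + d;
    }

    // Either inserts <key,1> or atomically increments the stored counter.
    // The return value tells the caller whether this thread inserted the key.
    template <class ConcurrentTable>
    bool countOccurrence(ConcurrentTable& table, uint64_t key) {
        return table.insertOrUpdate(key, 1, add);
    }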
In Algorithm 1 we show the pseudocode of the insertOrUpdate function.

ALGORITHM 1: Pseudocode for the insertOrUpdate operation
  Input: Key k, Data Element d, Update Function up: Key × Val × Val → Val
  Output: Boolean – true when a new key was inserted, false if an update occurred
   1  i = h(k);
   2  while true do
   3      i = i % c;
   4      current = table[i];
   5      if current.key == empty_key then          // Key is not present yet ...
   6          if table[i].CAS(current, ⟨k,d⟩) then
   7              return true
   8          else
   9              i--;                              // revisit the same cell
  10      else if current.key == k then             // Same key already present ...
  11          if table[i].atomicUpdate(current, d, up) then
                  // default: atomicUpdate(·) = CAS(current, ⟨k, up(k, current.data, d)⟩)
  12              return false
  13          else
  14              i--;                              // revisit the same cell
  15      i++;

The operation computes the hash value of the key and proceeds to look for an element with the appropriate key (beginning at the corresponding position). If no element matching the key is found (when an empty space is encountered), the new element has to be inserted. This is done using a CAS operation. A failed swap can only be caused by another insertion into the same cell. In this case, we have to revisit the same cell to check whether the inserted element matches the current key. If a cell storing the same key is found, it will be updated using the atomicUpdate function. This function is usually implemented by evaluating the passed update function (up) and using a CAS operation to change the cell. In the case of multiple concurrent updates, at least one will be successful.

In our (C++) implementation, partial template specialization can be used to implement more efficient atomicUpdate variants using atomic operations – changing the default in line 11, e.g., overwrite (using a single word store) or increment (using fetch-and-add).

The code presented in Algorithm 1 can easily be modified to implement the insert (return false when the key is already present – line 10) and update (return true after a successful update – line 12, and false when the key is not found – line 5) functions. All modification functions have a constant expected running time.
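In C++, the 2-word CAS on a cell can be realized roughly as follows – a sketch under the stated assumptions (word-sized key and value, key 0 reserved as empty; names are illustrative), not our exact code. std::atomic on a 16-byte struct typically compiles to cmpxchg16b on x86-64 (e.g., with -mcx16), but whether it is lock-free is platform-dependent:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Key and value are one machine word each, so a cell occupies 16 bytes
    // and can be manipulated with a 2-word CAS.
    struct Cell { uint64_t key; uint64_t data; };
    constexpr uint64_t empty_key = 0;   // assumes key 0 is reserved

    template <class UpdateFn>
    bool insertOrUpdate(std::atomic<Cell>* table, size_t c, size_t hashPos,
                        uint64_t k, uint64_t d, UpdateFn up) {
        for (size_t i = hashPos;; ++i) {
            i %= c;
            Cell current = table[i].load(std::memory_order_acquire);
            if (current.key == empty_key) {                    // key not present yet
                if (table[i].compare_exchange_strong(current, Cell{k, d}))
                    return true;                               // we inserted it
                --i;                                           // CAS failed: revisit cell
            } else if (current.key == k) {                     // key already present
                Cell desired{k, up(k, current.data, d)};
                if (table[i].compare_exchange_strong(current, desired))
                    return false;                              // we updated it
                --i;                                           // lost a race: retry cell
            }
        }
    }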
Lookup  Since this folklore implementation does not move elements within the table, it would be possible for find(k) to return a reference to the corresponding element. In our experience, returning references directly tempts inexperienced programmers to operate on these references in a way that is not necessarily thread-safe. Therefore, our implementation returns a copy of the corresponding cell (⟨k,d⟩) if one is found (⟨empty_key,·⟩ otherwise). The find operation has a constant expected running time.

Our implementation of find is somewhat non-trivial, because it is not possible to read two machine words at once using an atomic instruction.² Therefore, it is possible for a cell to be changed in between reading its key and its value – this is called a torn read. We have to make sure that torn reads cannot lead to any wrong behavior. There are two kinds of interesting torn reads: First, an empty key is read while the searched key is inserted into the same cell; in this case the element is not found (consistent, since it has not been fully inserted). Second, the element is updated between the key being read and the data being read; since the data is read second, only the newer data is read (consistent with a finished update).

² The element is not read atomically, because x86 does not support that. One could use a 2-word CAS to achieve the same effect, but this would have disastrous effects on performance when many threads try to find the same element.

Deletions  The folklore solution can only handle deletions using dummy elements – called tombstones. Usually the key stored in a cell is replaced with del_key. Afterwards the cell cannot be used anymore. This method of handling deleted elements is usually not feasible, as it does not increase the capacity for new elements. In Section 5.4 we will show how our generalizations can be used to handle tombstones more efficiently.

Bulk Operations  While not often used in practice, the folklore table can be modified to support operations like buildFrom(·) (see Section 5.5) – using a bulk insertion which can be more efficient than element-wise insertion – or forall(f) – which can be implemented in an embarrassingly parallel fashion by splitting the table between threads.

Size  Keeping track of the number of contained elements deserves special notice here because it turns out to be significantly harder in concurrent hash tables. In sequential hash tables, it is trivial to count the number of contained elements – using a single counter. The same method is possible in parallel tables using atomic fetch-and-add operations, but it introduces a massive amount of contention on one single counter, creating a performance bottleneck. Because of this we did not include a counting method in the folklore implementation. In Section 5.2 we show how this can be alleviated using an approximate count.

5 Generalizations and Extensions

In this section, we detail how to adapt the concurrent hash table implementation – described in the previous section – to be universally applicable to all hash table workloads. Most of our efforts have gone into a scalable migration method that is used to move all elements stored in one table into another table. It turns out that a fast migration can solve most shortcomings of the folklore implementation (especially deletions and adaptable size).

5.1 Storing Thread-Local Data

By itself, storing thread-specific data connected to a hash table does not offer additional functionality, but it is necessary to efficiently implement some of our other extensions. Per-thread data can be used in many different ways, from counting the number of insertions to caching shared resources.

From a theoretical point of view, it is easy to store thread-specific data. The additional space is usually only dependent on the number of threads (O(p) additional space), since the stored data is often constant-sized. Compared to the hash table this is usually negligible (p ≪ n < c).

Storing thread-specific data is challenging from a software design and performance perspective. Some of our competitors use a register(·) function that each thread has to call before accessing the table. This allocates some memory that can be accessed through the global hash table object. Our solution uses explicit handles. Each thread has to create a handle before accessing the hash table. These handles can store thread-specific data, since they are not shared between threads. This is not only in line with the RAII idiom (resource acquisition is initialization [24]), it also protects our implementation from some performance pitfalls like unnecessary indirections and false sharing.³ Moreover, the data can easily be deleted once the thread does not use the hash table anymore (by deleting the handle).

³ Significant slowdown created by the cache coherency protocol due to multiple threads repeatedly changing distinct values within the same cache line.
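The handle idea might look as follows in C++ – a sketch with illustrative names, not our actual classes (it also previews the per-thread insertion counter used for the size approximation in Section 5.2):

    #include <atomic>
    #include <cstdint>

    struct TableGlobals {
        std::atomic<uint64_t> insertions{0};    // global counter I (Section 5.2)
    };

    // Per-thread state lives in the handle rather than in the shared table
    // object, avoiding extra indirections and false sharing; destroying the
    // handle releases the per-thread data (RAII).
    class Handle {
        TableGlobals& globals;
        uint64_t localInsertions = 0;           // thread-private, no synchronization
        uint64_t flushThreshold;                // Θ(p), randomized between 1 and p

    public:
        Handle(TableGlobals& g, uint64_t threshold)
            : globals(g), flushThreshold(threshold) {}

        void countInsertion() {
            if (++localInsertions >= flushThreshold) {   // flush every Θ(p) inserts
                globals.insertions.fetch_add(localInsertions,
                                             std::memory_order_relaxed);
                localInsertions = 0;
            }
        }

        ~Handle() {                             // flush the remainder on destruction
            globals.insertions.fetch_add(localInsertions,
                                         std::memory_order_relaxed);
        }
    };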
5.2 Approximating the Size

Keeping an exact count of the elements stored in the hash table can often lead to contention on one count variable. Therefore, we propose to support only an approximate size operation.

To keep an approximate count of all elements, each thread maintains a local counter of its successful insertions (using the method described in Section 5.1). Every Θ(p) such insertions this counter is atomically added to a global insertion counter I and then reset. Contention at I can be provably made small by randomizing the exact number of local insertions accepted before adding to the global counter, e.g., between 1 and p. I underestimates the size by at most O(p²). Since we assume the size to be ≫ p², this still means a small relative error. By adding the maximal error, we also get an upper bound for the table size.

If deletions are also allowed, we maintain a global counter D in a similar way. S = I − D is then a good estimate of the total size as long as S ≫ p². When a table is migrated for growing or shrinking (see Section 5.3.1), each migration thread locally counts the elements it moves. At the end of the migration, the local counters are added up to create the initial count for I (D is set to 0).

This method can also be extended to give an exact count – in the absence of concurrent insertions/deletions. To do this, a list of all handles has to be stored at the global hash table object. A thread can then iterate over all handles, computing the actual element count.

5.3 Table Migration

While Gao et al. [10] have shown that lock-free dynamic linear probing hash tables are possible, there is no result on their practical feasibility. Our focus is geared more towards engineering the fastest migration possible; therefore, we are fine with small amounts of locking, as long as it improves the overall performance.

5.3.1 Eliminating Unnecessary Contention from the Migration

If the table size is not fixed, it makes sense to assume that the hash function h yields a large pseudorandom integer which is then mapped to a cell position in 0..c−1, where c is the current capacity.⁴ We will discuss a way to do this by scaling. If h yields values in the global range 0..U−1, we map key x to cell h_c(x) := ⌊h(x)·c/U⌋. Note that when both c and U are powers of two, the mapping can be implemented by a simple shift operation.

⁴ We use x..y as a shorthand for {x, ..., y} in this paper.

Growing  Now suppose that we want to migrate the table into a table that has at least the same size (growing factor γ ≥ 1). Exploiting the properties of linear probing and our scaling function, there is a surprisingly simple way to migrate the elements from the old table to the new table in parallel which results in exactly the same order a sequential algorithm would produce and which completely avoids synchronization between threads.

Lemma 1. Consider a range a..b of nonempty cells in the old table with the property that the cells (a−1) mod c and (b+1) mod c are both empty – call such a range a cluster (see Figure 1a). When migrating a table, sequential migration will map the elements stored in that cluster into the range ⌊γa⌋..⌊γ(b+1)⌋ in the target table, regardless of the rest of the source array.

Proof. Let x be an element stored in the cluster a..b at position p(x) = h_c(x) + d(x), where d(x) is its displacement. Then h_c(x) has to be in the cluster a..b, because linear probing does not displace elements over empty cells. From h_c(x) = ⌊h(x)·c/U⌋ ≥ a follows h(x)·c/U ≥ a, and therefore h(x)·c′/U ≥ a·c′/c = γa, where c′ = γc denotes the target capacity. Similarly, from ⌊h(x)·c/U⌋ ≤ b follows h(x)·c/U < b+1, and therefore h(x)·c′/U < γ(b+1). Hence the target position h_{c′}(x) = ⌊h(x)·c′/U⌋ lies within ⌊γa⌋..⌊γ(b+1)⌋.

Therefore, two distinct clusters in the source table cannot overlap in the target table. We can exploit this lemma by assigning entire clusters to migrating threads, which can then process each cluster completely independently. Distributing clusters between threads can easily be achieved by first splitting the table into blocks (regardless of the table's contents) which we assign to threads for parallel migration. A thread assigned a block d..e will migrate those clusters that start within this range – implicitly moving the block borders to free cells (as seen in Figure 1b). Since the average cluster length is short and c = Ω(p²), it is sufficient to deal out blocks of size Ω(p) using a single shared global variable and atomic fetch-and-add operations. Additionally, each thread is responsible for initializing all cells in its region of the target table. This is important because sequentially initializing the hash table can quickly become infeasible.
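A possible C++ rendering of the scaling function (a sketch; with U = 2⁶⁴ the quotient ⌊h(x)·c/U⌋ is the upper half of a 128-bit product, using the GCC/Clang __uint128_t extension):

    #include <cstdint>

    // h_c(x) = floor(h(x) * c / U) with U = 2^64: take the upper 64 bits of
    // the 128-bit product of hash value and capacity.
    inline uint64_t scaleToCell(uint64_t hashValue, uint64_t capacity) {
        return (uint64_t)(((__uint128_t)hashValue * capacity) >> 64);
    }

    // Power-of-two special case: keep the top log2(c) bits of the hash value.
    inline uint64_t scaleToCellPow2(uint64_t hashValue, unsigned logCapacity) {
        return hashValue >> (64 - logCapacity);
    }

Since the mapping is monotonic in h(x), growing by γ = c′/c preserves the relative order of hash positions – exactly the property that Lemma 1 exploits.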
[Figure 1: Cluster migration and work distribution. (a) Two neighboring clusters and their non-overlapping target areas (γ = 2). (b) Left: table split into even blocks. Right: resulting cluster distribution (moved implicit block borders).]

Note that waiting for the last thread at the end of the migration introduces some waiting (locking), but this does not create significant work imbalance, since the block/cluster migration is really fast and clusters are expected to be short.

Shrinking  Unfortunately, the nice structural Lemma 1 no longer applies. We can still parallelize the migration with little synchronization. Once more, we cut the source table into blocks that we assign to threads for migration. The scaling function maps each block a..b in the source table to a block a′..b′ in the target table. We have to be careful with rounding issues so that the blocks in the target table are non-overlapping. We can then proceed in two phases. First, a migrating thread migrates those elements that move from a..b to a′..b′. These migrations can be done in a sequential manner, since target blocks are disjoint. The majority of elements will fit into the target block. Then, after a barrier synchronization, all elements that did not fit into their respective target blocks are migrated using concurrent insertion, i.e., using atomic operations. This has negligible overhead since such elements only exist at the boundaries of blocks. The resulting allocation of elements in the target table will no longer be the same as for a sequential migration, but as long as the data structure invariants of a linear probing hash table are fulfilled, this is not a problem.

5.3.2 Hiding the Migration from the Underlying Application

To make the concurrent hash table more general and easy to use, we would like to avoid all explicit synchronization. The growing (and shrinking) operations should be performed asynchronously when needed, without involvement of the underlying application. The migration is triggered once the table is filled to a factor ≥ α (e.g., 50%); this is estimated using the approximate count from Section 5.2 and checked whenever the global count is updated. When a growing operation is triggered, the capacity will be increased by a factor of γ ≥ 1 (usually γ = 2). The difficulty is ensuring that this operation is done in a transparent way, without introducing any inconsistent behavior and without incurring undue overheads.

To hide the migration process from the user, two problems have to be solved. First, we have to find threads to grow the table, and second, we have to ensure that changing elements in the source table will not lead to any inconsistent states in the target table (possibly reverting changes made during the migration). Each of these problems can be solved in multiple ways. We implemented two strategies for each of them, resulting in four different variants of the hash table (mix and match).

Recruiting User-Threads  A simple approach to dynamically allocate threads to growing the table is to "enslave" threads that try to perform table accesses that would otherwise have to wait for the completion of the growing process anyway. This works really well when the table is regularly accessed by all user-threads, but is inefficient in the worst case when most threads stop accessing the table at some point, e.g., waiting for the completion of a global computation phase at a barrier. The few threads still accessing the table at this point will need a lot of time for growing (up to Ω(n)) while most threads are waiting for them. One could try to also enslave waiting threads, but it looks difficult to do this in a sufficiently general and portable way.
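The trigger and the enslavement entry point might be sketched as follows – illustrative names and a heavily simplified control flow, not our actual implementation:

    #include <atomic>
    #include <cstdint>

    struct GrowControl {
        std::atomic<uint64_t> approxSize{0};   // estimate from Section 5.2
        std::atomic<bool> growing{false};
        uint64_t capacity = 0;
        double alpha = 0.5;                    // trigger fill factor

        // Checked whenever the global count is updated.
        void maybeTriggerGrow() {
            uint64_t threshold = (uint64_t)(alpha * (double)capacity);
            if (approxSize.load(std::memory_order_relaxed) >= threshold) {
                bool expected = false;
                if (growing.compare_exchange_strong(expected, true)) {
                    // exactly one thread initiates the migration here
                }
            }
        }

        // Called at the start of every table access: a thread that observes a
        // running migration helps with it instead of waiting idly.
        template <class HelpFn>
        void enterOperation(HelpFn helpMigrate) {
            if (growing.load(std::memory_order_acquire))
                helpMigrate();                 // "enslaved" until migration finishes
        }
    };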
Using a Dedicated Thread Pool  A provably efficient approach is to maintain a pool of p threads dedicated to growing the table. They are blocked until a growing operation is triggered. This is when they are awoken to collectively perform the migration in time O(n/p) and then go back to sleep. During a migration, application threads might have to sleep until the migration threads are finished. This will increase the CPU time of our migration threads, making this method nearly as efficient as the enslavement variant. Using a reasonable computation model, one can show that using thread pools for migration increases the cost of each table access by at most a constant in a globally amortized sense (over the non-growing folklore solution). We omit the relatively simple proof.

To remain fair to all competitors, we used exactly as many threads for the thread pool as there were application threads accessing the table. Additionally, each migration thread was bound to a core that was also used by one corresponding application thread.

Marking Moved Elements for Consistency (asynchronous)  During the migration it is important that no element can be changed in the old table after it has been copied to the new table. Otherwise, it would be hard to guarantee that changes are correctly applied to the new table. The easiest solution to this problem is to mark each cell before it is copied. Marking each cell can be done using a CAS operation to set a special marked bit which is stored in the key. In practice this reduces the possible key space. If this reduction is a problem, see Section 5.6 on how to circumvent it. To ensure that no copied cell can be changed, it suffices to ensure that no marked cell can be changed. This can easily be done by checking the bit before each writing operation and by using CAS operations for each update. This prohibits the use of fast atomic operations to change element values.
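The marking step might look roughly like this – a sketch where the names and the exact bit layout are illustrative assumptions:

    #include <atomic>
    #include <cstdint>

    struct Cell { uint64_t key; uint64_t data; };
    constexpr uint64_t mark_bit = 1ull << 63;   // halves the usable key space

    // The migrating thread sets the reserved bit in the key word via CAS
    // before copying the cell's contents to the new table.
    inline bool tryMark(std::atomic<Cell>& cell) {
        Cell cur = cell.load(std::memory_order_acquire);
        while (!(cur.key & mark_bit)) {
            Cell marked{cur.key | mark_bit, cur.data};
            if (cell.compare_exchange_weak(cur, marked))
                return true;    // we marked it; its contents can now be copied
            // CAS failure reloaded cur; loop re-checks the (possibly new) key
        }
        return false;           // someone else marked it first
    }

    // Writers must check the bit before every update and fail on marked cells,
    // which is why updates have to go through CAS rather than plain atomics.
    inline bool writerMayUpdate(const Cell& c) {
        return (c.key & mark_bit) == 0;
    }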
After the migration, the old hash table has to be deallocated. Before deallocating an old table, we have to make sure that no thread is currently using it anymore. This problem can generally be solved by using reference counting. Instead of storing the table with a usual pointer, we use a reference-counted pointer (e.g., std::shared_ptr) to ensure that the table is eventually freed.

The main disadvantage of counting pointers is that acquiring a counting pointer requires an atomic increment on a shared counter. Therefore, it is not feasible to acquire a counting pointer for each operation. Instead, a copy of the shared pointer can be stored locally, together with the increasing version number of the corresponding hash table (using the method from Section 5.1). At the beginning of each operation, we can use the local version number to make sure that the local counting pointer still points to the newest table version. If this is not the case, a new pointer will be acquired. This happens only once per version of the hash table. The old table will automatically be freed once every thread has updated its local pointer. Note that counting pointers cannot be exchanged in a lock-free manner, increasing the cost of changing the current table (using a lock). This lock could be avoided by using a hazard pointer. We did not do this.

Prevent Concurrent Updates to Ensure Consistency (synchronized)  We propose a simple protocol inspired by read-copy-update protocols [22]. The thread triggering the growing operation sets a global growing flag using a CAS instruction. A thread performing a table access sets a local busy flag when starting an operation. Then it inspects the growing flag; if that flag is set, the local flag is unset and the thread waits for the completion of the growing operation, or helps with migrating the table, depending on the current growing strategy. The growing thread waits until all busy flags have been unset at least once before starting the migration. When the migration is completed, the growing flag is reset, signaling to the waiting threads that they can safely continue their table operations. Because this protocol ensures that no thread is accessing the previous table after the beginning of the migration, the old table can be freed without using reference counting.

We call this method (semi-)synchronized, because grow and update operations are disjoint. Threads participating in one growing step still arrive asynchronously, e.g., when the parent application calls a hash table operation. Compared to the marking-based protocol, we save cost during migration by avoiding CAS operations. However, this comes at the expense of setting the busy flags for every operation. Our experiments indicate that overall this is only advantageous for updates using atomic operations like fetch-and-add that cannot coexist with the marker flags.

5.4 Deletions

For concurrent linear probing, we combine tombstoning (see Section 4) with our migration algorithm to clean the table once it is filled with too many tombstones.

A tombstone is an element that has a del_key in place of its key. The key x of a deleted entry ⟨x,a⟩ is atomically changed to ⟨del_key,a⟩. Other table operations scan over these deleted elements like over any other nonempty entry. No inconsistencies can arise from deletions. In particular, a concurrent find-operation with a torn read will return the element before the deletion, since the delete-operation leaves the value slot a untouched. A concurrent insert ⟨x,b⟩ might read the key x before it is overwritten by the deletion and return false because it concludes that an element with key x is already present. This is consistent with the outcome when the insertion is performed before the deletion in a linearization.

This method of deletion can easily be implemented in the folklore solution from Section 4. But there the starting capacity has to be set dependent on the number of overall insertions, since this form of deletion does not free up any of the deleted cells. Even worse, tombstones will fill up the table and slow down find queries.

Both of these problems can be solved by migrating all non-tombstone elements into a new table. The decision when to migrate the table should be made solely based on the number of insertions I (= number of nonempty cells). The count of all non-deleted elements I − D is then used to decide whether the table should grow, keep the same size (notice that γ = 1 is a special case for our optimized migration), or shrink. Either way, all tombstones are removed in the course of the element migration.
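A tombstoning sketch in C++ (illustrative names; del_key is an assumed reserved value): deletion replaces only the key word via CAS and leaves the value slot untouched, so a concurrent torn read still observes the old, consistent element:

    #include <atomic>
    #include <cstdint>

    struct Cell { uint64_t key; uint64_t data; };
    constexpr uint64_t del_key = ~0ull;        // reserved key marking a tombstone

    inline bool tryDelete(std::atomic<Cell>& cell, uint64_t x) {
        Cell cur = cell.load(std::memory_order_acquire);
        while (cur.key == x) {
            if (cell.compare_exchange_weak(cur, Cell{del_key, cur.data}))
                return true;                   // cell is now a tombstone
            // on failure, cur was reloaded; retry unless the key has changed
        }
        return false;                          // key is not (or no longer) here
    }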
5.5 Bulk Operations

Building a hash table for n elements passed to the constructor can be parallelized using integer sorting by the hash function value. This works in time O(n/p) regardless of how many times an element is inserted, i.e., sorting circumvents contention. See the work of Müller et al. [25] for a discussion of this phenomenon in the context of aggregation.

This can be generalized for processing batches of size m = Ω(n) that may even contain a mix of insertions, deletions, and updates. We first outline a simple algorithm for bulk insertion that works without explicit sorting, albeit without avoiding contention. Let a denote the old size of the hash table and b the number of insertions. Then a+b is an upper bound for the new table size. If necessary, grow the table to that size or larger (see below). Finally, in parallel, insert the new elements.

More generally, processing batches of size m = Ω(n) in a globally synchronized way can use the same sorting-based strategy. We outline it for the case of bulk insertions; generalization to deletions, updates, or mixed batches is possible: Integer sort the elements to be inserted by their hash key in expected time O(m/p). Among elements with the same hash value, remove all but the last. Then "merge" the batch and the hash table into a new hash table (that may have to be larger to provide space for the new elements). We can adapt ideas from parallel merging [11]. We co-partition the sorted insertion array and the hash table into corresponding pieces of size O(m/p). Most of the work can now be done on these pieces in an embarrassingly parallel way – each piece of the insertion array is scanned sequentially by one thread. Consider an element ⟨x,a⟩ and the previous insertion position i in the table. Then we start looking for a free cell at position max(h(x),i).

5.6 Restoring the Full Key Space

Our table uses special keys, like the empty key (empty_key) and the deleted key (del_key). Elements that actually have these keys cannot be stored in the hash table. This can easily be fixed by using two special slots in the global hash table data structure. This makes some case distinctions necessary but should have a rather low impact on the overall performance.

One of our growing variants (asynchronous) uses a marker bit in its key field. This halves the possible key space from 2⁶⁴ to 2⁶³. To regain the lost key space, we can store the lost bit implicitly. Instead of using one hash table that holds all elements, we use two subtables t₀ and t₁. The subtable t₀ holds all elements whose key does not have its topmost bit set, while t₁ stores all elements whose key does have the topmost bit set; instead of storing the topmost bit explicitly, it is removed. Each element can still be found in constant time, because when looking for a certain key, it is immediately clear from the topmost bit which subtable has to be searched.
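The implicit-bit trick can be sketched as follows – illustrative names wrapping any table type with the insert/find signatures used earlier:

    #include <cstdint>

    // t0 stores keys whose topmost bit is 0; t1 stores the remaining keys
    // with the topmost bit stripped, so each subtable regains the bit that
    // the asynchronous growing variant reserves as its marker.
    template <class Table>
    struct SplitTable {
        static constexpr uint64_t top_bit = 1ull << 63;
        Table t0, t1;                          // two independent subtables

        bool insert(uint64_t key, uint64_t data) {
            return (key & top_bit) ? t1.insert(key & ~top_bit, data)
                                   : t0.insert(key, data);
        }

        bool find(uint64_t key, uint64_t& out) {
            return (key & top_bit) ? t1.find(key & ~top_bit, out)
                                   : t0.find(key, out);
        }
    };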