Scalable Multi-Database Privacy-Preserving Record Linkage using Counting Bloom Filters

Dinusha Vatsalan, Peter Christen, and Erhard Rahm†
Research School of Computer Science, The Australian National University, Canberra ACT 0200, Australia
†Universität Leipzig, Institut für Informatik, 04109, Leipzig, Germany
{dinusha.vatsalan, peter.christen}@anu.edu.au, †[email protected]

arXiv:1701.01232v1 [cs.DB] 5 Jan 2017

This is an extended version of an article published in IEEE International Conference on Data Mining (ICDM) International Workshop on Privacy and Discrimination in Data Mining (PDDM) 2016 [42].

ABSTRACT
Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well as the use of a dedicated linkage unit. Scaling PPRL to more databases (multi-party PPRL) is an open challenge since privacy threats as well as the computation and communication costs for record linkage increase significantly with the number of databases. We thus propose the use of a new encoding method of sensitive data based on Counting Bloom Filters (CBF) to improve privacy for multi-party PPRL. We also investigate optimizations to reduce communication and computation costs for CBF-based multi-party PPRL with and without the use of a dedicated linkage unit. Empirical evaluations conducted with real datasets show the viability of the proposed approaches and demonstrate their scalability, linkage quality, and privacy protection.

Keywords
Record linkage, similarity, privacy, multi-party, communication patterns, secure summation

1. INTRODUCTION
A wide range of real-world applications, including in healthcare, government services, crime and fraud detection, national security, and businesses, require person-related data from multiple sources held by different organizations to be integrated or linked. Integrated data can then be used for data mining and analytics to empower efficient and quality decision making with rich data. Integrating data helps improve the quality of data by identifying and resolving conflicts in data values, enriching data with additional detailed information, and dealing with missing values [2].

The analysis and mining of data integrated across organizations can be used, for example, in health outbreak systems that allow the early detection of infectious diseases before they spread widely around a country or even worldwide. Such an application requires data to be integrated across several sources, including human health data, travel data, consumed drug data, and even animal health data [6]. A second contemporary motivating example is national security applications that integrate data from law enforcement agencies, Internet service providers, businesses, as well as financial institutions to enable the accurate identification of crime and fraud, or of terrorism suspects [25].

In the absence of unique entity identifiers in the databases that are to be linked, personal identifying attributes (such as names and addresses) need to be used for the linkage. Known as quasi-identifiers (QIDs) [41], such attribute values are in general assumed to be sufficiently well correlated with entities to allow accurate linkage. Using such personal information across different organizations, however, often leads to privacy and confidentiality concerns. This problem has been addressed through the development of 'privacy-preserving record linkage' (PPRL) [43] techniques. PPRL aims to conduct linkage using only masked (encoded) QIDs without requiring any sensitive or confidential information to be exchanged and revealed between the organizations involved in the linkage. Generally, masking is conducted on QIDs to transform the original values such that a specific functional relationship exists between the original and the masked values [41]. While there have been many different approaches proposed for PPRL (as reviewed in [43]), most work thus far has concentrated on linking records from only two sources (or parties). As the healthcare and national security examples described above show, linking data from several sources is however commonly required in practical applications.

The drawback of the small number of existing PPRL solutions that can link data from multiple parties is that either (1) they only support exact matching (which classifies sets of records as matches if their masked QIDs are exactly the same and as non-matches otherwise) [14, 17] or (2) they are applicable to QIDs of categorical data only [12, 22]. However, in many PPRL applications QIDs of string data, such as names and addresses, are required. These QIDs often contain errors and variations, which necessitate the use of approximate comparison functions (that are computationally expensive in terms of the number of comparisons) for comparing QIDs. In this paper, we tackle the multi-party PPRL problem by developing an efficient privacy-preserving approach for approximate matching of (masked) QIDs of string data from multiple records.

[Figure 1: two panels showing the databases $D_1$ to $D_4$ of parties $P_1$ to $P_4$, their blocks, and the candidate record sets compared by the Linkage Unit (LU).]
Figure 1: An overview of traditional naïve comparison of candidate record sets masked using BFs (left) and CBFs (right, as will be described in detail in Section 3) from $p = 4$ parties using a LU.
Databases are indexed/blocked to reduce the number of candidate record sets such that only records in the same blocks are compared and classified. Blocks are illustrated by different patterns in Di, with 1 ≤ i ≤ p. n denotes the number of records in the databases (assuming all databases are of equal size) and b denotes the number of blocks generated (assuming blocks of equal size). Independent of the masking function and the communication pattern used (i.e. BFs and direct one-to-one communication in the left figure, and CBFs and ring-based communication in the right figure), the na¨ıve approach results in exponential complexity of b (n4/b4) candidate sets each consisting of p=4 records (one from each party). We propose the use of Counting Bloom Filter (CBF) en- pairs haveto becompared andclassified [43]. Compared to coding, which is a variation of Bloom filter (BF) encod- thequadraticnumberofrecordpairswhenlinkingonlytwo ing[43],toenableefficientandapproximateprivacy-preserving databases(O(n2)),inmulti-partyPPRLthenumberofcan- linkageofmultipledatabases. BFsarebitvectorsintowhich didate record sets increases exponentially with the number valuesarehash-mappedusinghashfunctions(aswedescribe of parties (O(np)), and thus using existing private blocking in Section 2.1). CBFs, on the other hand, are integer vec- techniqueswouldnotsufficientlyreducethenumberofcom- torsthatcontaincountvaluesineachposition. MultipleBFs parisons, as has been empirically studied in several recent can be summarized as a single CBF using the vector addi- approaches [26, 27, 39]. tion operation between BFs. Previous BF encoding-based Figure 1 overviews the na¨ıve computation and commu- PPRLapproaches[8,33]suggestusingalinkageunit(LU), nication of (masked) candidate record sets from multiple which is a dedicated external party that can perform link- parties (p=4) using a LU. 
Independentof the used mask- age by comparing candidate record sets (masked into BFs) ing function (BFs in the left figure and CBFs in the right from all database owners and calculating their similarities figure) and the communication pattern (direct one-to-one to classify them as matches or non-matches. Our hypoth- communication between each party and the LU in the left esis is that, rather than sending BFs from all parties to a figure and ring-based communication among the parties in linkageunit(LU)tocalculatetheirsimilarity,asingleCBF therightfigure,aswillbedescribedinSection2),thena¨ıve foreachcandidaterecordsetgeneratedoverppartiescanbe approach results in exponential complexity. Efficient com- used to calculate their similarity, as illustrated in Figure 1 munication patterns and advanced filtering approaches for (left for BFsand right forCBFs). SinceCBFs contain only multi-partyPPRLthereforeneedtobedevelopedinorderto thesummaryinformation(countvalues)ofmultiplerecords reduce the potentially huge number of comparisons. More- intheirpositionsratherthantheactualindividualbitvalues over, with multiple parties the privacy risk of collusion in- of a single record as in BFs, they provideincreased privacy creases, where a sub-set of parties collude among them in comparedtoBFs,aswediscussinSection6. Tothebestof order tolearn about anotherparty’s(or sub-set of parties’) our knowledge, this privacy aspect of CBFs has so far not privatedata. Bothexamplesforthena¨ıvemethoddescribed been utilized in PPRL. in Figure 1are highly susceptibleto collusion. An additional challenge with multi-party PPRL is that In order to overcome these scalability and privacy chal- complexity increases significantly with multiple parties in lengesofmulti-partyPPRL,weintroducetwoefficientCBF- termsofbothcomputationaleffortsandcommunicationvol- based communication patterns that either use a LU or op- ume. 
Abasicapproachwouldbetosendallmaskedrecords eratesymmetricallywithoutaLU (whereatrustedexternal from allppartiestoaLU thatcancalculatepair-wisesimi- party is not available to act as a LU). The proposed ap- larities betweenmaskedrecords,whichisofO(p2·n2)com- proachescansignificantlyreducethenumberofcomparisons plexity,wherenisthesizeofdatabasesassumingalldatabases required between records in contrast to the na¨ıve all-to-all are of equal size. However, identifying a matching set of comparisons, andtherebyimprovethescalability whilealso records from all p parties is not possible with such a ba- improvingtheprivacy(reducingthelikelihoodofcollusions) sic pair-wise comparison approach. On the other hand, the byarrangingpartiesintoseveralgroupsandbydistributing numberofna¨ıve(all-to-all)comparisonsbetweenrecordsre- computations among parties. quired across all p databases (D , D , ···, D ) is equal to 1 2 p theproductofthesizeofthedatabases(i.e.n×n×···n= Contributions: Ourcontributionsinthispaperare: (1) np). Addressing this complexity challenge, two-step algo- a novel multi-party PPRL protocol based on CBFs and se- rithmshavebeendevelopedwhereinthefirststepaprivate curesummationforefficient,approximate,andprivatelink- indexing/blockingtechniqueisusedtoreducethenumberof age;(2)twovariationsofextendedsecuresummationproto- candidaterecordsetsfromnp tob·np/bp,assumingbblocks cols for improved privacy against collusion among the data ofequalsize. 
Inthesecondsteponlythesecandidaterecord base owners: (a) homomorphic encryption-based and (b) salting-based (usingrandomseedintegers); (3)twoefficient pe et te er communication patterns (with and without a LU) for re- x = 7 1 ducing the comparison space and risk of collusion between x = 5 parties and thereby improving scalability and privacy, re- BFs b1 1 0 1 0 0 0 1 1 1 0 0 1 0 1 z2 = 5 spectively,in multi-partyPPRL;(4) an analysis of thepro- a) b2 1 0 1 0 0 0 1 1 0 0 0 1 0 0 Dice_sim = 2 x 5 tocol in terms of the three properties of PPRL: scalability ( (7+5) = 0.83 (complexity), linkage quality, and privacy; and (5) an em- pe et te pirical evaluation and comparison of our protocol with two z = 5 baseline approaches using large North Carolina Voter Reg- CBF c 2 0 2 0 0 0 2 2 1 0 0 2 0 1 x1+x2= 12 istration (NCVR)[4] datasets. b) Dice_sim = 21 x2 5 Outline: Inthefollowingsectionwedescribetheprelim- ( = 0.83 inaries. InSection3weproposeourprotocolformulti-party PPRLbasedonCBFsandsecuresummation,whereinSec- Figure 2: Similarity calculation of QIDs of two records tion4weweproposetwoextendedsecuresummationproto- maskedusing (a)BFsand(b) CBFs, wherel=14, k=2, colstoimproveprivacyofourapproach,andinSection5we and q=2. introduce two efficient communication patterns to improve scalability and privacy. We analyze our protocol in terms ofcomplexity,linkagequality,andprivacyinSection6,and more expensive with regard to the computation and com- in Section 7 we conduct an empirical study on the NCVR munication complexities though it provides strong privacy datasets to validate these analyses. We provide a review of guarantees and high accuracy [18]. The latter uses efficient relatedworkinSection8. Finally,wesummarizeanddiscuss techniques and, as opposed to SMC techniques, these tech- future research directions in Section 9. niques aim to hide (mask) information about the original values (to preserve privacy) while still allowing to perform 2. 
PRELIMINARYCONCEPTSANDBUILD- approximatematchingbetweenthemaskedvaluesusingthe functional relationship between original and masked data. INGBLOCKS Weproposeanefficientprotocolformulti-partyPPRLus- Inthissection,wedefinetheproblemofmulti-partyPPRL ingperturbation-basedmasking. Inthissection,wedescribe andexplainhowCBFscanbeusedforefficientlycalculating thefourbuildingblocksofourprotocol,andinSection3we similarities (approximate matching) of QID values between presentouralgorithmindetail. Inthefollowingtwosections a set of multiple (two or more) records (held by different we assume a linkage unit (LU) is available to conduct the parties) in PPRL. linkage,andinSection5weproposeavariationwhereaLU WeassumepdatabaseownersP1,P2,···,Ppwiththeirre- is not required to conduct thelinkage usingour protocol. spectivedatabasesD ,D ,···,D (containingsensitiveor 1 2 p confidential identifying information) participate in the pro- 2.1 Bloom filter encoding cess under the honest-but-curious(HBC) [43]. In the HBC Bloom filter (BF) encoding has been used as an efficient model, parties are assumed to follow the protocol without masking(encoding)techniqueinseveralPPRLsolutions[9, deviatingorsendingfalseinformationwhilebeingcuriousto 17, 26, 34, 35, 38, 39, 40]. A BF b is a bit array data i learn about other parties’ data. However, the HBC model structure of length l bits where all bits are initially set to doesnotassumethatthepartiesdonotcolludeamongthem 0. k independent hash functions, h ,h ,...,h , each with 1 2 k tolearnaboutotherparties’data[19]. Wequantifytherisk range 1,...l, are used to map each of the elements s in a ofcollusioninmulti-partyPPRLandthereductionofriskby set S into the BF by setting the bit positions h (s) with j ourcommunicationpatternsinSection6.2. Wealsoassume 1≤j ≤k to1. asetofQIDattributesA,whichwillbeusedforthelinkage, Schnell et al. [34] were the first to propose a method for is common to all these databases. 
We formally define the approximatematchinginPPRLoftwodatabasesusingBFs. problem of PPRL of multipledatabases as follows. Intheirwork,asinourprotocol,thecharacterq-grams(sub- Definition 2.1. Multi-partyPPRL:AssumeP ,...,P stringsoflengthq)ofQIDvaluesinAofeachrecordinthe 1 p databases to be linked are hash-mapped into a BF using k are the p owners (parties) of the databases D ,...,D , re- 1 p independent hash functions. This method of BF encoding spectively. They wish to determine which of their records is known as Cryptographic Long term Key (CLK) encod- R ∈ D , R ∈ D , ..., R ∈ D match based on the 1,i 1 2,j 2 p,k p ing [34]. (masked)QIDsoftheserecordsaccordingtoadecisionmodel TheseBFsaretheneithersenttoaLU thatcalculatesthe C(R ,R , ..., R ) that classifies record sets (R ,R , 1,i 2,j p,k 1,i 2,j similarity ofpairsofBFs,assuggestedbySchnelletal.[34] ...,R )intooneofthetwoclassesMofmatchesandUof p,k andDurhametal.[9],ortheyarepartiallyexchangedamong non-matches. Assuming the HBC adversary model, parties thedatabaseowners todistributivelycalculate thesimilari- P ,...,P arehonest,inthattheyfollowthelinkageprotocol 1 p tiesofBFpairs/sets,asproposedbyLaietal.[17]andVat- steps, while they do not wish to reveal their actual records salan and Christen for two-party [38] and multi-party [39, R ,...,R with any other party. They however are pre- 1,i p,k 40] approaches. Figure 2(a) illustrates the encoding of bi- pared to disclose to each other, or to an external party, the grams (q = 2) of two QID values ‘peter’ and ‘pete’ into actual values of some selected attributes of the record sets l=14 bits long BFs usingk=2 hash functions. that are in class M to allow analysis. 
2.2 Dicecoefficient Maskingfunctionsusedforprivacy-preservingalgorithms can be categorized into two: cryptographic-based secure Any set-based similarity function (such as overlap, Jac- multi-partycomputation(SMC)techniquesandperturbation- card,andDicecoefficient)canbeusedtocalculatethesimi- based techniques [41]. The former approach is generally larityofpairsorsetsof(multiple)BFs. TheDicecoefficient has been used for matching of BFs since it is insensitive to Proof. The Dice coefficient similarity of p BFs (b ,b , 1 2 manymatchingzeros(bitpositionstowhichnoq-gramsare ···, b )isdetermined bythesum of1-bits(Pp x )in the p i=1 i hash-mapped) in long BFs [34]. In future, we aim to inves- denominator of Eq. (1) and the number of common 1-bits tigate how other approximate string comparison functions, (z) in all p BFs in the nominator of Eq. (1). The number suchaseditdistance[30]andJaroandWinkler[10,45],can of 1-bits in a BF b is x = b [1]+b [2]+···+b [l], with i i i i i be extended to calculate the similarity of more than two 1 ≤ i ≤ p. The sum of 1-bits in all p BFs is therefore values. Pp x = Pp b [1]+b [2]+···+b [l]. The value in a i=1 i i=1 i i i bit position β (1 ≤ β ≤ l) of the CBF of these p BFs is Definition 2.2. WedefinetheDicecoefficientsimilarity c[β]=b1[β]+b2[β]+···+bp[β]. Thesumofvaluesinallbit of p BFs (b ,··· ,b ) as: positions of the CBF is Pl c[β] = Pl b [β]+b [β]+ 1 p β=1 β=1 1 2 ···+b [β] which is equal to Pp x =Pp b [1]+b [2]+ p×z p i=1 i i=1 i i Dice sim(b1,···,bp) = Ppi=1xi (1) ·in··a+llbpi[lB].FFsu,ri.teh.e∀r,pi=if1abi[bβit] =po1si,titohnenβc([1β]≤=βP≤pil=)1cboin[βta]i=nsp1. where z is the number of common bit positions that are set Therefore, the common 1-bits (z) that occur in all p BFs to 1 in all p BFs (common 1-bits), and x is the number of canbecalculatedbycountingthenumberofpositionsβ ∈c bit positions set to 1 in bi (1-bits), 1≤i≤i p. 
wherec[β]=p,whilethesumofnumberof1-bits(Ppi=1xi) is calculated by summing the values in bit positions β ∈ c, Pl c[β]. Figure2(a)illustratestheDicecoefficientsimilaritycalcu- β=1 lation of the two QID values ‘peter’ and ‘pete’ masked into IftheLU gets onlytheCBFcthat containsthesummed BFs. values in the bit positions of p BFs instead of the actual p BFs, then the LU can calculate the Dice coefficient of p 2.3 Secure summation BFs using Eq. (2) without learning any information about A secure summation protocol [7] can be used to securely the individual bit positions of each party. As an example, sum the input values of multiple parties (p≥ 3), such that theDicecoefficientcalculation ofthetwoBFsfrom thetwo no party learns the individual values of other parties, but parties P and P in Figure 2(a) is extended by using a 1 2 only the summed value. The input can either be a single CBF and secure summation to calculate the similarity by numericvalueoravectorofnumericvalues. Forexample,p theLU,asshowninFigure2(b). Theinformationgainedby numeric values (v1,v2,··· ,vp) can be securely summed by theLU (and/or databaseowners) from asingle CBF is less using a random numeric value r′ which is sent by a LU. thanpBFs(i.e.CBFsprovideincreasedprivacycomparedto ThefirstpartyP1 thatreceivesr′ calculatesthesummation BFs), as theoretically provenin Section 6.2 and empirically of r′+v1 and sends to P2. This process is repeated until validated in Section 7. the last party sends r′ +v+1+v +···+v to the LU 2 p whichthensubtractsr′ fromthesummedvaluetocalculate 3. MULTI-PARTY PPRLALGORITHM the summation of p values. The protocol employs a ring- Our protocol allows efficient, approximate, and private based communication pattern over all parties which allows theLU tolearn thefinalvalues(Pp v )butnopartywill linkingofrecordsbasedontheirmaskedQIDvaluesinmul- i=1 i tiple databases from p (≥ 2) sources/parties. We first de- learn the values v of the other parties. 
This basic secure i scribeabasicna¨ıveprotocolbasedonCBFs,whichwename summation (BSS) protocol is susceptible to collusion risk as‘NAI’,andinSection5wethenproposeimprovedcommu- amongtheparties. InSection4weproposeextendedsecure nicationpatternsforthisprotocoltomakeitmorescalable. summationprotocolsforimprovedprivacyagainstcollusion. The stepsof ourprotocol are listed below. 2.4 Counting Bloom filters (CBFs) • Step 1: The parties agree upon the following pa- In this section we propose a novel method of calculat- rameter values: the BF length l; the k hashing func- ing thesimilarity of p BFs usingaCBF, which will provide tions h ,...,h to be used; the length (in characters) 1 k increased privacy compared to using BFs for similarity cal- of grams q; a minimum similarity threshold value s t culation, as we will discuss in Section 6.2. A CBF c of p (0 ≤ s ≤ 1), above which a set of records is classi- t (p > 1) BFs is an integer array data structure of the same fied as a match; a private blocking function block(·); length as the BFs (l). It contains the counts of values in theblockingkeys[2]B used for blocking;anda set of each bit position β,1 ≤ β ≤ l over a set of p BFs in its QIDattributes A used for thelinkage. corresponding position, such that c[β] = Pp b [β], where i=1 i • Step 2: Each party P (1 ≤ i ≤ p) individually c[β]isthecountvalueintheβbitpositionoftheCBFcand i applies a private blocking function [43] block(·) to re- b [β] ∈ [0,1] provides the value in the bit position β of BF i ducethenumberof candidate sets of records C (from b . Given p BFs (bit vectors) b with 1 ≤ i ≤ p, the CBF i i Qpn , where n = |D | is the number of records in c can be generated by applying a vector addition operation i i i i D held by party P ), which otherwise becomes pro- between thebit vectors such that c=P b . i i i i hibitiveevenformoderaten andp. Theblock(·)func- i tiongroupsrecordsaccordingtotheirblockingkeyval- Theorem 2.1. 
The Dice coefficient similarity of p BFs ues (BKVs) [3] and only records with the same BKV can be calculated given only a CBF as: (i.e. records in the same block) from different parties p×|{β :c[β]=p,1≤β ≤l}| are then compared and classified using our protocol. Dice sim(b ,··· ,b )= 1 p Pl c[β] A phonetic-based blocking [2] or any of the existing β=1 multi-partyblockingtechniquesforPPRL[26,27]can (2) beused as theblock(·) function. Algorithm 1: SecuresummationofBFvectors. Algorithm2: Similaritycalculationofrecordsets. Input: Input: -DMi : PartyPi’srecordsmaskedintoBFsbi,1≤i≤p -R′: ListofrandomvectorsusedbypartyP1 ortheLU for -R′: ListofrandomvectorsusedbypartyP1 ortheLU securesummationofBFs forsecuresummationofBFs -C: CandidaterecordsetswithCBFs Output: (includingrandomvectors)frompartyPp -C: CandidaterecordsetswithCBFs -st: Minimumsimilaritythresholdtoclassifyrecordsets Output: 1: C={} //initializeC -M: Matchingrecordsets 2: for1≤i≤pdo: 3: if i=1then: //firstparty 1: M={} //initializeM 54:: forcr=ecR∈′[rDecM1]+doD:M[rec] //summation 23:: Cfo.rreccse∈iveCfdroo:m(Pp) //receiveCBFs 6: C[rec]=c 1 4: c=C[cs]−R′[cs] //subtract 78:: elsCe:.sendto(P2) ////soetnhderCpatrotiPes2 5: Dicesim(cs)= p×|{β:Pc[lββ=]=1pc,[1β≤]β≤l}| //Eq.(2) 9: C.receivefrom(Pi−1) //receivefromPi−1 76:: ifMDi.caeppseimnd((c[sc)s,≥Dsictethsiemn:(cs)]) //amatch 10: forcs∈Cdo: 11: forrec∈DM do: 8. for1≤i≤pdo: 12: c=C[cs]i+DM[rec] //addition 9: M.sendto(Pi) i 13: cs=cs.append(rec) 13: C[cs]=c 14: if i6=pthen: 15: C.sendto(Pi+1) //sendCtoPi+1 16: else: tors (i.e. c=R′[rec]+Ppi=1bi−R′[rec]), which is a 17: C.sendto(P1 orLU) //sendCtoP1/LU special caseof vectoraddition,asoutlinedin lines2-4 in Algorithm 2. As shown in lines 5-7, the LU or P 1 then calculates the Dice coefficient similarity of each resulting CBF c following Eq. 
(2) to classify the com- • Step 3: Each party P hash-mapsthe q-gram values paredsetsofrecords(BFs)withinablockintomatches i of QIDsA of each of its ni records in their respective andnon-matchesbased on thesimilarity threshold st. databases D into n BFs (one BF per record in D ) The final similarities of matching sets of records will i i i of length l using the hash functions h1,...,hk. It is besenttoallpartiesPi,with1≤i≤p,inlines8-9in crucial to set the BF related parameters in an opti- Algorithm 2. mal way that balances all three properties of PPRL The basic secure summation protocol (BSS) used in this (complexity,quality,and privacy). Wefurther discuss ’NAI’protocolisvulnerabletocollusionriskamongthepar- parametersettingforBFsusedinourprotocolinSec- ties. Further,the numberof candidatesets to be compared tion 6. for multi-party linkage in this na¨ıve method (NAI) is ex- • Step 4: Inthenextstep,thepartiesinitiateasecure ponential in the number of parties and their dataset sizes, summation protocol to securely perform vector addi- which is prohibitively large to be practically feasible (even tionbetweentheirBFsinordertogenerateaCBFfor with the existing private blocking and filtering approaches each set of candidate records C. This secure summa- employed) [39]. Efficient communication patterns among tion can be initiated by an external linkage unit LU the parties therefore need to be employed in order to make that provides a vector (of length l) of random values multi-party PPRL scalable and viable in real applications orbyoneoftheparties(aswillbediscussedfurtherin thatrequiredataofverylargesizesfrommanypartiestobe Section 5). This process is outlined in Algorithm 1. integrated. 
In the following sections we propose extended secure summation protocols and two efficient communica- Inlines3-6,theLU orthepartythatinitiatedthecom- tion patterns that not only improve the scalability of our munication(weassumeP )sendsoruses,respectively, 1 multi-party PPRL protocol but also make it more secure a random vector R′[rec] for each record rec∈DM to 1 (with less possible collusion among theparties). sum with the party’s BF vector b ∈ DM (i = 1) by i i applying a vector addition operation. The summed 4. EXTENDED SECURE SUMMATION vectorsR′[rec]+b ∈CarethensenttopartyP in i i+1 line7. PartyP ,2≤i≤preceivesthesummedvectors The basic secure summation protocol (BSS) described in i from P (line9) and addsits BF vectorb ∈DM to Section 2.3 is susceptible to collusion risk by the parties i−1 i i eachcandidatesetcs∈Candsendsthesummedvec- involved,where if two ormore parties collude theyare able tors to the next party P . This process is repeated toinfertheinputofanotherparty. Inordertoovercomethis i+1 untilthelast party(i.e. P ) addsits b vectorto each risk (to improve privacy), we propose two extended secure p p received summedvectorR′[rec]+Pp−1b from party summation protocols. i=1 i P and sends the final summed vector back to the p−1 • Homomorphic-basedsecuresummation(HSS):Thepar- LU orP foreachcandidatesetcs,asshowninlines8- 1 tiallyhomomorphicPailliercryptosystem[24]isareli- 17 in Algorithm 1. ablesecuremulti-partycomputation(SMC)technique • Step 5: Finally, either theLU or the first party,P , for performing secure joint computation among sev- 1 that has provided the random vectors R′, subtracts eral parties. In the HSS approach a pair of private R′[rec]fromthefinalsummedvectorR′[rec]+Pp b and public keys is used for encrypting and decrypt- i=1 i as received from the last party P for each candidate ing the individual BF values. Successive encryption p set cs ∈ C. 
This is achieved by subtraction of vec- ofthesamevalueusingthesamepublickeygenerates Ring 1 Ring 2 Ring 3 Ring 4 p1 2 p2 Matchesring 1 6 p3 7 p4 Matchesring 1,2 11 p5 12 p6 Matchesring 1,2,3 14 p7 16 p8 Matches 8 13 17 3 1 4 5 9 10 14 15 18 Rvaecntdoorms Comparison Rvaecntdoorms Comparison Rvaecntdoorms Comparison Rvaecntdoorms Comparison Linkage Unit (LU) Figure3: Sequential(SEQ)communicationpatternforcomparisonofCBFsfromp=8partiesusingaLU,asdescribed in Section 5.1. In this example, the minimum number of parties per ring is set to rm =2. Communication steps are shown as circled numbers. different encrypted values with high probability, and promisingtodeterminepartialmatchesforasub-setofpar- decrypting the encrypted values using a private key ties and consider additional parties only for these partial returns the correct original value. The public key is matches. knowntoallpartieswhiletheprivatekeyisknownonly The parties are first arranged into rings of size r (with to the LU. Each party P receives summed vectors r ≤ p) based on the value for the parameter r , the mini- i m containing encrypted values to which P adds its en- mum numberof parties perring (r≥r ). The value of r i m m cryptedb vector(usingthepublickey)andsendsthe needs to be carefully chosen, as it has a trade-off between i encryptedsummedvectorstothenextparty. Without scalability (complexity) and privacy. The higher the value knowingtheprivatekeyapartyP cannotdecryptthe forr is,thebettertheprivacyoftheprotocol becausethe i m received vectors and therefore colluding with a party resultingCBFsaremoredifficulttoexploitwithaninference to learn another party’s b (with i6=j) would be im- attack (by mapping the CBFs to known values in a global j possible. databasetoinfertheirunderlyingunencodedvalues),aswill be discussed in Section 6.2. 
• Salting-based (using random seed integers) secure summation (SSS): The HSS approach provides a more secure solution than the BSS approach against collusion attacks, but at the cost of an excessive computation overhead, which makes it not scalable to linking multiple large databases. Therefore, we propose the SSS approach to provide security against collusion attacks in an efficient way. Salting has been used to defend against dictionary attacks on one-way hash functions, where an additional string is concatenated with a value to be encrypted [32]. We adopt a similar concept in the SSS approach, where the salting key is an additional random integer used by each party $P_i$ individually when performing the secure summation. The salting key generated and used by each party is sent only to the LU, and therefore a party $P_i$'s BF values cannot be identified by means of collusion among the parties, because without knowing the salting key of $P_i$ its BF values cannot be learned. Since the salting keys are random integer values, performing secure summation of BFs with the salting keys does not add any computation or communication overhead, except for the communication of the salting keys from the parties to the LU.

5. COMMUNICATION PATTERNS

In this section, we propose two variations of improved communication patterns for our protocol based on CBFs for multi-party scenarios with and without a linkage unit (LU). The main idea of these improved communication patterns is to exploit the facts that most candidate sets are true non-matches (due to the class imbalance problem of record linkage [3]), and that a true matching set must have a high similarity between any sub-set of records in that set. Hence for multi-party PPRL applications with many parties it is

On the other hand, higher values of $r_m$ result in lower scalability to large datasets across many parties, because the number of comparisons required per ring increases exponentially with the ring size $r$.

5.1 Sequential communication

In this first proposed communication pattern (SEQ), which requires a LU, the matches found in one ring are compared with the candidate sets in the next ring, resulting in a set of matches from both rings, which will then be compared with the candidate sets in the following ring, and so on. This communication is carried out sequentially until the matches from all rings are found by the LU. Figure 3 illustrates the SEQ approach for four rings with $r = 2$ parties in each ring.

Algorithm 3 details the steps of the SEQ communication pattern. First the parties are grouped into rings using the function group(·) (line 2 in Algorithm 3). Different numbers of parties ($\ge r_m$) can be grouped into different rings. The value for $r_m$ can be agreed upon by the parties to be any value $r_m \ge 2$, at the trade-off between privacy and scalability, as will be discussed in Section 6.

To minimize the number of comparisons that are required, the grouping of parties into rings is ideally done in ascending order of the size of their datasets. In this way, the first ring will generate a smaller number of matches, which then have to be compared with the candidate sets in the following rings. This reduces the computational complexity compared to the larger number of matches that the first ring would generate if its parties were the ones with the largest databases.

A loop is iterated over the rings in line 3 of Algorithm 3, and then the parties in each ring are iterated in line 4. Each party in the ring retrieves its records in line 5, and a loop is iterated over these records in line 7 to append each of them to every candidate record set (from the previous party, except for the first party that appends to empty sets) stored in $C$ (line 8). Next, a secure summation protocol is applied in line 9 using the sec_sum(·) function on the candidate sets of BFs $C$ identified in the ring in order to generate CBFs for each candidate set. In lines 10-12, each CBF $c$ generated is then used to calculate the Dice similarity (Dice_sim) of the candidate set ($cs$), and if Dice_sim($cs$) $\ge s_t$ (line 13) then $cs$ is added to the list of matches $M$ identified in the ring (line 14), which will then be used as an input (line 15) in the next ring.

The risk of collusion between parties in order to identify data about another party can be reduced in this approach by using different BF encodings in different iterations. For example, if the encoding used in ring 1 in Figure 3 is unknown to parties in ring 2, then a collusion between the LU and parties in ring 2 would not reveal sufficient information to infer the actual values masked in the BFs in ring 1. Hence, parties might wish to be grouped with certain other parties in the same ring for better privacy protection.

Algorithm 3: Comparison of CBFs using SEQ (by LU).
Input:
- $D^M_i$: Party $P_i$'s records with RIDs and BFs, $1 \le i \le p$
- $r_m$: Minimum ring size, with $r_m \ge 2$
- $s_t$: Minimum similarity threshold to classify record sets
Output:
- $M$: Matching record sets
1: C = {[]}; M = {[]}    // initialize C, M
2: rings = group([P_1, P_2, ···, P_p], r_m)    // group parties
3: for ring ∈ rings do:    // iterate rings
4:   for i ∈ ring do:    // iterate parties
5:     records = D^M_i.values()    // party P_i's records
6:     for recset ∈ C do:
7:       for rec ∈ records do:
8:         recset.append(rec)    // C of the ring
9:   C′ = sec_sum(C, ring)    // summation
10:  for cs ∈ C′ do:
11:    c = C′[cs]
12:    Dice_sim(cs) = $\frac{p \times |\{\beta : c[\beta] = p,\ 1 \le \beta \le l\}|}{\sum_{\beta=1}^{l} c[\beta]}$    // Eq. (2)
13:    if Dice_sim(cs) ≥ s_t then:    // a match
14:      M.append(cs)
15:  C = M

5.2 Ring by ring communication

In the absence of a trusted LU, as is required by the previously described SEQ communication approach, we propose a ring by ring communication pattern (RBR) for comparing CBFs from multiple parties without using a LU. This method is illustrated in Figure 4 for three rings with $r = 3$ parties in each ring, while Algorithm 4 outlines the steps of RBR in detail. Similar to SEQ, parties are grouped into rings using the group(·) function (line 1 in Algorithm 4). The value for $r_m$ in RBR should be set to $r_m \ge 3$, as a minimum of three parties are required in each ring to perform secure summation without a LU.

Figure 4: Ring by ring (RBR) communication pattern for comparison of CBFs from $p = 9$ parties without a LU, as described in Section 5.2. In this example, the minimum number of parties per ring is $r_m = 3$. Communication steps are shown as circled numbers. [Figure: Ring 1, Ring 2 and Ring 3 are processed in Phase 1; the matches from each ring are combined in Phase 2.]

Algorithm 4: Comparison of CBFs using RBR (without LU).
Input:
- $D^M_i$: Party $P_i$'s records with RIDs and BFs, $1 \le i \le p$
- $r_m$: Minimum ring size, with $r_m \ge 3$
- $s_t$: Minimum similarity threshold to classify record sets
Output:
- $M$: Matching record sets
1: rings = group([P_1, P_2, ···, P_p], r_m)    // group parties
2: for ring ∈ rings do:    // phase 1
3:   C_ring = {[]}; M_ring = {[]}; r = len(ring)
4:   for i ∈ ring do:    // iterate parties
5:     records = D^M_i.values()
6:     for recset ∈ C_ring do:
7:       for rec ∈ records do:
8:         recset.append(rec)    // C in the ring
9:   C′_ring = sec_sum(C_ring, ring)    // summation
10:  for cs ∈ C′_ring do:
11:    c = C′_ring[cs]
12:    Dice_sim(cs) = $\frac{r \times |\{\beta : c[\beta] = r,\ 1 \le \beta \le l\}|}{\sum_{\beta=1}^{l} c[\beta]}$    // Eq. (2)
13:    if Dice_sim(cs) ≥ s_t then:    // a match
14:      M_ring.append(cs)    // M in the ring
15: C = {[]}; M = {[]}    // initialize
16: for ring ∈ rings do:    // phase 2
17:   matches = M_ring.values()
18:   for matchset ∈ C do:
19:     for match ∈ matches do:
20:       matchset.append(match)    // C in all rings
21: C′ = sec_sum(C, [P_1, P_2, ···, P_p])    // summation
22: for cs ∈ C′ do:
23:   c = C′[cs]
24:   Dice_sim(cs) = $\frac{p \times |\{\beta : c[\beta] = p,\ 1 \le \beta \le l\}|}{\sum_{\beta=1}^{l} c[\beta]}$    // Eq. (2)
25:   if Dice_sim(cs) ≥ s_t then:    // a match
26:     M.append(cs)    // M in all rings

The RBR method consists of two phases. In the first phase (lines 2-14 in Algorithm 4), the parties in each ring (lines 2-4) individually perform secure summation among them using sec_sum(·) (line 9) on their sets of candidate records $C_{ring}$ (generated in lines 5-8) to generate the CBFs $C'_{ring}$, and calculate their similarities to identify matches $M_{ring}$ in each ring (as shown in lines 10-14).
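The similarity test in line 12 of Algorithms 3 and 4 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the BFs are summed in the clear rather than with a secure summation protocol, and the function names, the toy bit vectors and the threshold of 0.7 are hypothetical and not taken from the protocol description.

```python
from typing import List


def counting_bloom_filter(bfs: List[List[int]]) -> List[int]:
    """Position-wise sum of equal-length bit vectors.

    Assumption: this is a plain sum, standing in for the secure
    summation (sec_sum) that the protocol would use.
    """
    length = len(bfs[0])
    if any(len(bf) != length for bf in bfs):
        raise ValueError("all Bloom filters must have the same length")
    return [sum(bits) for bits in zip(*bfs)]


def dice_sim(cbf: List[int], num_parties: int) -> float:
    """Dice similarity of the BFs summed into cbf.

    Numerator: num_parties times the number of positions whose count
    equals num_parties (bits set by every party).
    Denominator: the sum of all counts (total number of 1-bits).
    """
    total = sum(cbf)
    if total == 0:
        return 0.0
    common = sum(1 for count in cbf if count == num_parties)
    return num_parties * common / total


# Hypothetical BFs of one candidate set from a ring of three parties.
ring_bfs = [
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 0, 0],
]
cbf = counting_bloom_filter(ring_bfs)
similarity = dice_sim(cbf, len(ring_bfs))
s_t = 0.7  # hypothetical minimum similarity threshold
is_match = similarity >= s_t
```

In this toy candidate set three positions are set by all three parties and the counts sum to 12, so the similarity is 3 × 3 / 12 = 0.75 and the set would be classified as a match under the assumed threshold.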
In the second phase in lines 16-26, all parties then perform secure summation among them on the matches identified in each ring $M_{ring}$ in order to identify matches $M$ from all $p$ parties.

Every ring in the first phase can employ a different set of BF parameters to reduce the possibility of collusion between a set of parties in different rings. In the second phase, all parties then have to agree on another set of parameters for BF encodings of the matches identified in rings in the first phase. In addition, the rings in the first phase can be processed independently and in parallel in a distributed environment, making RBR more scalable (than SEQ) with larger dataset sizes.

6. ANALYSIS OF THE PROTOCOL

In this section we analyze our multi-party PPRL protocol in terms of complexity, privacy, and linkage quality.

6.1 Complexity analysis

We assume $p$ parties participate in the protocol, each having a database of $n$ records. We assume a private blocking/indexing technique employed in the blocking step forms $b \le n$ blocks for each party. In Step 1 of our protocol, the agreement of parameters has a constant communication complexity. Blocking the databases in Step 2 has $O(n)$ computation complexity at each party, and finding the intersection of blocks from all parties has a communication complexity of $O(p \cdot b)$ and a computation complexity of $O(b \cdot \log b)$ at each party, as $p \cdot b$ BKVs need to be securely communicated, and for each of the $b$ BKVs a search operation of $O(\log b)$ is required in order to identify the intersection set. Assuming the average number of q-grams in the QID attributes $A$ of each record is $g$, the masking of QID values of records into BFs of length $l$ using $k$ hash functions for $n$ records in Step 3 is $O(n \cdot g \cdot k)$ at each party.

Steps 4 and 5 consist of the secure summation of BF vectors to calculate the CBFs of candidate sets. Our extended secure summation protocols HSS and SSS aim to improve privacy at the cost of complexity overhead. The extended HSS protocol requires $n \cdot l \cdot p$ encrypted values (long integers of 4 bytes each) to be exchanged among the parties, while the basic secure summation (BSS) and SSS require exchanging $n \cdot l \cdot p$ short integer values (of 2 bytes each), which is more efficient compared to HSS. Further, the homomorphic encryption and decryption functions used in HSS are computationally expensive compared to simple vector addition and subtraction operations [19].

With the simplified assumption that all blocks are of equal size $n/b$, i.e. contain $n/b$ BFs at each party, using the NAI communication method $(n/b)^p$ candidate sets of BFs (i.e. all candidate sets of records in a block) have to be generated in each block and their CBFs calculated. The first party performs $b(n/b)$ summations, the second party $b(n/b)^2$, and finally $b(n/b)^p$ summations are performed by the last party, leading to a total of $O(\sum_{i=1}^{p} b(n/b)^i)$ summations in Step 4. In Step 5, either the linkage unit LU or the first party that initiated the secure summation protocol performs a vector subtraction operation (subtracts the random vectors from the summed vectors) on all the candidate sets, resulting in $b(n/b)^p$ CBFs. The similarities of these CBFs are then calculated and matches classified, which requires $O(b(n/b)^p)$ computations. This combinatorial complexity currently limits the NAI linkage to a small number of parties or a large number of small blocks (i.e. $p$ or $n/b$ has to be small).

The communication patterns SEQ and RBR proposed in Sections 5.1 and 5.2, respectively, improve Steps 4 and 5 significantly (depending on the number of parties per ring, $r$) by reducing the computation and communication complexities. In general, the exponent of the complexity is reduced from $p$ to $r+1$ in SEQ and to $\max(r, p/r)$ in RBR. With the simplified assumption that each ring has $r$ parties, the computation and communication complexities of the SEQ and RBR methods are as follows.

In the SEQ method, $b(n/b)^r$ candidate sets are processed in the first ring and $b(n/b)^{r+1}$ in each of the remaining rings (i.e. $b(n/b)$ matching sets, in the worst case, from the previous ring are compared with the $b(n/b)^r$ sets in each ring), resulting in total computation and communication complexities of $O\left(\sum_{i=1}^{r} b(n/b)^i + \left(\sum_{i=1}^{r+1} b(n/b)^i\right) \cdot (p/r - 1)\right)$ (with $r < p$). The computation and communication complexities of the RBR method are $O\left(\sum_{i=1}^{r} b(n/b)^i \cdot (p/r) + \sum_{i=1}^{p/r} b(n/b)^i\right)$, where the first phase requires $b(n/b)^r$ total candidate sets to be processed in each of the $p/r$ rings and the second phase compares the $b(n/b)$ matching sets from each ring.

Overall, the complexity of the NAI method is $O(b(n/b)^p)$, the SEQ method is $O(b(n/b)^{r+1} \cdot p/r)$, and the RBR method is $O(\max(b(n/b)^r \cdot p/r,\ b(n/b)^{p/r}))$. This theoretical analysis shows that the two proposed communication methods SEQ and RBR are computationally efficient compared to the NAI method. Depending on the values for $p$ and $r$, either of the SEQ and RBR methods can outperform the other.

The memory required for a CBF is $x$ bits for every position in the CBF, where $2^x \ge p$. If the length of the CBF is $l$, the total memory consumption is $l \times \lceil \log_2(p) \rceil$ bits. For $p$ BFs the memory required is 1 bit for every position in each BF, and therefore the total memory consumption is $1 \times l \times p$ bits. A CBF requires relatively more memory than using BFs when $p$ is small; however, with an increasing number of parties $p$ in multi-party PPRL, a CBF requires significantly lower memory compared to BFs. For example, when $p = 5$ and $l = 1{,}000$ the respective memory sizes of a CBF and the corresponding BFs are $1000 \times \lceil \log_2(5) \rceil = 3000$ bits and $1 \times 1000 \times 5 = 5000$ bits, while for $p = 10$ and $l = 1{,}000$ they are $1000 \times \lceil \log_2(10) \rceil = 4000$ bits and $1 \times 1000 \times 10 = 10000$ bits, respectively.

6.2 Privacy analysis

As with most of the existing PPRL approaches, we assume that all parties follow the honest-but-curious (semi-honest) adversary model [43], where the parties follow the protocol while being curious to learn the other parties' data by means of inference attacks on input data [41] or collusion [43]. To analyze the privacy against inference attacks, we discuss what the parties can learn through inference attacks or collusion during the protocol. We assume a trusted LU is available, which does not collude with any parties, as is commonly used in many practical PPRL applications [29].

However, collusion among the database owners is a privacy risk in the basic secure summation protocol, where a set of parties can collude to learn the BF of another party using their received summation values. To overcome this problem, in Section 4 we have proposed two extended secure summation protocols, HSS and SSS. The HSS protocol uses homomorphic encryption for secure summation, which makes the protocol more secure because without knowing the private key (which is only known to the LU) identifying a party's BF values by means of collusion will not be successful. However, this protocol encrypts each integer value in a BF (in total $l$ values for each BF) into a hash key (long integers) and thus incurs a very large communication overhead, making the protocol not viable for linkage of multiple large databases. The SSS protocol similarly makes the protocol more secure by having each party individually add additional integer values as salting keys (known only to the LU) in the secure summation, such that without knowing the salting key value collusion among parties to learn a party's BF is not possible. Compared to HSS, the SSS approach does not incur any expensive communication overhead as the salting keys are small integer values.

Communication occurs among the parties in Step 1 of our protocol (as described in Section 3), where they agree on parameter settings, and in Steps 4 and 5, where they participate in a secure summation protocol. The agreement of parameter settings in Step 1 does not reveal any sensitive information about the underlying data. Secure summation involves partial (masked) data exchange where the parties communicate the partial and full summations of their BFs (1) among them in Step 4 and (2) to the LU or the first party that initiated the communication in Step 5, respectively, to calculate the CBFs of the candidate sets and their similarities.

The LU (in SEQ or NAI) or the first party in each ring (in RBR) receives the CBFs of candidate sets in each ring. Compared to calculating similarities of sets using their BFs directly, using a CBF makes the inference attack on individual BFs, and thus on the q-grams (strings) mapped into them, more difficult. An inference attack allows an adversary to map a list of known values from a global dataset (e.g. q-grams or attribute values from a public telephone directory) to the encoded values (BFs or CBF) using background information (such as frequency) [41, 16]. The only information that can be learned from such an inference attack using a CBF $c$ of a set of $x$ BFs (summed over $x$ parties, where either $x = p$ in NAI and RBR methods or $x = r$ in SEQ and RBR methods) is whether a bit position in $c$ is either 0 or $x$, which means it is set to 0 or 1, respectively, in the BFs from all $x$ parties.

Proposition 6.1. The probability of identifying the original (unencoded) values of $x$ ($x > 1$) individual records $R_i$ (with $1 \le i \le x$) given a single CBF $c$ is smaller than the probability of identifying the original (unencoded) values of $R_i$ given $x$ individual BFs $b_i$, $1 \le i \le x$.

$\forall_{i=1}^{x} \Pr(R_i | c) < \Pr(R_i | b_i)$

Proof. Assume the number of original (unencoded) values that can be mapped to a masked BF pattern $b_i$ from an inference attack is $n_g$. $n_g = 1$ in the worst case, where a one-to-one mapping exists between the masked BF $b_i$ and the original unencoded value of $R_i$. The probability of identifying the original value given a BF in the worst case scenario is therefore $\Pr(R_i | b_i) = 1/n_g = 1.0$. However, a CBF represents $x$ BFs and thus at least (in the worst case) $x$ original (unencoded) values, which leads to a maximum of $\Pr(R_i | c) = 1/x$ with $x > 1$ (when $x = 1$, $c \equiv b_1$). Hence, $\forall_{i=1}^{x} \Pr(R_i | c) < \Pr(R_i | b_i)$.

We will empirically evaluate and compare the amount of privacy provided by masking records into CBFs and BFs against an inference attack in Section 7.5. A larger number of parties in a ring $r$ (i.e. a larger value for $x$) results in an increase in the difficulty of an inference attack (a smaller probability of suspicion $1/x$) by the adversary (the LU or the first party in each ring for SEQ and RBR, respectively) at the cost of more candidate set comparisons.

Further, using different hash encodings $h_1, \cdots, h_k$ by different rings in our SEQ and RBR methods improves privacy compared to the NAI method by reducing the possibilities of collusion, as discussed in Section 5. Since the hash functions used by parties in ring 1, for example, are not known to parties in other rings, a collusion between parties in other rings and/or the LU will not be successful in inferring the original values of parties in ring 1. A careful grouping of parties is therefore required to improve privacy in a multi-party setting (for example, the LU randomly groups parties or changes the grouping for different blocks). More specifically, when $p$ parties are involved in the linkage, the maximum number of possible combinations for collusion in the NAI method is $\sum_{i=1}^{p} (p-1)$, while in SEQ and RBR it is $\sum_{i=1}^{p/r} \sum_{j=1}^{r} (r-1)$. For example, with $p = 9$ parties the NAI method has 72 possibilities to collude, while grouping into equal sized rings ($r = 3$) leads to only 18 different collusion possibilities.

The values for the number of hash functions used ($k$) and the length of the BF ($l$) provide a trade-off between the linkage quality and privacy [34]. The higher the value for $k/l$, the higher the privacy and the lower the quality of linkage, because the number of q-grams mapped to a single bit (and therefore the number of resulting collisions) increases, which leads to lower linkage quality but makes it more difficult for an adversary to learn the possible q-gram combinations [16]. The CLK encoding method (as discussed in Section 2.1) of hash-mapping several QID values from each record into one compound BF [34, 38] makes it even more difficult for an adversary to learn individual QID values that correspond to a revealed bit pattern in a BF.

6.3 Linkage quality analysis

Our protocol supports approximate matching of QID values, in that data errors and variations are taken into account depending on the minimum similarity threshold $s_t$ used. The quality of BF-based masking depends on the BF parameterization [34, 37]. For a given BF length, $l$, and the number of elements $g$ (e.g. q-grams) to be inserted into the BF, the optimal number of hash functions, $k$, that minimizes the false positive rate $f$ (of a collision of two different q-grams being mapped to the same bit position), is calculated as [21] $k = (l/g) \ln(2)$, leading to a false positive rate of

$f = \left(1 / 2^{\ln(2)}\right)^{l/g}$. (3)

While $k$ and $l$ determine the computational aspects of BF masking, linkage quality and privacy will be determined by the false positive rate $f$. A higher value for $f$ will mean a larger number of false matches and thus lower linkage quality [21, 34]. In our experimental evaluation we will set the BF parameters for our approach according to the discussion presented here and following earlier BF work in PPRL [9, 34, 38, 37].

7. EXPERIMENTAL EVALUATION

In this section, we empirically evaluate the performance of our protocol (which we refer to as AM-CBF for Approximate Matching with Counting BFs) with the SEQ, RBR, and NAI communication patterns in terms of the three properties of PPRL: scalability (complexity), linkage quality, and privacy. We describe the competing baseline methods in Section 7.1, the datasets used in Section 7.2, the evaluation measures in Section 7.3, and the experimental settings in Section 7.4. We then discuss the experimental results in Section 7.5.

7.1 Baseline methods

We use Lai et al. [17]'s exact matching BF-based PPRL approach (referred to as EM-BF for Exact Matching with BFs) and Vatsalan and Christen [39, 40]'s approximate matching BF-based PPRL approach (AM-BF for Approximate Matching with BFs) as competing baseline methods to compare with our proposed approach. Since other existing approaches for multi-party PPRL (as reviewed in Section 8) are either based on expensive cryptographic techniques or applicable to categorical data only, we do not compare them with our approach.

Lai et al.'s EM-BF approach [17] performs exact matching of QIDs across multiple parties using BFs. In their approach, the QID values of all records in a dataset are first converted into one BF. Each party then partitions its BF into segments according to the number of parties involved in the linkage, and sends these segments to the corresponding other parties. The segments received by a party are combined using a conjunction (logical AND) operation. The resulting conjuncted BF segments are then exchanged between the parties to construct the full conjuncted BF. Each party checks its own full BF of each record with the conjuncted BF, and if the membership test is successful then the record is considered to be a match. Though the cost of this approach is low, since the computation is completely distributed between the parties and the processing of BFs is fast, the approach can only perform exact matching.

Vatsalan and Christen proposed AM-BF [39, 40] by adapting the idea used in EM-BF of distributively computing the conjunction of a set of BFs from multiple parties to perform privacy-preserving approximate matching for multi-party PPRL. Once the conjuncted BF segments are computed by the respective parties, a secure summation protocol is initiated among the parties to securely sum the number of common 1-bits in the conjuncted BF segments as well as the total number of 1-bits in each party's BF. These two sums are then used to calculate the Dice coefficient similarity of the set of BFs. A filtering approach is employed to reduce the number of comparisons based on segment similarity, such that if a sub-set of BF segments of a candidate set (as calculated by a respective party) has a low similarity, then the BFs do not have to be compared with any of the BFs from the other parties.

Both EM-BF and AM-BF approaches use the NAI method for comparing and classifying candidate sets of records.

7.2 Datasets

To provide a realistic evaluation of our approach, we based all our experiments on a large real-world database, the North Carolina Voter Registration (NCVR) database as available from ftp://alt.ncsbe.gov/data/. This database has been used for the evaluation of various other PPRL approaches [9, 26, 27, 39, 40, 41]. We have downloaded this database every second month since October 2011 and built a combined temporal dataset that contains over 8 million records of voters' names and addresses [4]. We are not aware of any available real-world dataset that contains records from more than two parties that would allow us to evaluate multi-party PPRL approaches. We therefore generated, based on the real NCVR database, a series of sub-sets for multiple parties, as will be described next.

To allow the evaluation of our approach with different numbers of parties, with different dataset sizes, and with data of different quality, we used and modified a recently proposed data corruptor [5, 36] to generate various datasets with different characteristics based on randomly selected records from the NCVR database. The identifiers of the selected and modified records were kept the same, which allows us to identify true and false matches and therefore evaluate linkage quality, as described below. Specifically, we extracted sub-sets of 5,000, 10,000, 50,000, 100,000, 500,000, and 1,000,000 records from the NCVR to generate datasets for 3, 5, 7, and 10 parties, where the number of matching records is set to 50% (i.e. half of the selected records occur in the datasets of all parties).

To evaluate how the approaches work with 'dirty' data, we created several series of datasets for each of the datasets generated above, where we included a varying number of corrupted records into the sets of overlapping records (0%, 20%, and 40%). We applied various corruption functions on randomly selected attribute values, including character edit operations (insertions, deletions, substitutions, and transpositions), and optical character recognition and phonetic modifications based on look-up tables and corruption rules [5]. This means that a certain percentage of records in the overlap were modified for randomly selected parties, while the original values were kept for the other parties. Therefore, some of these records are exact duplicates across some parties in a set, but are only approximately matching duplicates across the other parties in the set. This simulates, for example, the situation where three out of five hospitals have the correct and complete contact details (like name and address) of a certain patient, while in the fourth and fifth hospitals some of the details of the same patient are different.

7.3 Evaluation measures

We evaluate the three properties of PPRL using the following evaluation measures. The complexity (scalability) of linkage is measured by runtime, communication size, and the number of comparisons required for the linkage. The quality of the achieved linkage is measured using the F-measure, calculated on classified matches and non-matches, which has been widely used in record linkage, information retrieval and data mining [2].

In line with other work in PPRL [26, 39, 40, 41], we evaluate privacy using disclosure risk (DR) measures based on the probability of suspicion, i.e. the likelihood that a masked (encoded) database record in $D^M$ can be matched with one or several (masked) record(s) in a publicly available global database $G$. The probability of suspicion for a masked value/record $R^M$, $\Pr(R^M)$, is calculated as $1/n_g$ where $n_g$ is the number of possible matches in $G^M$ to the masked value $R^M$. We conducted a frequency linkage attack [41] by mapping the exchanged bit information in the BFs generated from $D^M$ to the BFs generated from $G^M$. We used the worst case scenario where $G \equiv D$, because when $G \equiv D$ there will be a one-to-one exact matching of a global value for each value in $D$. Based on such a linkage attack, we calculate the following disclosure risk measures, as proposed by Vatsalan et al. [41].
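Before turning to those measures, the probability of suspicion $1/n_g$ defined above can be illustrated with a short Python sketch. This is a simplified sketch and not the frequency linkage attack of [41]: it assumes that $n_g$ is simply the number of records in the masked global database with an identical masked value, and the function name and the toy bit strings are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def probability_of_suspicion(masked_db: List[str],
                             masked_global: List[str]) -> Dict[str, float]:
    """Map each distinct masked value in masked_db to 1/n_g.

    Assumption for this sketch: n_g is the number of records in
    masked_global whose masked value is identical. Values without any
    global match are given a probability of 0.0.
    """
    global_counts = Counter(masked_global)
    return {
        value: (1.0 / global_counts[value]) if global_counts[value] else 0.0
        for value in set(masked_db)
    }


# Toy masked records written as bit strings (hypothetical data).
d_masked = ["1010", "1010", "0111", "1100"]
g_masked = list(d_masked)  # worst case scenario: G is identical to D

suspicion = probability_of_suspicion(d_masked, g_masked)
```

In this toy example the value "1010" occurs twice in the global database and therefore has a probability of suspicion of 0.5, while the two unique values can be re-identified with probability 1.0.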